Sunday, May 6, 2018

Wavenet Paper: impressions

So, I'm going to start posting my impressions of some deep-learning papers I've read, mostly as a way of encouraging myself to do more (because, you know, if it doesn't exist online, it didn't happen!), but also to provide a place for me to make notes and ask questions.  So here goes...

Recently read the paper WaveNet: A Generative Model for Raw Audio, by Aäron van den Oord et al. (2016-09-19) - you can find a webpage describing it at a higher level, with results, here.

Quick summary - they have come up with a new method for generating convincing human speech, given an input text, that matches a given speaker identity... and the results are really impressive!  I, for one, don't know if I would be able to distinguish it from actual recorded human speech!

Of course, I should note, based on the samples, the training datasets seem to mostly consist of professional voice actors reading text samples in... well, a somewhat flat, professional sort of way... like they were going to be used to give Google Maps driving directions, or something similar.  Words are all clearly enunciated, and sentences seem to have pretty well-delineated endings.  It's possible that it might be harder to emulate, say, conversational speech - which is a lot more free-flowing - or speech with a more dramatic intent.  I suspect the algorithm would struggle to apply the right emotional cues to the correct situations, for instance, or might stray further from believability the longer the continuous sample it had to generate.

Still - what it does is VERY impressive... it adds inflection and emphasis at believable places, and even adds things like audible breaths.  And the model is general enough that they were also able to use it to generate music - i.e., "original" piano compositions, for instance.  Here we start to see some of the limitations of the technique: the generated pieces convincingly sound like a piano being played, but fail to give the impression of a larger composition.  They're a bit like a conversation with someone with severe short-term memory problems, or like a dream: there's continuity / cohesion on short time scales, but if you take a step back and look at it on a larger time scale, there are large shifts and not much overall consistency.

Anyway - as for the nitty-gritty in the paper itself: the basic idea is that they use a 1-D convolutional network... with the restriction that the convolutional filter only extends backwards in time.  This way, it can be used to generate new audio, one sample at a time, by generating a new sample, then shifting the convolutional input point forward one sample (to use your newly-generated sample), reapplying to generate the next sample, etc.  It's a bit like one of my favorite scenes from Wallace and Gromit, where he's riding a train, and laying down the track before him as he goes:




They call this a "causal" convolution... which makes sense, unless your poor brain for some reason keeps reading that as "casual", and you spend half the paper wondering why they think their approach is so informal...not that I would do that... 😙♫
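To make that generate-shift-repeat loop concrete, here's a minimal sketch in plain Python.  It is a toy, not the paper's model: I'm assuming a single linear causal filter with hand-picked weights, whereas the real network is a deep stack that predicts a probability distribution for each sample.  The function names (causal_conv_step, generate) are just ones I made up for illustration.

```python
def causal_conv_step(history, weights):
    """Predict the next sample from the most recent len(weights) samples.

    weights[-1] multiplies the newest sample, weights[0] the oldest one
    in the window. Nothing from the future is ever read.
    """
    k = len(weights)
    # Left-pad with zeros if we don't have k past samples yet.
    window = [0.0] * max(0, k - len(history)) + list(history[-k:])
    return sum(w * x for w, x in zip(weights, window))


def generate(seed, weights, n_new):
    """Autoregressive loop: each new sample is fed back in as input."""
    samples = list(seed)
    for _ in range(n_new):
        samples.append(causal_conv_step(samples, weights))
    return samples


# Toy usage: with weights [1, 1] each sample is the sum of the previous two.
print(generate([1.0, 1.0], [1.0, 1.0], 3))  # [1.0, 1.0, 2.0, 3.0, 5.0]
```

The point is just the shape of the loop: the filter only ever looks backwards, so every sample it produces can immediately become input for the next one (laying down the track as you go).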

In order to increase the "receptive field" of a node, without adding too many layers, they use a technique called "dilation," which I initially mistook for just a fancy term for "stride," but there's a key difference.  To explain the difference, they show this image for a non-dilated convolutional network:



...and this one for a dilated network:

Now, on viewing the second, I thought, "Ah, that's just a convolutional network, with a stride of 2!"  However, on re-reading, I noticed this line (emphasis mine):

This is similar to pooling or strided convolutions, but here the output has the same size as the input.

I had thought that all the orange circles besides the rightmost one, and all the dotted lines connected to them, represented the graphs we would get if we were generating a different time sample.  All the circles that didn't have a bold arrow connecting to them, then, were not nodes that actually existed in the evaluation of the layer, but were only shown to illustrate their "place" if we were evaluating at a different time step.  This meant that the number of nodes in the layer went from 16 to 8 to 4 to 2 to 1... a classic "stride-by-2" layout.  This is incorrect - ALL the outputs are used when evaluating at the current time step, and all nodes in all the layers exist, for this time-step... which means the "dilation" really is "skipping N nodes".
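Here's a small sketch of the difference, in plain Python, under some assumptions of mine: a kernel size of 2, weights of [1, 1], zero-padding on the left, and 16 input samples with dilations 1, 2, 4, 8 to match the counts above.  The function names are my own, not anything from the paper.

```python
def dilated_causal_conv(x, w, dilation):
    """Kernel-size-2 causal conv: y[t] = w[0]*x[t - dilation] + w[1]*x[t].

    Inputs from before the start of the signal count as zero, so the
    output always has the same length as the input.
    """
    out = []
    for t in range(len(x)):
        past = x[t - dilation] if t >= dilation else 0.0
        out.append(w[0] * past + w[1] * x[t])
    return out


def strided_conv(x, w, stride=2):
    """Kernel-size-2 conv with a stride: one output per `stride` inputs."""
    return [w[0] * x[t] + w[1] * x[t + 1] for t in range(0, len(x) - 1, stride)]


x = [1.0] * 16   # 16 input samples, all ones
w = [1.0, 1.0]   # summing weights, so outputs count how many inputs they "see"

# Dilated stack: every layer keeps all 16 positions.
h = x
dilated_lengths = [len(h)]
for d in (1, 2, 4, 8):
    h = dilated_causal_conv(h, w, d)
    dilated_lengths.append(len(h))

# Strided stack: every layer halves the number of positions.
s = x
strided_lengths = [len(s)]
while len(s) > 1:
    s = strided_conv(s, w)
    strided_lengths.append(len(s))

print(dilated_lengths)  # [16, 16, 16, 16, 16]
print(strided_lengths)  # [16, 8, 4, 2, 1]
print(h[-1], s[0])      # 16.0 16.0
```

Both stacks end with an output that has summed all 16 inputs, so the reach backwards in time is the same.  The difference is that the dilated version still has an output at every time position in every layer, while the strided one has thrown most of them away.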

This is an interesting approach, compared with classic strides... basically, it's a way of getting the receptive-field-expansion effect of strides, but without the down-rezzing.  The downside, obviously, is that it makes for a much larger, more complicated graph, which presumably takes longer to train, etc.  Still... assuming I can make graphs using this technique that will fit into my graphics card's memory, it would be interesting to experiment with using this instead of strides in some 2D pixel networks (i.e., style transfer); I feel the commonly-used strides are likely introducing some boxy artifacts, and perhaps this approach will help alleviate that.
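For a rough feel of that trade-off, here's some back-of-the-envelope arithmetic in Python.  The assumptions are mine: one channel, a kernel size of 2, and "graph size" measured simply as the number of outputs each layer has to hold.  The helper names are made up for this sketch.

```python
def receptive_field(kernel_size, dilations):
    """Input samples visible to one output of a stack of dilated convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def outputs_dilated(n_inputs, n_layers):
    """Each dilated layer keeps one output per input position."""
    return n_inputs * n_layers


def outputs_strided(n_inputs, n_layers, stride=2):
    """Each strided layer divides the number of positions by `stride`."""
    total, n = 0, n_inputs
    for _ in range(n_layers):
        n //= stride
        total += n
    return total


print(receptive_field(2, [1, 2, 4, 8]))  # 16
print(outputs_dilated(16, 4))            # 64
print(outputs_strided(16, 4))            # 15
```

So for the 16-sample, 4-layer picture, the dilated stack holds 64 layer outputs versus 15 for stride-by-2, and the gap widens as the input gets longer.  That's only a crude proxy for memory and training cost (real layers have many channels), but it matches the intuition that you pay for keeping full resolution.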