Sunday, May 6, 2018

Wavenet Paper: impressions

So, I'm going to start posting my impressions of some deep-learning papers I've read, mostly as a way of encouraging myself to do more (because, you know, if it doesn't exist online, it didn't happen!), but also to provide a place for me to make notes and ask questions.  So here goes...

Recently read the paper WaveNet: A Generative Model for Raw Audio, by Aaron van den Oord et al. (2016-09-19) - you can find a webpage describing it at a higher level, with results, here.

Quick summary - they have come up with a new method for generating convincing human speech, given an input text, that matches a given speaker identity... and the results are really impressive!  I, for one, don't know if I would be able to distinguish it from actual recorded human speech!

Of course, I should note, based on the samples, the training datasets seem to mostly consist of professional voice actors reading text samples in... well, a somewhat flat, professional sort of way... like they were going to be used to give google-maps driving directions, or something similar.  Words are all clearly enunciated, and sentences seem to have pretty well-delineated endings.  It's possible that it might be harder to emulate, say, conversational speech - which is a lot more free flowing - or speech with a more dramatic intent.  I suspect the algorithm would struggle to apply the right emotional cues to the correct situations, for instance, or might stray more from believability the longer the continuous sample it had to generate.

Still - what it does is VERY impressive... it adds inflection and emphasis at believable places, and even adds things like audible breaths.  And the model is general enough that they were also able to use it to generate music - ie, "original" piano compositions.  Here we start to see some of the limitations of the technique: the samples convincingly sound like a piano being played, but fail to give the impression of a larger composition.  They're a bit like a conversation with someone with severe short term memory problems, or like a dream: there's continuity / cohesion on short time scales, but if you take a step back and look at it on a larger time scale, there are large shifts and not much overall consistency.

Anyway - as for the nitty-gritty in the paper itself: the basic idea is that they use a 1-D convolutional network... with the restriction that the convolutional filter only extends backwards in time. This way, it can be used to generate new audio, one sample at a time, by generating a new sample, then shifting the convolutional input point forward one sample (to use your newly-generated sample), reapplying to generate the next sample, etc.  It's a bit like one of my favorite scenes from Wallace and Gromit, where he's riding a train, and laying down the track before him as he goes:




They call this a "causal" convolution... which makes sense, unless your poor brain for some reason keeps reading that as "casual", and you spend half the paper wondering why they think their approach is so informal...not that I would do that... 😙♫
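To make the "lay down the track as you go" loop a bit more concrete, here's a minimal sketch of one-sample-at-a-time generation with a causal filter. To be clear, this is a toy and not the paper's model: the real network is deep, nonlinear, and predicts a distribution over the next sample rather than a single value. The function names and the two filter weights below are made up purely for illustration.

```python
import numpy as np

def causal_conv_last(signal, kernel):
    """Filter output at the newest time step, looking only backwards in time.

    kernel[-1] multiplies the most recent sample and kernel[0] the oldest;
    anything before the start of the signal is treated as zero.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k), signal])
    return float(np.dot(padded[-k:], kernel))

def generate(seed, kernel, n_new):
    """Append n_new samples, each computed from the samples before it."""
    samples = list(seed)
    for _ in range(n_new):
        next_sample = causal_conv_last(np.array(samples), kernel)
        samples.append(next_sample)  # the new sample becomes input for the next step
    return np.array(samples)

kernel = np.array([0.5, 0.5])  # hypothetical weights: average of the last two samples
out = generate([1.0, 0.0], kernel, n_new=3)
print(out)  # [1.    0.    0.5   0.25  0.375]
```

Each pass through the loop only ever reads samples that already exist, which is exactly the restriction that makes the convolution "causal".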

In order to increase the "receptive area" of a node, without adding too many layers, they use a technique called "dilation," which I initially mistook as just a fancy term for "stride," but there's a key difference.  To explain the difference, they show this image for a non-dilated convolutional network:



...and this one for a dilated network:

Now, on viewing the second, I thought, "Ah, that's just a convolutional network, with a stride of 2!"  However, on re-reading, I noticed this line (emphasis mine):

"This is similar to pooling or strided convolutions, but here the output has the same size as the input."

I had thought that all the orange circles besides the rightmost one, and all the dotted lines connected to them, represented the graphs we would get if we were generating a different time sample.  All the circles that didn't have a bold arrow connecting to them, then, were not nodes that actually existed in the evaluation of the layer, but were only shown to illustrate their "place" if we were evaluating at a different time step.  This meant that the number of nodes in the layer went from 16 to 8 to 4 to 2 to 1... a classic "stride-by-2" layout.  This is incorrect - ALL the outputs are used when evaluating at the current time step, and all nodes in all the layers exist, for this time-step... which means the "dilation" really is "skipping N nodes".
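Here's a small sketch of the difference, assuming two-tap filters and dilations of 1, 2, 4 and 8 (which is what I read off the paper's figure). The helper names and the all-ones weights are my own, chosen just to make the counting easy to follow.

```python
import numpy as np

def dilated_causal_layer(x, w_old, w_new, dilation):
    """Two-tap causal layer: y[t] = w_old * x[t - dilation] + w_new * x[t].

    Left-pads with zeros, so the output has the same length as the input.
    """
    padded = np.concatenate([np.zeros(dilation), x])
    return w_old * padded[:len(x)] + w_new * x

def strided_layer(x, w_old, w_new):
    """Two-tap layer with stride 2: each output consumes a fresh pair of inputs."""
    return w_old * x[0::2] + w_new * x[1::2]

# Dilated stack: every layer keeps all 16 nodes.
y = np.ones(16)
dilated_lengths = []
for d in (1, 2, 4, 8):
    y = dilated_causal_layer(y, 1.0, 1.0, d)
    dilated_lengths.append(len(y))

# Strided stack: the layer width halves each time.
z = np.ones(16)
strided_lengths = []
for _ in range(4):
    z = strided_layer(z, 1.0, 1.0)
    strided_lengths.append(len(z))

print(dilated_lengths)  # [16, 16, 16, 16]
print(strided_lengths)  # [8, 4, 2, 1]
print(y[-1])            # 16.0
```

With all-ones inputs and weights, the newest output of the dilated stack comes out as 16.0, meaning each of the 16 input samples reaches it exactly once. So the receptive field is the same 16 samples a stride-2 stack would give you, but every layer still has a value at every time step.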

This is an interesting approach, compared with classic strides... basically, it's a way of getting the receptive-area-expansion effect of strides, but without the down-rezzing.  The downside, obviously, is that it makes for a much larger, more complicated graph, which presumably takes longer to train, etc.  Still... assuming I can make graphs using this technique that will fit into my graphics card's memory, it would be interesting to experiment with using this instead of strides in some 2D pixel networks (ie, style-transfer); I feel the commonly-used strides are likely introducing some boxy artifacts, and perhaps this approach will help alleviate that.

Wednesday, April 4, 2018

L2 Regularization and neural network "simplicity"

So, this is related to the topic of my last blog post, http://neuralnetworksanddeeplearning.com - and I was initially going to bundle this in there, but it's kinda lengthy, and I didn't want to dilute my whole-hearted recommendation of the book... what follows really is something that started as nit-picking, but led to what (I think) was a better understanding of how L2 regularization works... and it by no means dampens my enthusiasm for the book!

However... for some reason, when reading the book, there was one section that made me pause, and the more I thought about it, the more I came to a different conclusion than the author. It was, as foreshadowed by the title, the section on L2 Regularization and neural network "simplicity".

In it, he essentially makes the claim that L2 Regularization results in a simpler "model".  He spends a fair amount of time discussing simplicity in a larger sense, and examples where simpler explanations are or are not more correct... but never really makes a convincing argument for why the smaller-weight models favored by L2 regularization should be considered "simpler".

He DOES make some good points about why it might favor more generalized models, vs just memorizing noise... and then implies that this therefore makes it "simpler".  The main justification here is an analogy to a situation where you have some noisy data, and can use either a linear approximation or a polynomial fitting. Intuitively, the linear model is both simpler AND more generalized, but I don't know that the two things - simplicity and generalization - always go hand-in-hand, and I would argue that in the case of neural networks and L2 regularization, they don't.

To see why, let's consider one of the ways in which his linear vs polynomial comparison differs from our regularized vs. unregularized comparison: number of variables. His polynomial model essentially has 10 different variables, while his linear model only has one, slope (or two, if you consider offset, though in his pictured example it's 0).  So, another way of looking at it might be to call the simpler network the one with fewer variables.

Ah, you say, what relevance does that have to our regularized vs. unregularized comparison? Don't both of those have the same number of parameters? And, technically, yes, that's true... but consider this: regularization is something that helps a network perform better when it's overfitting... that is, when its number of parameters is relatively large compared to the number of inputs we're training over. So, say we have a situation where regularization is helping; in that case, it's likely that if we take the unregularized version, and simply increase the size of the network (but keep the training set size the same), we'll see relatively small increases in real-world performance... but if we do the same with the regularized one, we might expect to see a bigger impact.  That implies that regularized networks are making "better use of" their parameters... that is, that even though they technically have the same number of parameters, the unregularized one has more "useless" parameters... and I think that's exactly what's happening.

To see why I think L2 regularization helps avoid "useless" parameters, let's take things down to more concrete terms: on the most basic level, if we have a set of 4 weights, then given two distributions of weights, A and B:

A = [.03, .9, .02, .05]
B = [.3, .4, .1, .2]

...then regularization will strongly favor B over A. But without any context, simply looking at the weights, I think most people would say that A is simpler than B - A is effectively saying "the second input is so much more important than the other inputs, we can effectively ignore the rest of them" - that sounds a whole lot simpler than the approach B is taking, which is to effectively say, "while the second IS more important, it's still important to consider all the others as well!"  Without regularization, there's nothing to prevent this from going to extremes, as long as it happens to fit better to the training data - ie, the A (unregularized) result might end up looking like [.0001, .99999, 1e-10, .001] - which can be pretty much modeled by a 1-parameter system, and is fairly "simple" - even though B might only give a 2% worse result on the training data, and still uses all 4 parameters.
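For the curious, here's the arithmetic behind "strongly favor", as a quick sketch. I'm only computing the sum of squared weights; the scaling factor in front of the penalty is left out, since it would be identical for A and B. The function name is just one I picked.

```python
A = [0.03, 0.9, 0.02, 0.05]
B = [0.3, 0.4, 0.1, 0.2]

def l2_penalty(weights):
    """Sum of squared weights (the L2 term, without its scaling factor)."""
    return sum(w * w for w in weights)

penalty_a = l2_penalty(A)
penalty_b = l2_penalty(B)

print(round(sum(A), 4), round(sum(B), 4))      # 1.0 1.0
print(round(penalty_a, 4), round(penalty_b, 4))  # 0.8138 0.3
```

Both sets of weights sum to 1, but A pays a penalty roughly 2.7 times larger than B, almost all of it from the single .9 weight. Squaring punishes one big weight much more than several medium ones.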

To put things in a different perspective - let's look at the handwriting-recognition problem. Say we happen to notice that ALL the "9"s in our training sample have a value > .5 in a given pixel. As time goes on, without regularization, our neural network will tend to HEAVILY weight the input from that pixel when deciding if something is a 9, which effectively ends up decreasing the importance of other pixels, or larger patterns.  The regularized approach, on the other hand, will sort of be saying that, "ok, even though that one pixel seems more important on this data set, I don't want to forget the contributions of all the other pixels" - so that, when we feed the network a 9 that is < .5 in that pixel, it is able to cope with that better.  This is a more nuanced approach, and to my mind at least, more complex.
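A toy version of this "don't forget the other inputs" behavior, with a big caveat: this is plain linear regression with an L2 penalty (ridge regression), not a neural network, and the data is made up so that two inputs carry exactly the same information. The names below are my own. It's only meant to show which way the L2 term pushes.

```python
import numpy as np

# Hypothetical data: the two input columns are identical copies of each other.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Two weight vectors that both fit the training data perfectly:
lopsided = np.array([2.0, 0.0])  # "only the first input matters"
even = np.array([1.0, 1.0])      # "both inputs matter equally"
print(X @ lopsided, X @ even)    # [2. 4. 6.] [2. 4. 6.]
print(np.sum(lopsided ** 2), np.sum(even ** 2))  # 4.0 2.0

w_ridge = ridge_fit(X, y, lam=1.0)
print(np.round(w_ridge, 4))      # [0.9655 0.9655]
```

Without the penalty, the training data can't tell the lopsided and even solutions apart. With it, the fit lands on the one that shares the weight across both inputs, which is the same kind of hedging I'm describing for the "9" pixel.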

Finally, I would argue that, for most problems we want to use machine learning for, Occam's razor is reversed - the simpler solution is LESS likely to be correct! Indeed, the whole field of machine learning can be thought of as having been birthed by the desire to find more complex solutions - ie, for dealing with problems for which we can't find any simple models.  The problems are so complex that, intuitively, I'm likely to think that the more correct model is also likely more complex*... so, since regularized models tend to give better results for these problems, I'm more inclined to believe they're more complex!

Now, I know that a lot of these arguments are pretty complex and hand-wavey... but to me, they feel closer to the truth of what's happening here... and, I suppose, the real point of all this was that I think it gave me a better intuition on how L2 regularization is likely working!

*I think this heuristic - that machine-learning problems are so complex that the more correct model is also likely more complex - will often hold because the "ground" truth for many of these problems is what a human would say - ie, our basis for comparison is the model mapped in the neurons in our brains, which are incredibly complex.  Of course, there are counterexamples - handwritten digit recognition is largely solved, for instance, with relatively small networks, so the heuristic sort of fails here.  But the standard NIST handwriting recognition problem is also one with a lot of constraints and preconditions, which make it a lot easier to solve - we're presupposing that the images we're fed ARE digits, they're frequently segmented already, we're only considering digits (and not letters, and capital letters, and punctuation), we don't have to find them within larger images, etc.  The more of those preconditions are eliminated, the closer the problem gets to the tasks our brains are actually doing, and the more complex it gets... and the more I will believe that a more correct network is more complex.

Neural Networks and Deep Learning

So, the title of the post serves two purposes - one, to serve notice to this blog (hah - as though anyone reads this!) that these are topics that I've recently become very interested in, and will likely be posting about a lot, and two, to let everyone know about an awesome online book of the same name, http://neuralnetworksanddeeplearning.com

It's a REALLY great resource for people looking to get started with neural networks. Of course, everyone learns in different ways, so I should clarify that I'm someone who likes to get a good mix of practical knowledge and theoretical underpinnings... but if that sounds like you, then I can't recommend this book highly enough. In a relatively short amount of text, he gives a broad enough overview of the field that I really felt I could start diving into the topic - reading papers, and tinkering with code - while still managing to go in depth enough into his topics that I felt I had a decent understanding of how (or why) they worked. It's a rare feat... kudos, Michael Nielsen!