Wednesday, April 4, 2018

L2 Regularization and neural network "simplicity"

So, this is related to the topic of my last blog post, http://neuralnetworksanddeeplearning.com - and I was initially going to bundle this in there, but it's kinda lengthy, and I didn't want to dilute my whole-hearted recommendation of the book... what follows really is something that started as nit-picking, but led to what (I think) was a better understanding of how L2 regularization works... and it by no means dampens my enthusiasm for the book!

However... for some reason, when reading the book, there was one section that made me pause, and the more I thought about it, the more I came to a different conclusion than the author. It was, as foreshadowed by the title, the section on L2 Regularization and neural network "simplicity".

In it, he essentially makes the claim that L2 Regularization results in a simpler "model".  He spends a fair amount of time discussing simplicity in a larger sense, and examples where simpler explanations are or are not more correct... but never really makes a convincing argument for why the smaller-weight models favored by L2 regularization should be considered "simpler".

He DOES make some good points about why it might favor more generalized models, vs just memorizing noise... and then implies that this therefore makes it "simpler".  The main justification here is an analogy to a situation where you have some noisy data, and can use either a linear approximation or a polynomial fitting. Intuitively, the linear model is both simpler AND more generalized, but I don't know that the two things - simplicity and generalization - always go hand-in-hand, and I would argue that in the case of neural networks and L2 regularization, they don't.

To see why, let's consider one of the ways in which his linear vs polynomial comparison differs from our regularized vs. unregularized comparison: number of variables. His polynomial model essentially has 10 different variables, while his linear model only has one, slope (or two, if you consider offset, though in his pictured example it's 0).  So, another way of looking at it might be to call the simpler network the one with fewer variables.

Ah, you say, what relevance does that have to our regularized vs. unregularized comparison? Don't both of those have the same number of parameters? And, technically, yes, that's true... but consider this: regularization is something that helps a network perform better when it's overfitting... that is, when its number of parameters is relatively large compared to the number of inputs we're training over. So, say we have a situation where regularization is helping; in that case, it's likely that if we take the unregularized version, and simply increase the size of the network (but keep the training set size the same), we'll see relatively small increases in real-world performance... but if we do the same with the regularized one, we might expect to see a bigger impact.  That implies that regularized networks are making "better use of" their parameters... that is, that even though they technically have the same number of parameters, the unregularized one has more "useless" parameters... and I think that's exactly what's happening.

To see why I think L2 regularization helps avoid "useless" parameters, let's take things down to more concrete terms: on the most basic level, if we have a set of 4 weights, then given two distributions of weights, A and B:

A = [.03, .9, .02, .05]
B = [.3, .4, .1, .2]

...then regularization will strongly favor B over A. But without any context, simply looking at the weights, I think most people would say that A is simpler than B - A is effectively saying "the second input is so much more important than the other inputs, we can effectively ignore the rest of them" - that sounds a whole lot simpler than the approach B is taking, which is to effectively say, "while the second IS more important, it's still important to consider all the others as well!"  Without regularization, there's nothing to prevent this from going to extremes, as long as it happens to fit better to the training data - ie, A (the unregularized) result might end up looking like [.0001, .99999, 1e-10, .001] - which can be pretty much modeled by a 1-parameter system, and is fairly "simple" - even though B might only give a 2% worse result on the training data, and still uses all 4 parameters.
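
To put rough numbers on that, here's a minimal sketch of the penalty for each set of weights. It assumes the penalty is just the sum of the squared weights times a strength factor (the names l2_penalty and lam are mine, and the book's version also scales by the training-set size, which I'm ignoring here):

```python
def l2_penalty(weights, lam=1.0):
    """Return lam * (sum of squared weights), ignoring any other scaling."""
    return lam * sum(w * w for w in weights)

A = [.03, .9, .02, .05]
B = [.3, .4, .1, .2]

penalty_a = l2_penalty(A)
penalty_b = l2_penalty(B)

# Print each penalty, rounded for readability
print("A: %.4f" % penalty_a)
print("B: %.4f" % penalty_b)
```

Both sets of weights sum to 1, but A's penalty comes out around 0.81 against 0.30 for B, and nearly all of A's penalty comes from that single .9 weight.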

To put things in a different perspective - let's look at the handwriting-recognition problem. Say we happen to notice that ALL the "9"s in our training sample have a value > .5 in a given pixel. As time goes on, without regularization, our neural network will tend to HEAVILY weight the input from that pixel when deciding if something is a 9, which effectively ends up decreasing the importance of other pixels, or larger patterns.  The regularized approach, on the other hand, will sort of be saying that, "ok, even though that one pixel seems more important on this data set, I don't want to forget the contributions of all the other pixels" - so that, when we feed the network a 9 that is < .5 in that pixel, it is able to cope with that better.  This is a more nuanced approach, and to my mind at least, more complex.

Finally, I would argue that, for most problems we want to use machine learning for, Occam's razor is reversed - the simpler solution is LESS likely to be correct! Indeed, the whole field of machine learning can be thought of as having been birthed by the desire to find more complex solutions - ie, for dealing with problems for which we can't find any simple models.  The problems are so complex that, intuitively, I'm likely to think that the more correct model is also likely more complex*... so, since regularized models tend to give better results for these problems, I'm more inclined to believe they're more complex!

Now, I know that a lot of these arguments are pretty complex and hand-wavey... but to me, they feel closer to the truth of what's happening here... and, I suppose, the real point of all this was that I think it gave me a better intuition on how L2 regularization is likely working!

*I think this heuristic - that machine-learning problems are so complex that the more correct model is also likely more complex - will often hold because the "ground" truth for many of these problems is what a human would say - ie, our basis for comparison is the model mapped in the neurons in our brains, which are incredibly complex.  Of course, there are counterexamples - handwritten digit recognition is largely solved, for instance, with relatively small networks, so the heuristic sort of fails here.  But the standard NIST handwriting recognition problem is also one with a lot of constraints and preconditions, which make it a lot easier to solve - we're presupposing that the images we're fed ARE digits, they're frequently segmented already, we're only considering digits (and not letters, and capital letters, and punctuation), we don't have to find them within larger images, etc.  The more of those preconditions are eliminated, the closer the problem gets to the tasks our brains are actually doing, and the more complex it gets... and the more I will believe that a more correct network is more complex.

Neural Networks and Deep Learning

So, the title of the post serves two purposes - one, to serve notice to this blog (hah - as though anyone reads this!) that these are topics that I've recently become very interested in, and will likely be posting about a lot, and two, to let everyone know about an awesome online book of the same name, http://neuralnetworksanddeeplearning.com

It's a REALLY great resource for people looking to get started with neural networks. Of course, everyone learns in different ways, so I should clarify that I'm someone who likes to get a good mix of practical knowledge and theoretical underpinnings... but if that sounds like you, then I can't recommend this book highly enough. In a relatively short amount of text, he gives a broad enough overview of the field that I really felt I could start diving into the topic - reading papers, and tinkering with code - while still managing to go in depth enough into his topics that I felt I had a decent understanding of how (or why) they worked. It's a rare feat... kudos, Michael Nielsen!

Monday, April 18, 2016

My quick opinions of Eclipse CDT vs CLion

In my last post, I mentioned I found a neat feature in Eclipse, but that I still use CLion most of the time. This raises the obvious question: Why don't I just use Eclipse / CDT full time?

Well, I initially switched to CLion for two main reasons: 1) annoyance with the time it always took getting the IDE set up to recognize all my various include paths, library paths, options, etc... and 2), the fact that CODAN* (the static analyzer in CDT) seems to miss a lot of situations. Still, I decided to give Eclipse another shot.

I mostly work with CMake projects - a given, since I'm using CLion, and its biggest downside is it ONLY works with CMake - so I used the CMake Eclipse project generator. It seemed to work fairly well, which helped with 1)... but longer term, the fact that I would potentially need to re-run it any time the CMake files changed - which in turn would mean I could lose any project settings / changes I made from within Eclipse - is a worry. Plus the fact that I have to run a separate command line tool frequently is a turn off.

Still, those are things I could deal with... except it seemed that 2), CODAN's unreliable, is still a factor. I had only been using it for 5 minutes before coming across a situation which CODAN incorrectly flagged as an error, but which CLion got right. I always find it annoying to have all those extraneous red highlights in my IDE, so back I went to CLion...

...but, I still keep Eclipse open for that Shift+Alt+T action!

*...though, if this is named after the infamous Armada in "The Last Starfighter," thumbs up to that!

IDE shortcuts for adding function definitions

So when I'm writing C++ (or C) code, I constantly find myself either writing a function inside the header (.h) file, and later wanting to move it to the implementation (.cpp) file... or KNOWING I'll want it in the .cpp at the outset, but having to re-type out all the boilerplate in the .cpp again.

I mostly use CLion these days for C++ development... and while it's generally a nice IDE, I was dismayed that I couldn't find a refactor option for this*. Eclipse, however, has one: "Toggle Function Definition". You can get at it by right clicking in the function definition, going to "Refactor" > "Toggle Function Definition"... or by just pressing Shift+Alt+T.

It has a few caveats, however:
  • You have to do each function one at a time 
  • It's a two step process - it will first shift it to outside the class declaration, but still within the header file, as an inline definition... you then have to scroll down to it, click on it again, and move it to the .cpp file. 
  • You have to have a definition; so for cases where I know I'm going to want it in the .cpp at the outset, I have to add an empty definition ({}), then click on it, before moving. 
The last gripe is pretty minor, but the other two cost time... though they STILL save enough time that I find myself keeping a copy of Eclipse open, alongside my copy of CLion, just to use this feature.

Why don't I just use Eclipse / CDT full time? Good question... but my answer started to veer off topic a bit, so I put it in another post..

*Update! A friend showed me that CLion does, indeed, have an intention for this - just go to the declaration, press "Alt-Enter", then choose "Implement function 'foo'" - or, if it's already defined in the header, click on the name, hit "Alt-Enter", then choose "Move function definition to source file".  Huzzah!

Friday, October 26, 2012

How do you find where a python attribute "comes from"?

I work on a python project in the visual-effects industry called pymel, and someone recently asked me about the 'vtx' "attribute" on a mesh object - where it came from, and how they would find that out.  The answer is that it's added using the __getattr__ method on the Mesh class... but this got me thinking - is there a general way to find where a given attribute "comes from?"

When classes use tricks like __getattr__, it's hard to determine - standard methods like using dir or searching through the mro's __dict__ entries won't help.

The only way I could think of to find out it's from a __getattr__ would be to march up the mro chain, looking for __getattr__ methods, and testing them - to see if they return a result for the desired attribute, and at what point that result changes.

So, I wrote a function which does just that... and while it's at it, also checks the __dict__, __slots__, and __getattribute__.  It even does a last-ditch check to see if it's a c-compiled object. It's designed to generally tell you, "where the heck did this attribute come from"?
In order to get all (or at least, most) of the edge cases right, it ended up being way more complex than I'd originally imagined.. but it seems to get things right nearly all of the time. Hopefully it helps someone!
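
To give a flavor of the idea, here's a much-simplified sketch. To be clear, this is NOT the function described above - it skips __getattribute__ and the c-compiled check, only handles plain instances, and ignores a pile of edge cases. The names find_attr_source, Base, and Mesh are made up for illustration (this Mesh is not pymel's):

```python
def find_attr_source(obj, name):
    """Best-effort guess at where obj.<name> comes from.

    Returns a (kind, owner) tuple, or (None, None) if nothing is found.
    """
    # 1) The instance's own __dict__ (objects using __slots__ may lack one)
    if name in getattr(obj, '__dict__', {}):
        return ('instance __dict__', obj)
    # 2) Walk the mro, looking in each class's __dict__
    #    (slots show up here too, as descriptors on the class)
    for cls in type(obj).__mro__:
        if name in vars(cls):
            return ('class __dict__', cls)
    # 3) Test each class's own __getattr__, most-derived class first
    for cls in type(obj).__mro__:
        getter = vars(cls).get('__getattr__')
        if getter is None:
            continue
        try:
            getter(obj, name)
        except AttributeError:
            continue
        return ('__getattr__', cls)
    return (None, None)


# Hypothetical classes, loosely mimicking the 'vtx' situation
class Base(object):
    def __getattr__(self, name):
        if name == 'vtx':
            return 'vertices'
        raise AttributeError(name)

class Mesh(Base):
    radius = 1

m = Mesh()
m.label = 'x'

vtx_source = find_attr_source(m, 'vtx')        # found via Base.__getattr__
radius_source = find_attr_source(m, 'radius')  # found in Mesh.__dict__
label_source = find_attr_source(m, 'label')    # found in the instance __dict__
missing_source = find_attr_source(m, 'nope')   # not found anywhere
```

Note that step 3 actually calls each __getattr__, so any side effects in those methods will be triggered.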

Thursday, July 9, 2009

Getting the type of OS

Just something that came up recently - how to use python to find what OS the user is running.

If you only need the general "type" of os (ie, Windows, Mac OSX, Linux), I've found the best solution is to use os.name first, as its result is guaranteed to be one of only a few values, and then platform.system() to get finer grained results (ie, differentiate between OSX and *nix).

The problem with just using platform.system() is that its return value isn't "guaranteed" to be one of a few values, as the Python community learned when Vista was released. Whereas previous versions of Windows would return "Windows", Vista returns "Microsoft". While this may be considered a bug, it taught me that it's probably safest to just use os.name when possible, and only resort to other methods when this doesn't provide enough information.

Unfortunately, this is the case when trying to differentiate OSX and Unix/Linux - os.name returns 'posix' for both. So there, we resort to platform.system(), which will return 'Darwin' or 'Linux'.
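
Put together, the approach looks something like the sketch below. The function name get_os_type and the labels it returns are just my own choices for illustration; the only real APIs used are os.name and platform.system():

```python
import os
import platform

def get_os_type():
    """Return 'windows', 'mac', 'linux', or 'other' (arbitrary labels)."""
    # os.name only takes a handful of values, so check it first
    if os.name == 'nt':
        return 'windows'
    if os.name == 'posix':
        # 'posix' covers both OSX and Linux, so ask platform for more detail
        system = platform.system()
        if system == 'Darwin':
            return 'mac'
        if system == 'Linux':
            return 'linux'
    # Anything else (other unixes, 'java', etc.) falls through to here
    return 'other'

os_type = get_os_type()
print(os_type)
```

Anything that isn't clearly Windows, OSX, or Linux gets lumped into 'other', which seemed safer than guessing.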

Of course, you may need more fine-grained information, and there are other functions that can help with that - platform.uname(), platform.system_alias(), maya.cmds.about() if Maya is initialized, etc.

Wednesday, July 8, 2009

Mirroring blendshapes, and downloading scripts

So I needed to mirror a blendshape again, and didn't want to do it manually, so (as usual) my first impulse was to write a script to do it... but this time, I actually did the smart thing, and just checked highend3d for one.

I found a bunch, of course. And settled on this one - ntMirrorBlendShape by Nelson Teixeira - solely because it had the most downloads. It does exactly what I needed, though, so thanks Nelson!

However, my pull to write my own solution - even in cases such as this where the smarter option is clearly to use existing code - never seems to go away. For instance, I inevitably find myself wanting features not already present - in this case, topological symmetry, ala that in Mudbox / Silo / Zbrush. (And, in this case, it wouldn't be that hard to implement, since I've already written code that does the hard part - topological matching of two non-mirrored meshes - that I originally wrote to fix cases where vert order somehow got messed up. Adapting it for use with symmetry should be pretty simple...) Or mirroring across axes other than X. Never mind that I don't actually NEED those features to solve my initial problem... the itch persists.

For now, I've been able to remind myself that time is of the essence - I really need to finish up my demo reel - but I know the problem will pop up again at some point. Do you think anyone makes some sort of soothing spray for Scripter's Itch?