Sunday, November 4, 2012

Our first competition, cont. (Wrapping it all up - actually)

So this is our final post on the handwriting recognition competition! It was swell, but now it’s time to say goodbye...


In summary: out of the four algorithms we tried, k-nearest neighbors was the most accurate (95%), followed by our neural network (93%). Our logistic classifier was the next most accurate (81%), followed closely by our linear average classifier (80%). We tried parameter optimization for the logistic classifier and the neural network, and received 0.2% and 2% accuracy boosts, respectively.


Did we try all we could have tried? No. If we were really serious about the performance of our algorithms, there’s a lot we could have investigated - better optimizations, performance enhancements, and more advanced techniques. For us, though, the purpose of this competition was to get an introduction to machine learning. We implemented 4 different algorithms from scratch, learned how to test algorithmic performance, and even tackled some machine learning system design with parameter optimization - so I’d say that goal was more than accomplished. For our next project, we’ll be sure to try more advanced techniques, think more carefully about system design, and test any hypotheses we have for improving our system.


Here are some major things we learned along the way:
  • More complicated does not mean better. We saw this with the k-nearest neighbor classifier, which was conceptually simple and easy to implement. It ended up outperforming all of our other algorithms, despite the others (especially the neural network) being more complex.
  • The 80-20 rule really applies. We coded our first algorithm in under an hour, and right away that gave us 80% accuracy. The rest of our time was spent trying to improve on this. When we tried optimizing parameters, we didn’t get much improvement either. However, there’s a huge difference between 95% accuracy and 80% accuracy. Consider how the USPS uses handwriting recognition to read addresses off of letters. Imagine if their classifier only had 80% accuracy - that would be a lot of letters delivered to wrong addresses. Even if it had 99% accuracy - if they deliver millions of letters a day that’s tens of thousands of letters that get incorrectly sent. If precision matters, every tenth or hundredth of a percent counts.
  • Visualizing results is helpful. When dealing with big data, little bugs can go unnoticed, especially if they don’t seem to have much effect on the system. Sometimes the best solution is to slow down and watch as the computer trains or tests each sample individually. With digit recognition, it helped a lot to display the picture of each training or test sample as the algorithm classified it.

If we were to make our recognizer more accurate, here’s some things we would try:

  • Optimize better. We could try increasing the number of different values we test over for each parameter, or we could try optimizing over different parameters.
  • Image preprocessing. We tried a bit of this at the end, but without much success. To decrease the variation between samples, we could rotate each sample to be facing the same direction, remove any border on the samples, etc. Less variation means more accurate detection.
  • Introduce artificial training samples. More training samples means more accuracy, and often this will produce better results than trying to improve the algorithm. Though we used all the Kaggle training samples, we could have made some of our own. One way to do this is to make modified copies of the existing samples. For example, it is possible that some digits had thicker line widths than others. If we take each sample, make 3 copies of it with varying line widths, and use all of these samples for training, we could multiply our training sample size by 4. All of the artificial examples would be plausible as real examples, so this could increase our accuracy (see the sketch after this list).
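Here’s a minimal sketch of the line-width idea, assuming each sample is already a 28x28 numpy array (higher values = darker, as in the Kaggle data) and using scipy’s morphology functions; the helper name is our own, not something in our repo:

```python
import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

def augment_line_width(image_28x28):
    """Return the original 28x28 digit plus a thicker and a thinner copy.

    Just a sketch of the augmentation idea: grey_dilation spreads the dark
    (high-valued) pixels to fatten the strokes, grey_erosion slims them.
    """
    thicker = grey_dilation(image_28x28, size=(2, 2))
    thinner = grey_erosion(image_28x28, size=(2, 2))
    return [image_28x28, thicker, thinner]

# Every real sample now yields several plausible training samples.
```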

Well, that’s all we have on handwriting recognition. Next up: identifying dark matter halos!!!
(And in case you’re wondering, we are currently placed 429th for the handwriting recognition competition. It’s a popular one!)




Friday, November 2, 2012

Our first competition, cont. (Part 4: Neural Networks)


At this point we’ve attempted fairly simple machine learning algorithms. Time to bring out the big guns...

Neural networks are one of the most commonly used algorithms for handwriting recognition. The basic idea behind a neural network is actually fairly simple (though the implementation is quite difficult).

Backtracking for a second: with the logistic regression, we had input variables (one for each pixel) and output variables (one for each digit 0-9). In other words, we had a layer of input variables and a layer of output variables.


Example logistic classifier

Neural networks are similar, but with a bunch of other stuff thrown in the middle. In between the input and output layers are 1 or more hidden layers, which do internal processing of the data. For a given layer, each node is connected with all the nodes in adjacent layers. The network is supposed to be similar to actual neurons in the brain.

Example neural network

Every connection in the neural network has a weight associated with it. The layers are computed in succession, starting with the first hidden layer and ending with the output layer. The value of each node is the weighted sum of all the nodes in the previous layer, run through the logistic function. It ends up being like a bunch of logistic classifiers chained together.

Similarly to the logistic regression, training the neural network consists of finding the weights that minimize a cost function. The cost function is minimized with an iterative algorithm that requires knowing the cost function’s gradient (where each weight is a variable). Finding this gradient requires an algorithm known as back propagation, which is an incredibly complicated and unintuitive process. Implementing it is very prone to errors, so it was the cause of much stress.

To start off, we created a neural network consisting of three layers. The first layer - the input layer - had 784 neurons, one to represent each of the 28x28 pixels in the image. The final layer - the output layer - had 10 neurons, one to represent each of the 10 digits that were being read in. This leaves the middle layer, or the hidden layer. To our surprise, the number of neurons in this layer doesn’t actually represent anything. The Coursera course - and other resources from a quick Google search - had some black magic rules for determining the size of this hidden layer. For starters, we used a 20 neuron hidden layer, and later we tried to find an optimal size. (Also, per suggestion of the Coursera course, every layer except the output layer included a biasing neuron. This neuron always had a value of 1. It was connected to every neuron in the next layer, but there were no connections to it from the previous layer.)
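To make that structure concrete, here’s a minimal sketch of a forward pass through such a 784-20-10 network. This isn’t our actual code - it just assumes numpy and that the two weight matrices have already been learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, W1, W2):
    """Predict a digit from a (784,) pixel vector x.

    W1: (20, 785) weights from the input layer (plus biasing neuron) to the hidden layer.
    W2: (10, 21)  weights from the hidden layer (plus biasing neuron) to the output layer.
    """
    a1 = np.append(1.0, x)    # input layer, with its biasing neuron set to 1
    a2 = sigmoid(W1 @ a1)     # hidden layer: weighted sums run through the logistic function
    a2 = np.append(1.0, a2)   # add the hidden layer's biasing neuron
    a3 = sigmoid(W2 @ a2)     # output layer: one value per digit 0-9
    return np.argmax(a3)      # the digit with the highest output
```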

We used all 30k training samples when setting up the neural network. Because the network is so complex, training takes a long time (an hour or two). Overall, our network ended up making predictions on the Kaggle test set with 91% accuracy. Not as high as we were hoping, but not bad either!

Since neural nets are supposed to be very powerful, we knew there had to be a way to improve our network’s accuracy – and again we turned to parameter optimization. In particular, we tried varying the number of nodes in our hidden layer and increasing the number of hidden layers. We tried 100, 150, and 200 hidden layer nodes for networks with one hidden layer and with two hidden layers. (In a perfect world we would have liked to try more values of the parameters. However, when it takes up to 2 hours to train a configuration, 6 different configurations was already pushing the limits of our computers.) The conventional wisdom with neural nets is that bigger is better – so we predicted that more hidden layers with more nodes would improve our accuracy. However, sometimes this heuristic doesn’t hold, so we wanted to be sure we weren’t overshooting. We trained each of these configurations with 24k samples, leaving 6k samples for cross-validation. Results from cross-validation for each configuration are plotted below.

Cross-validation results from parameter optimization


As far as hidden layer size goes, bigger does mean better (for the most part). However, what is surprising is that the 2 hidden layer network was less accurate than the 1 hidden layer network. One possible explanation is that the cost function minimization algorithm did not run for enough iterations. Because there are a lot more parameters to optimize in a 2 hidden layer network, it makes sense that it should take longer to arrive at a true minimum. Thus the weights that the 2 hidden layer network was trained with may not have been optimal.

Using the most accurate network configuration (1 hidden layer with 200 nodes), we obtained 93% accuracy with the Kaggle test set. That’s a 2 percent accuracy boost from our first neural net attempt. It’s a bit sad that the simple k-nearest neighbor algorithm worked better than this more complex one, but perhaps neural networks would pull ahead if we did further optimization. There are other parameters we could choose to look at for optimization – such as the regularization parameter. We could also run the cost function minimization for more iterations to ensure that our weights have converged on their optimal values (especially useful for the 2 hidden layer network). Nevertheless, the simplest way to improve accuracy would be to have more training samples. 30k is definitely a lot, but neural networks perform even better when there are on the order of 100k samples.

Check out our neural network implementation here: https://github.com/rachelbobbins/machinelearning/tree/master/neural_network

Well, that covers all the handwriting recognition algorithms we tried. In our next entry we’ll recap and talk about future steps. Then it’s onto our next project!

Thursday, November 1, 2012

Our first competition, cont. (Part 3: k-Nearest Neighbors)

This algorithm was a last-ditch* attempt to break the 90% success rate as we wrapped up our effort on the digit recognition challenge. The kNN benchmark on Kaggle had a 96.667% success rate, and it used an algorithm not covered in the Coursera course. So, doing my own implementation seemed like a) a way to learn about a new algorithm and b) a likely way of beating the 90% mark.

The k-Nearest Neighbor (kNN) algorithm is surprisingly simple for such an accurate classifier. Given a test sample "t", it:

1. Determines the Euclidean distance between t and every individual training example.
2. Finds the k closest training examples and their known values/labels (in this case, 0-9).
3. Classifies "t" as the most common label (the mode) among its k nearest neighbors.

So, for sample "t" and k=10, let's say the nearest training examples have labels [1, 1, 2, 1, 1, 7, 1, 1, 0, 2]. The algorithm then labels "t" as 1.

We ran the algorithm with an arbitrarily chosen k=15. It took forever (~16 hours) to run on my MacBook Air, with an Intel i5 processor and 4 GB of RAM. It was 95.8% accurate - slightly less accurate than Kaggle's kNN benchmark, which can probably be attributed to the fact that the benchmark used a different value of k (10).

Curious about how k's value would impact the algorithm's accuracy, I wrote an optimization function that would try values of k from 1 to 20, and return the value of k which gave the highest accuracy. It split the training data into two groups - one for training, and one for cross-validation. It used 28k training examples to predict the labels for the other 2k examples, then compared its predictions to the actual labels. Surprisingly, different values of k did not impact the algorithm's accuracy much - it generally achieved between 95 and 97% accuracy. Because the kNN algorithm is so time-intensive, I did not run it again on the test set using the best k.

(*Chronologically speaking, we did this after working on a neural networks approach... but that post hasn't been published yet)

Sunday, October 28, 2012

Our first competition, cont. (Part 2: Logistic Classifier)

If you missed our last entry - we just finished competing in a digit recognition competition on Kaggle. Our first attempt at digit recognition used a very simplistic algorithm (linear average) which resulted in 80.4% accuracy. If a simple algorithm could achieve this accuracy, imagine what a more complex algorithm could achieve! This brings us to...


Approach #2: Logistic Classifier


The logistic classifier was one of the first algorithms we learned in the Coursera course. It is a pretty simple algorithm, but it can be quite powerful in the right application. Logistic classifiers are explained in detail on Coursera; to gain some intuition as to how they work and why they are useful, check out this page.

It turns out logistic classifiers aren’t that great for handwriting recognition. But before we get to the results, a little bit about how they work...


Logistic classifiers are made of a series of logistic regressions. A logistic regression takes a series of weighted input variables (which for our purposes are each of the 784 pixels plus a biasing term) and uses the logistic function to classify samples into one of two categories - either yes or no. Since we wanted to classify samples into 10 categories - one for each digit - we ended up using 10 separate regressions. One would determine the probability of the sample being a zero or not, one would determine the probability of being a one or not, etc. The digit that produced the highest probability would be the one assigned to that sample. This is known as the “one versus all” approach.
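In code, the one-versus-all prediction step looks roughly like this (a sketch assuming numpy arrays and that the 10 sets of weights have already been trained; the helper name is our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(x, all_theta):
    """Pick the digit whose regression gives the highest probability.

    x: (785,) pixel values with a leading 1 for the biasing term.
    all_theta: (10, 785) learned weights, one row per digit 0-9.
    """
    probabilities = sigmoid(all_theta @ x)   # P(sample is digit d) for d = 0..9
    return np.argmax(probabilities)          # digit with the highest probability
```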


When it comes to assigning weights to input variables, the same technique is used for almost every machine learning algorithm. For any given set of weights, we assign the classifier a “cost” that represents the error. A cost of 0 would mean that the weights perfectly classify samples every time, whereas a cost of infinity would mean that the weights classify samples incorrectly every time. (If you think about it, both cases are pretty impressive).


For the logistic classifier, the cost is determined by the following function:
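For reference, this is the standard (unregularized) logistic regression cost from the Coursera course, which matches the description that follows:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)}\log h_\theta\!\left(x^{(i)}\right) + \left(1-y^{(i)}\right)\log\!\left(1-h_\theta\!\left(x^{(i)}\right)\right) \right]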
Without getting into too much detail, x represents the values of each pixel of the image, y represents the digit that the image is supposed to correspond to, h_theta represents the prediction (the weighted inputs run through the logistic function), and m represents the number of training samples. In essence, this function measures on average how far off the predictions are from the actual values. Therefore, we want to find the set of weights that minimizes this cost function.

(In our final version of the classifier, we used a slightly fancier cost function that included something known as a regularization term. This is used to prevent overfitting - which is when your weights work perfectly for your training set but then fail on your test set. You can read more about it here.)
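Again for reference, the regularization term from the course simply adds the following to the cost above, where lambda controls how heavily large weights are penalized (the biasing weight is left out of the sum):

\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}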

Here’s the basic idea behind finding the right set of weights: first you randomly guess the weights and calculate the cost associated with them. You then improve this guess by finding the gradient of the cost function (with respect to the weights). The direction of the gradient tells you how you should change the weights in order to lower the cost. After repeating this process a number of times, you will converge on values that minimize the cost.
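As a concrete sketch of one such improvement step (plain gradient descent for a single regression, without the regularization term, assuming numpy arrays):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One gradient descent step for a single logistic regression.

    theta: (n,) current weights; X: (m, n) training samples;
    y: (m,) labels (1 if the sample is the target digit, else 0); alpha: learning rate.
    """
    m = len(y)
    predictions = sigmoid(X @ theta)            # h_theta(x) for every training sample
    gradient = (X.T @ (predictions - y)) / m    # gradient of the cost function above
    return theta - alpha * gradient             # move against the gradient to lower the cost
```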

We trained the classifier with all 30k training samples and then used it to predict the digits for all 28k Kaggle test samples. Our algorithm had an accuracy of 81.2%, which, surprisingly, was not much better than what we got with the overly-simple linear average algorithm. For all the extra complexity of the logistic classifier, we only gained a 0.8% accuracy boost!

We had done a similar digit recognition exercise on Coursera and gotten much higher accuracy with a logistic classifier (around 90%), so we looked for areas of improvement. We chose to optimize two parameters in our model that we had originally chosen somewhat arbitrarily. The first was the parameter associated with the cost function’s regularization term. The second was the number of iterations used for our cost minimization algorithm. (The more iterations we ran, the more accurate the weights became, but only by a marginal amount.) We trained on 22.5k samples with different combinations of these parameters, then determined which combination was the most accurate by testing each set of weights on a cross-validation set of 7.5k samples.

After finding the optimal parameters, we made predictions for the 28k Kaggle test samples. Our accuracy did not improve much - only by 0.2%. This is not much of an improvement (especially since it took our computers several hours to find the optimal parameter set!), and we are still far off from the 90% accuracy we achieved with the Coursera exercise.

One hypothesis we have for this weaker performance is that the images in the Coursera dataset were smaller than the Kaggle images. The Coursera images were 20x20 pixels, which comes out to roughly half the number of pixels in the Kaggle images. Though this means there are fewer variables to use (recall that each pixel is a variable), there is also less variation between images that represent the same number. So if the Kaggle dataset were made of smaller images, we would probably have much better accuracy.
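If we ever wanted to test this hypothesis, one quick way would be to shrink each Kaggle image before training, e.g. with scipy. The helper below is just a hypothetical sketch, not something in our repo:

```python
import numpy as np
from scipy.ndimage import zoom

def downscale_to_20x20(image_28x28):
    """Shrink a 28x28 Kaggle digit to 20x20 with bilinear interpolation."""
    return zoom(image_28x28, 20.0 / 28.0, order=1)

print(downscale_to_20x20(np.zeros((28, 28))).shape)   # (20, 20)
```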

Check out our implementation here on our Github repo: https://github.com/rachelbobbins/machinelearning/tree/master/logistic_classifier

Friday, October 26, 2012

Wrapping up our first Kaggle competition!


Intro

This semester at Olin College, we (Rachel and Geoff!) are doing a self-study on machine learning. Over the past 2 months, we worked our way through Andrew Ng’s ML course offered via Coursera. At the same time, we’ve been trying to hone our skills by applying some of the class’s techniques to the digit recognition competition on Kaggle. As humans, we’ve learned a lot about how our mechanical counterparts learn (somewhat meta!). So, keep reading for a summary of what we’ve done, and what we’ve gotten out of it.


The Competition

Kaggle's competitions are normally sponsored by companies looking to find the best solutions for their data science problems. Typically, winners are rewarded with money, jobs, or some combination of the two. The digit recognition competition is unique, because it offers neither of these. It is an entry-level competition for people who are learning data science for the first time, or have never entered a Kaggle competition before. Since we fall into both of these categories, it seemed like the perfect way to get our feet wet!

We approached the competition with the mindset of "we're here to learn, not to win" (we didn't really have a choice, because unlike other Kaggle competitions, this one will never declare a winner). Thus our goal was to learn and apply several techniques, not maximize our success with one technique. The techniques we used will be described in this (and subsequent) blog posts. But first, it probably helps to understand the data we were dealing with.


The Data

We used the data provided by Kaggle, which is best described by them. You can go here to see it, but we've also copied it below, for your convenience.

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine. 

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive. 

The training data set, (train.csv), has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image. 

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero). 

For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top of the image (the indexing sketch below shows how to work this out).
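Here is that indexing in Python, assuming one row of 784 pixel values has already been read out of train.csv (the zeros array is just a stand-in):

```python
import numpy as np

pixels = np.zeros(784)           # stand-in for one row of pixel values from train.csv
image = pixels.reshape(28, 28)   # row i, column j holds pixelx where x = i * 28 + j

i, j = divmod(31, 28)            # pixel31 -> row 1, column 3 (zero-indexed)
print(i, j)                      # 1 3: second row from the top, fourth column from the left
```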


Approach #1: Linear classifier

We jumped into this approach before we even started the coursera course. It was inspired by the baseline linear classifier described by LeCun et al. in their comparison of techniques for handwritten digit recognition. The algorithm we used is described in a linear algebra presentation from UC Davis, available online. 



It uses the training data (all 30k samples) to calculate what an "average" n looks like, where n = 0, 1, ..., 9. Below are 10 plots, each representing an average digit. Each digit was plotted as a graph, with point (x, y) being the average value of pixel (x, y) for that digit. Because the digits were plotted as graphs, there is whitespace between points. To compensate for the whitespace, the images were adjusted for contrast in Photoshop. Cool, right?



After figuring out what the average digits look like, we iterated over each of the 28k test samples to figure out which average it most closely resembles - this is our prediction. All 28k predictions were submitted to Kaggle, which provides instant feedback on our success. In this case, we were 80.471% accurate.
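For the curious, the whole approach boils down to something like the sketch below, assuming numpy arrays and using Euclidean distance for "most closely resembles" (the exact implementation is in the repo linked at the end of this post):

```python
import numpy as np

def average_digits(X_train, y_train):
    """Compute the 'average' image for each digit 0-9 from (m, 784) training pixels."""
    return np.array([X_train[y_train == d].mean(axis=0) for d in range(10)])

def predict(t, averages):
    """Label a (784,) test sample with the digit whose average image it is closest to."""
    distances = np.linalg.norm(averages - t, axis=1)
    return np.argmin(distances)
```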

To be honest, this was a lower success rate than we expected. In the paper, the authors described an error rate of 8.4% (or, a success rate of 91.6%). The difference between our success rates is likely that the authors used a bias unit in their calculation, and also deslanted the images before running their algorithm. We chose not to replicate these steps, as we were eager to try other methods. If you're interested in examining all the technicalities of our implementation, check out our Github repository: https://github.com/rachelbobbins/machinelearning/tree/master/linear_classifier