Sunday, October 28, 2012

Our first competition, cont. (Part 2: Logistic Classifier)

If you missed our last entry - we just finished competing in a digit recognition competition on Kaggle. Our first attempt at digit recognition used a very simplistic algorithm (linear average), which resulted in 80.4% accuracy. If a simple algorithm could achieve this accuracy, imagine what a more complex algorithm could achieve! This brings us to...


Approach #2: Logistic Classifier


The logistic classifier was one of the first algorithms we learned in the coursera course. It is a pretty simple algorithm, but it can be pretty powerful if used in the right application. Logistic classifiers are explained in detail on Coursera. To gain some intuition as to how they work and why they are useful, check out this page.

It turns out logistic classifiers aren't great for handwriting recognition. But before we get to the results, a little bit about how they work...


Logistic classifiers are made of a series of logistic regressions. A logistic regression takes a series of weighted input variables (which for our purposes are each of the 784 pixels plus a biasing term) and uses the logistic function to classify samples into one of two categories - either yes or no. Since we wanted to classify samples into 10 categories - one for each digit - we ended up using 10 separate regressions. One would determine the probability of the sample being a zero or not, one would determine the probability of it being a one or not, etc. The number that produced the highest probability would be the number assigned to that sample. This is known as the "one versus all" approach.
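
For the curious, here's a rough sketch of that one-versus-all prediction step in Python/NumPy - not the code we actually used, just the idea. Theta here is an assumed 10x785 matrix holding one row of weights per digit (bias weight included).

import numpy as np

def sigmoid(z):
    # The logistic function: squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, X):
    # Theta: 10 x 785 matrix of weights (one row per digit, bias weight included)
    # X: m x 784 matrix of pixel values (one row per image)
    m = X.shape[0]
    X_bias = np.hstack([np.ones((m, 1)), X])  # prepend the biasing term
    probs = sigmoid(X_bias @ Theta.T)         # m x 10 "is it this digit?" probabilities
    return np.argmax(probs, axis=1)           # pick the digit with the highest probability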


When it comes to assigning weights to input variables, the same technique is used for almost every machine learning algorithm. For any given set of weights, we assign the classifier a "cost" that represents the error. A cost of 0 would mean that the weights perfectly classify samples every time, whereas a cost of infinity would mean that the weights classify samples incorrectly every time. (If you think about it, both cases are pretty impressive.)


For the logistic classifier, the cost is determined by the following function:
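Written out in LaTeX-ish notation, it's the standard (unregularized) logistic regression cost from the course:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]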
Without getting into too much detail: x represents the values of each pixel of the image, y represents the digit that the image is supposed to correspond to, h_theta represents the prediction made by applying the logistic function to the weighted pixels, and m represents the number of training samples. In essence, this function measures how far off, on average, the predictions are from the actual values. Therefore, we want to find the set of weights that minimizes this cost function.

(In our final version of the classifier, we used a slightly fancier cost function that included something known as a regularization term. This is used to prevent overfitting - which is when your weights work perfectly for your training set but then fail on your test set. You can read more about it here.)
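
Concretely, regularization adds a term along the lines of \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 to the cost above, where lambda controls how heavily large weights are penalized (the bias weight is typically left out of the sum).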

Here's the basic idea behind finding the right set of weights: first, you randomly guess the weights and calculate the cost associated with them. You then improve this guess by finding the gradient of the cost function (with respect to the weights). The direction of the gradient tells you how to change the weights in order to lower the cost. After repeating this process a number of times, you converge on a set of weights that minimizes the cost.
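
A bare-bones sketch of that loop in Python/NumPy - plain gradient descent for a single yes/no regression, with a made-up learning rate alpha. (In practice you'd likely hand the cost and gradient to an off-the-shelf minimizer instead.)

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, num_iters=400):
    # X: m x n input matrix (bias column already prepended), y: m-vector of 0/1 labels
    m, n = X.shape
    theta = np.random.randn(n) * 0.01        # random initial guess for the weights
    for _ in range(num_iters):
        h = sigmoid(X @ theta)               # current predictions
        grad = (X.T @ (h - y)) / m           # gradient of the cost w.r.t. the weights
        theta -= alpha * grad                # step "downhill" to lower the cost
    return theta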

We trained the classifier with all 30k training samples and then used it to predict the digits for all 28k Kaggle test samples. Our algorithm had an accuracy of 81.2%, which, surprisingly, was not much better than what we got with the overly simple linear average algorithm. For all the extra complexity of the logistic classifier, we only gained a 0.8% accuracy boost!

We did a similar digit recognition exercise on Coursera and got much higher accuracy with a logistic classifier (around 90%), so we looked for areas of improvement. We chose to optimize two parameters in our model that we had originally chosen somewhat arbitrarily. The first was the parameter associated with the cost function's regularization term. The second was the number of iterations used by our cost minimization algorithm. (The more iterations we ran, the more accurate the weights became, but only by a marginal amount.) We trained on 22.5k samples with different combinations of these parameters, and then determined which combination was the most accurate by testing each set of weights on a cross-validation set of 7.5k samples.
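
Conceptually, this was just a grid search scored on the cross-validation set. Here's a rough equivalent using scikit-learn's one-vs-rest logistic regression - not the tooling we actually used, and the candidate values are made up - run on toy stand-in data so the sketch is self-contained:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy stand-in data so the sketch runs on its own; in practice these would be
# the 22.5k training samples and the 7.5k cross-validation samples.
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(300, 784), rng.randint(0, 10, 300)
X_cv, y_cv = rng.rand(100, 784), rng.randint(0, 10, 100)

best = (None, None, 0.0)
for C in [0.01, 0.1, 1.0, 10.0]:          # C is the inverse of the regularization strength
    for iters in [50, 100, 200]:          # cap on the optimizer's iterations
        clf = OneVsRestClassifier(LogisticRegression(C=C, max_iter=iters))
        clf.fit(X_train, y_train)
        acc = clf.score(X_cv, y_cv)       # accuracy on the cross-validation set
        if acc > best[2]:
            best = (C, iters, acc)

print("best C=%s, iterations=%s, CV accuracy=%.3f" % best)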

After finding the optimum weights, we made predictions for the 28k Kaggle test samples. Our accuracy improved by only 0.2% - not much of a gain (especially since it took our computers several hours to find the optimal parameter set!), and we are still far from the 90% accuracy we achieved with the Coursera exercise.

One hypothesis we have for this weaker performance is that the images in the Coursera dataset were smaller than the Kaggle images. The Coursera images were 20x20 pixels, which comes out to roughly half of the 784 pixels in the Kaggle images. Though this means there are fewer variables to work with (recall that each pixel is a variable), there is also less variation between images that represent the same number. So if the Kaggle dataset had been made of smaller images, we probably would have seen much better accuracy.
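
If we ever want to test this hypothesis, one quick-and-dirty way to shrink the Kaggle images is block averaging. The sketch below (which we haven't actually tried) halves each dimension to 14x14 - not exactly the Coursera 20x20, but close in spirit:

import numpy as np

def downsample(image):
    # image: a 28x28 NumPy array of pixel values.
    # Average each 2x2 block of pixels, producing a 14x14 image.
    return image.reshape(14, 2, 14, 2).mean(axis=(1, 3))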

Check out our implementation here on our Github repo: https://github.com/rachelbobbins/machinelearning/tree/master/logistic_classifier
