This algorithm was a last-ditch* attempt to break the 90% success rate as we wrapped up our effort on the digit recognition challenge. The kNN benchmark on Kaggle had a 96.667% success rate, and it used an algorithm not covered in Coursera. So, doing my own implementation seemed like a) a way to learn about a new algorithm and b) a likely way of beating the 90% mark.
The k-Nearest Neighbor (kNN) algorithm is surprisingly simple for such an accurate classifier. Given a a test sample "t", it:
1. Determines the euclidean distance between t and every individual training example.
2. Finds the k closest training examples, and their known values/labels (in this case, 0-9)
3. Classifies "t" as the label of its mode of the k nearest neighbors.
So, for sample "t" and k=10, let's say the nearest training examples have labels [1, 1, 2, 1, 1, 7, 1, 1, 0, 2]. The algorithm then labels "t" as 1.
We ran the algorithm with an arbitrarily chosen k=15. It took forever (~16 hours) to run on my MacBook Air, with an Intel i5 processor and 4 GB of RAM. It was 95.8% accurate. This was slightly less accurate than Kaggle's success rate, and probably could have been attributed to the fact that they used a different value of k (10).
Curious about how k's value would impact the algorithm's accuracy, I wrote an optimization function that would try values of k from 1-20, and return the value of k which gave the highest accuracy. It split the training data into two groups - one for training, and one for cross validation. It used 28k training examples to predict the label for 2k training examples, then compared its predictions to the actual labels. Surprisingly, different values of k did not impact the algorithm's accuracy - it generally achieved between 95 and 97% accuracy. Because the kNN algorithm is so time-intensive, I did not run it again using the best k.
(*Chronologically speaking, we did this after working on a neural networks approach... but that post hasn't been published yet)
No comments:
Post a Comment