At this point we’ve attempted fairly simple machine learning
algorithms. Time to bring out the big guns...
Neural networks are one of the most commonly used algorithms for handwriting recognition. The basic idea behind a neural network is actually fairly simple (though the implementation is fairly difficult).
Backtracking for a second: with logistic regression, we had input variables (one for each pixel) and output variables (one for each digit, 0-9). In other words, we had a layer of input variables and a layer of output variables.
Example logistic classifier
Neural networks are similar, but with a bunch of other stuff
thrown in the middle. In between the input and output layers are 1 or more hidden layers, which do internal processing
of the data. For a given layer, each node is connected with all the nodes in
adjacent layers. The network is supposed to be similar to actual neurons in the
brain.
Example neural network
Every connection in the neural network has a weight associated with it. The values of the nodes are computed layer by layer, starting with the first hidden layer and ending with the output layer. The value of a node is the weighted sum of all the nodes in the previous layer, run through the logistic function. It ends up being like a bunch of logistic classifiers chained together.
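As a rough sketch, here's what that chained forward computation looks like in Python. The names, shapes, and random weights are purely illustrative (our actual implementation is in the repo linked at the end):

```python
import numpy as np

def sigmoid(z):
    # The logistic function that each node's weighted sum is run through
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """x: input values; W1, W2: weight matrices (each with a bias column)."""
    a1 = np.append(1.0, x)                 # input layer, with bias unit
    a2 = np.append(1.0, sigmoid(W1 @ a1))  # hidden layer activations, with bias unit
    a3 = sigmoid(W2 @ a2)                  # output layer: one score per class
    return a3

# Tiny example: 4 "pixels", 3 hidden units, 2 output classes
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 5))  # 3 hidden units x (4 inputs + bias)
W2 = rng.standard_normal((2, 4))  # 2 outputs x (3 hidden units + bias)
out = forward(rng.random(4), W1, W2)
print(out.shape)  # (2,)
```

Each layer is literally a logistic regression over the previous layer's outputs, which is why the "chained logistic classifiers" picture holds.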
Similarly to logistic regression, training the neural network consists of finding the weights that minimize a cost function. The cost function is minimized with an iterative algorithm that requires knowing the cost function's gradient (where each weight is a variable). Finding the gradient requires an algorithm known as backpropagation, which is an incredibly complicated and unintuitive process. Its implementation is very prone to errors, so this was the cause of much stress.
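To give a flavor of the gradient computation (and of the standard defense against those implementation errors, a numerical gradient check), here is a hedged single-example sketch for a one-hidden-layer network. All names are illustrative, not from our actual code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(x, y, W1, W2):
    # Forward pass (bias units prepended, as in forward propagation)
    a1 = np.append(1.0, x)
    a2 = np.append(1.0, sigmoid(W1 @ a1))
    a3 = sigmoid(W2 @ a2)
    cost = -np.sum(y * np.log(a3) + (1 - y) * np.log(1 - a3))

    # Backward pass: propagate the error from the output layer back
    d3 = a3 - y                                       # output-layer error
    d2 = (W2[:, 1:].T @ d3) * a2[1:] * (1 - a2[1:])   # hidden-layer error (bias excluded)
    return cost, np.outer(d2, a1), np.outer(d3, a2)   # cost, grad W1, grad W2

# Gradient check: compare one analytic entry against a finite difference
rng = np.random.default_rng(1)
W1 = rng.standard_normal((3, 5)); W2 = rng.standard_normal((2, 4))
x, y = rng.random(4), np.array([1.0, 0.0])
cost, g1, g2 = cost_and_grads(x, y, W1, W2)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (cost_and_grads(x, y, W1p, W2)[0] - cost_and_grads(x, y, W1m, W2)[0]) / (2 * eps)
print(abs(numeric - g1[0, 0]) < 1e-5)  # analytic and numeric gradients agree
```

Perturbing each weight by a tiny amount and comparing the resulting cost change against the analytic gradient is slow but catches most backpropagation bugs.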
To start off, we created a neural network consisting of three
layers. The first layer - the input layer - had 784 neurons, one to represent
each of the 28x28 pixels in the image. The final layer - the output layer - had
10 neurons, one to represent each of the 10 digits that were being read in.
This leaves the middle layer, or the hidden layer. To our surprise, the number of neurons in this layer doesn't actually represent anything. The Coursera course - and other resources from a quick Google search - had some black-magic rules for determining the size of this hidden layer. To start, we used a 20-neuron hidden layer, and later we tried to find an optimal size. (Also, per the suggestion of the Coursera course, every layer except the output layer included a bias neuron. This neuron always had a value of 1. It was connected to every neuron in the next layer, but had no incoming connections from the previous layer.)
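Concretely, the bias neurons just add one column to each weight matrix. For the 784-20-10 architecture described above, the shapes work out like this (a sketch with illustrative names; the small random initialization breaks the symmetry between hidden units):

```python
import numpy as np

n_input, n_hidden, n_output = 784, 20, 10  # pixels, hidden neurons, digits

rng = np.random.default_rng(0)
# The +1 columns are the weights attached to each layer's bias neuron
W1 = rng.uniform(-0.1, 0.1, (n_hidden, n_input + 1))   # hidden weights: 20 x 785
W2 = rng.uniform(-0.1, 0.1, (n_output, n_hidden + 1))  # output weights: 10 x 21
print(W1.shape, W2.shape)  # (20, 785) (10, 21)
```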
We used all 30k training samples when setting up the neural
network. Because the network is so complex, training takes a long time (an hour
or two). Overall, our network ended up making predictions on the Kaggle test
set with 91% accuracy. Not as high as we were hoping, but not bad either!
Since neural nets are supposed to be very powerful, we knew there had to be a way to improve our network's accuracy – and again we turned to parameter optimization. In particular, we tried varying the number of nodes in
our hidden layer and increasing the number of hidden layers. We tried 100, 150,
and 200 hidden layer nodes for networks with one hidden layer and two hidden
layers. (In a perfect world we would have liked to try more values of the parameters. However, when it takes up to 2 hours to train a configuration, 6 different configurations are already pushing the limits of our computers.) The
conventional wisdom with neural nets is that bigger is better – so we predicted
that more hidden layers with more nodes would improve our accuracy. However,
sometimes this heuristic doesn’t hold, so we wanted to be sure we weren’t
overshooting. We trained each of these configurations with 24k samples, leaving
6k samples for cross-validation. Results from cross-validation for each
configuration are plotted below.
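The 24k/6k split and the parameter sweep amount to something like the following sketch (random placeholder data stands in for the Kaggle samples, and the commented-out calls stand in for our actual training code):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((30000, 784))          # placeholder for the 30k training images
y = rng.integers(0, 10, 30000)        # placeholder for the digit labels

# Shuffle, then hold out 6k samples for cross-validation
idx = rng.permutation(30000)
train_idx, cv_idx = idx[:24000], idx[24000:]
X_train, y_train = X[train_idx], y[train_idx]
X_cv, y_cv = X[cv_idx], y[cv_idx]

for n_hidden in (100, 150, 200):
    # model = train_network(X_train, y_train, n_hidden)   # hypothetical trainer
    # acc = np.mean(predict(model, X_cv) == y_cv)         # CV accuracy
    pass
print(X_train.shape, X_cv.shape)  # (24000, 784) (6000, 784)
```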
As far as hidden layer size goes, bigger does mean better (for the most part). However, what is surprising is that the 2-hidden-layer network was less accurate than the 1-hidden-layer network. One possible explanation for this is that the cost function minimization algorithm did not run enough iterations. Because there are a lot more parameters to optimize in a 2-hidden-layer network, it would make sense that it should take longer to arrive at a true minimum. Thus the weights that the 2-hidden-layer network was trained with may not have been optimal.
Using the most accurate network configuration (1 hidden layer with 200 nodes), we obtained 93% accuracy on the Kaggle test set. That's a 2-percentage-point boost over our first neural net attempt. It's a bit sad that
the simple k-nearest neighbor algorithm worked better than this more complex
one, but perhaps neural networks would be better if we did further optimizations
to increase accuracy. There are other parameters we could choose to look at for
optimization – such as the regularization parameter. We could also run cost
function minimization for more iterations to ensure that our weights have
converged on their optimal values (especially useful for the 2 hidden layer
network). Nevertheless, the simplest way to improve accuracy would be to have more training samples. 30k is definitely a lot, but neural networks perform even better when there are on the order of 100k samples.
Check out our neural network implementation here: https://github.com/rachelbobbins/machinelearning/tree/master/neural_network
Well, that covers all the handwriting recognition algorithms we
tried. In our next entry we’ll recap and talk about future steps. Then it’s
onto our next project!