Rachel + Geoff Study ML: 2012

Wednesday, December 19, 2012

All's well that ends well

After our update last Thursday, we thought we were done with the competition. It was time to finish up the rest of our finals and enjoy one last weekend at Olin. It had been a good run, but we didn't meet our original goal of beating the lenstool method.

Today I got home for winter break, excited to catch my dad (hi dad!) up on everything that happend this semester. I was telling about about the competition, and went to show him the leaderboard - it had been updated since the competition ended on Sunday. The final results were now visible! Recall that the public leaderboard was only based on 25% of predictions - the private leaderboard (based on the other 75% of predictions) determines the final winner.

Geoff and I came in 96th out of 357 participants! This is way better than our public leaderboard rank of 186th place. Furthermore, our score of 0.89688 blew the lenstool's score of 1.05 out of the water. We beat our original goal!

Thursday, December 13, 2012

A (moderately embarassing) update

After finishing this blog last night, we were excited to share it with other participants in the competition. We posted a link on the competition's discussion forum. Today, we got the following response:

...and that, my readers, is a giant glaring "OOPS" on our part. We spent a lot of time trying to predict the number of halos, when in fact, this data was given to us. On the one hand, it means our final "sketchy" attempt was actually legitimate. On the other hand... we spent *a lot* of time working on a non-existent problem. On the third hand (we've got four of them between the both of us)...we've got another "lesson learned" to add to our list from the last post:

Read and re-read the rules of a competition before diving into it.

Wednesday, December 12, 2012

Conclusion

We didn't meet our original goal of beating the LENSTOOL metric, but realistically, this was more of a learning experience for us than a competition. By viewing it primarily as a learning experience, we can confidently say that it has been a resounding success. Below are some of the things we learned:

1. Machine learning is not a "one approach fits all" type of field. None of the techniques we learned during the digit recognition competition were helpful, and recognizing this took a couple of weeks. It was a disheartening realization at first, but recognizing forced us to think more creatively than we might have otherwise.

2. It's possible to devise an effective ML algorithm without lots of training data. We didn't figure out how to do this, but some people evidently did (as shown in the leaderboard screenshot below).

We're excited to find out about the approaches that successful competitors used after the competition ends; these might serve as inspiration to us for future ML endeavors.

3. It's helpful to have an advanced degree when solving a problem for which canonical machine learning algorithms are not applicable. Looking at the profiles of top competitors, we saw that most of them had advanced degrees in areas like computer science and data mining; one guy even had a PhD in particle physics.

4. Our best legitimate approach was the very first one, which combined two different benchmark methods. It was the simplest method by far - perhaps the old adage is true, that simpler is better.

5. Spiral learning is a legitimate educational philosophy! This project reminded us both of "Modeling and Simulation", a required course for first-year students at Olin - we took it 3 years ago in Fall 2009 (class website). In that class we were taught the governing equations for certain real world systems, and learned the programming/modeling skills necessary to explain and predict the behavior of those systems. This competition enabled us to tackle a similar problem, but to a much higher degree - this time, we figured out the governing equations ourselves! It was really satisfying to use the same approach that we used freshman year, this time with three years of experience and confidence behind each of us.

Overall, we didn't do as well as we'd hoped, but we definitely learned a lot from the competition. It was a great experience to work on a real-world problem, with data that had to be deciphered and with new concepts that had to be learned. We were able to attack a difficult problem in a field which we had no prior knowledge about, and we were able to do so in creative ways. More than anything, this competition has given us the confidence and desire to tackle more data science problems in the future.

One last hurrah

After giving up, the lingering question was "how well could we have done, given more time to devise an accurate method for predicting the number of halos?" Neither of us are big fans of unanswered questions, so we took a semi-sketchy approach to answering this question...

The competition's best benchmark predictions were produced by the existing LENSTOOL software, and those predictions were available to all participants. We had noticed early on that in these predictions, skies 1-40 each have one halo, skies 41-80 each have two halos, and skies each 81-120 each have three halos. We used this information and our position method to determine the positions of these known quantities of halos.

This last try achieved a score of 1.08476. It did not beat the LENSTOOL method (our original goal), but it was damn close. How well could we have done with more time to devise an accurate method for predicting the number of halos? Pretty well, in our humble opinion.

A New Hope

At this point in the competition, we had a pretty good method for predicting the position of halos. However, we had no accurate way of predicting the number of halos.

We realized that what we needed was an algorithm that can tell us more about each predicted halo so that we can determine if each predicted halo is real or not. So we switched gears and tried a new position predicting algorithm -- one that came to be known as the "better subtraction" algorithm, since it was a revision of the "subtracting the effect of known halos" method

Here's the basic idea behind "better subtraction": it predicts the first position of the halo using the signal method. It then "subtracts away" the effect of the first halo on every galaxy. The resulting sky is a good approximation of what the sky would look like without the first halo. The cycle of predict-and-subtract is repeated 4 more times, for a total of 5 predicted halo positions. Afterwards, we predict the number of halos (n) that exist in the sky, and the first n (out of 5) halos stand as our prediction

An in-depth look at "better subtraction"

So what does it mean to "subtract away" the effect of the first halo? Well, we know that halos cause galaxies to be more elliptical. If we can determine how much the first halo makes a galaxy elliptical, then we can modify the ellipticity of the galaxies to "remove" this halo!

To figure out how much to subtract away, we first create a copy of the sky in which all galaxies are perfect circles. We then figure out how elliptical the galaxies would be in this sky if the first halo was inserted into it. We'll define the amount of ellipticity due to the first halo with the components e1_h1 and e2_h2 (For a reminder of what these two coordinates mean, go here). Here's some facts that we know:

The amount that each galaxy is affected is proportional to 1/r, where r is the euclidean distance between the galaxy and the halo.
If each galaxy was naturally circular and there was only the one halo in the sky, than each galaxy would appear tangential to the halo.
We know that the tangential component of ellipticity can be calculated by the following equation:

-(e1_h1 * cos(2*phi) + e2_h1 * sin(2*phi))

where phi is the angle between the halo and the galaxy with respect to the x-axis.

Based on these two facts, we can predict e1_h1 and e2_h2 are with the following equations:

e1_h1 = (m / r) * -(tan(2*theta) sin(2*phi)) + cos(2*phi))

e2_h1 = e1_h1 * tan(2 * theta)

where theta is the angle of the galaxy's major axis with respect to the x-axis, and m is the relative strength of the halo.

There's just one problem: we don't know what m -- the relative strength of the halo -- is... We only know its location and the signal at its position in the sky.

However, we can get a really good guess for what m should by guessing and checking. Let's say the coordinates of the first halo are (x_h1, y_h1). If we remove the effects of the halo, then we can expect the signal at (x_h1, y_h1) to be 0.* If there's nothing at (x_h1, y_h1) that's affecting the ellipticity of the galaxies, then there should be no net orientation of galaxies about that point!

Alright, with all that information combined, we now have a new algorithm, which is written out in pseudocode below:

make an initial guess of m
calculate signal at (x_h1, y_h1)
while (signal at (x_h1, y_h1) != 0):
if the signal is > 0:
make m greater (halo is stronger)
if the signal is < 0:
make m smaller (halo is weaker)
for each galaxy in the sky
calculate (e1_h1, e2_h2) given m
subtract (e1_h1, e2_h2) from (e1, e2)
calculate signal at (x_h1, y_h1)

What this code is doing, in essence, is guessing what m is until the signal at x_h1, y_h1 is 0. Once we find m and "subtract away" the first halo, we can repeat the same process to predict where a second halo would be, and a third halo, etc. This algorithm turns out to be about as accurate as our last one at predicting halo positions. It can predict the positions of the first two halos with high accuracy, but is not so great at predicting the third. Given the position of halos, it scored a 1.08, which is about the same as our last algorithm. So not much of an improvement with predicting halo positions...

Using position to predict quantity

Here's the really cool thing about this algorithm -- for each predicted halo, we get a number -- m -- that represents how strong the halo is. This is a new measurement that we didn't have before! And this statistic provides us with useful data for predicting how many halos actually exist in the sky.

Here's what we can do with m: if m is a large number, we expect it's more likely that the predicted halo actually exists, because it is unlikely that such a strong source would be the product of randomness. If m is small, it is much less likely that the predicted halo actually exists (although it could just be a smaller halo).

Because the halos varied in strength, we chose to look at the ratio between m-values for the predicted halos. Let m₁ be the strength of the first predicted halo, let m₂ be the strength of the second predicted halo, etc. If the sky has only 1 halo, we expect the ratio of m₂:m₁to be low, since m₂represents the strength of a non-existent halo. Furthermore, we expect the ratio of m₃:m₂to be approximately 1, because they represent equally non-existent halos. Generally, in a sky with n halos, we predict that m_n+1:m_n will be much less than 1, and that m_n+2:m_n+1 will be approximately 1.

To see this idea in action, let's take a look at the two CDFs below. The CDFs show the ratios m₂:m₁ and m₃:m₂ for a bunch of skies that only contain one halo:

The m₂:m₁ ratios are all less than 1, and the m3:m2 ratios follow a more normal distribution, with a mean near 1. This is what we would expect given that there's only one halo in the sky. Science works.

Here's a pesudocode implementation of our method for predicting the number of halos in the sky:

n = 1
while True:
calculate m_n and m_n+1 (do this by predicting the position of the nth halo and the n+1th halo)
ratio_n = m_n+1 and m_n
if (ratio_n is approx 1)
guess that there are n-1 halos in the sky
break loop
else
n++

So at this point we had an algorithm for predicting the positions of halos, and an algorithm for predicting the number of halos as well! We combined both of them and submitted another guess to Kaggle...

... and got a score of 1.49. Worse than our first attempt :/ (Remember that our first attempt was just blindly slapping together two benchmark algorithms).

As we continued to focus on predicting the number of halos, we finally figured out a way to apply a traditional machine learning algorithm: k-nearest neighbors! This seemed like the best ML algorithm to use, because the training set wasn't big enough for a neural network or other classifiers. Using the ratios m₂:m₁, m₃:m₂, m₄:m₃, m₅:m₄ as "features", and our training set of 300 halos, we classified each sky as either a one halo, two halo, or three halo sky. Our score on the test data didn't improve at all.

At this point, we tried doing a bunch of optimizations to improve our score. Here are some other things we looked at

Subtracting away only a portion of the galaxies, and not the whole thing. We optimized over the training data to determine how much we should subtract away.
Using probabilistic ratios to inform our predictions. In the training set, and the test set, we knew that 1/3 of the skies had one halo, 1/3 had two, and 1/3 had 3. We incorporated this information into our algorithm.
Detecting anomalies. When we predicted the position of halos, a number of them appeared to be on the edge of the sky. We thought that these predictions might be anomalies, and we threw many of them out.

After trying all of these methods, our score didn't improve at all. And that's when we decided to give up.

Looking back, we think there were two main reasons our algorithm for predicting the number of halos didn't work very well.

Lack of training data. It's hard to machine learn with only 300 training samples.
Shaky position-predicting algorithm. Our algorithm for predicting the number of halos uses data from the position-predicting algorithm. Our position-predicting algorithm -- "better subtraction" -- was highly accurate at predicting the positions of the first and second halos, but not nearly as accurate in predicting the third. Therefore, it was hard to tell whether the predicted third sky was an actual sky or not. We confirmed this by looking at skies with the same number of halos. When there was only one halo in the sky, our algorithm predicted the number of halos correctly about 80% of the time. However, for two or three halos, this accuracy dropped down close to 40% -- not much better than random guessing.

In summary, if we wanted to improve our overall accuracy, our best bet would be improving our position-predicting algorithm. However, by this point we were close to the end of the semester, and we didn't really have time to try any new methods. And thus ended our dark matter quest.

* In reality, the signal at that point won't be zero, because other halos in the sky will be affecting the ellipticity of the galaxies. However, this simplifying assumption proved to be fairly accurate.

The black hole where code goes to die

Over the next few weeks, we tried a lot of approaches that did not work so well. The purpose of each of these approaches was to determine either a) the number of halos, or b) the positions of the halos, but not both. Our goal was to find a solution to each of these problems independently, then combine them into a single sophisticated algorithm. Below is a list of a few techniques that failed us:

Finding the positions of halos

1. Optimizing grid size in the gridded signal benchmark. Recall the gridded signal benchmark (provided by Kaggle) that contributed to our first submission in the competition. It works by dividing the sky up into a 15x15 grid, where each grid box is a 280x280 unit square. It calculates the signal at the center of the box based only on the galaxies contained by the box, and the box with the highest signal is deemed to be the one containing the halo.

We wanted to apply the techniques we'd learned from Coursera, and this benchmark lent itself well to parameter optimization. We tried several grid sizes (from 5x5 to 30x30, in increments of 5), and evaluated each one on the training data. The 25x25 grid worked best, but not by a significant margin. We applied the gridded signal method to the test data using this parameter, and achieved a score of 2.52128 when we submitted the results to Kaggle.

This shockingly terrible score (using its default parameters, the benchmark had a achieved a 1.58) was an artifact of how Kaggle scores competitors: it only analyzes 25% of a solution while the competition in progress; it analyzes the other 75% after the competition is over. The purpose behind this is to stop people from gaming the system - understandable, but frustrating. Given that the test data only has 120 skies, the score is calculated based on 30 predictions. It's simply not all that informative.

2. Subtracting the effect of known halos. This approach was our most original, and most successful. So successful in fact, that the entire next post will be devoted to it. Our first iteration of this method, however, was not so successful. Here's the basic idea: we calculate the position of the first halo, based off of the signal measurement. To find the position of the second halo, we calculated a modified signal which ignores off the effects of the first halo.

Assume that there's a halo at (x1, y1) and we're trying to calculate the signal at (x2, y2) due to the galaxy G. (Recall that the total signal at a point is based off of the individual signal that each galaxy produces.) Then, let s_g1 be the signal due to G at (x1, y1), and let s_g2 be the signal due to G at (x2, y2). Since we know there's a halo at (x1, y1), we're pretty certain that s_g2 contains some of the affects of the first halo.

Let v1 be the distance vector between (x1, y1) and the galaxy, and let v2 be the distance vector between (x2, y2), as seen in the picture. We can calculate s_g2_g1 -- the amount of s_g2 that was influenced by s_g1 -- with the following formula:

s_g2_g1 = dot(v1,v2) * s_g1

where dot is the dot product. Therefore if we subtract s_g2_g1 away from s_g2, it's like we're subtracting away the effects of the first halo!

This approach was actually fairly accurate at predicting the positions of all three halos. The drawback to it is that it didn't contain any information that quantified the effect of each halo. As a result, we weren't able to predict the number of halos in the sky at all (as we discuss in the next paragraphs). For our second iteration of this algorithm -- which we'll discuss in the next post -- we tried a different subtraction method which gave us more information about each predicted halo.

Finding the number of halos

1. Geometric relationships between halos. We had several hypotheses about the geometric relationships between halos, all of which turned out to be false.

The distance between halos in a sky with 2 halos will be significantly less than the distances between halos in a sky with 3 halos. (Nope, these distances turned out to have virtually identical normal distributions)
In skies with 3 halos, the triangle created by connecting the halos will have certain minimum area. (Again, a normal distribution -- didn't tell us anything)

2. Relationship between signal strength of halos. Given that we've predicted the position of 1 halo, and we predict the position of a second halo, what is the relationship between their signals? Our hypothesis was that when a second halo actually exists, its signal would be along the same order of magnitude as the first halo. When the second halo does not actually exist, its signal would be at least one order of magnitude less than the first halo. In fact, no such relationship existed.

While this second approach was unsuccessful, it inspired a later approach that was moderately successful, in which we looked at a different measure that quantified the strength of halos.

Tuesday, December 11, 2012

Signal: The Foundation of all our Algorithms

In the last post we referred to the signal metric, which "we really didn't understand." As subscribers to Olin College's philosophy of "do-learn," the next step was understanding the signal metric. It played a crucial role in the rest of the algorithms we developed, so this post will be devoted to an in-depth explanation of what it is.

In particular, we looked at a number known as the signal at every point in each sky. The signal is a measurement of how the ellipticities of each galaxies align with each other at a given point in a sky. Here's the basic idea behind it: all galaxies are elliptical to some extent. If the sky is free of dark matter, we'd expect that their major axes would all be pointing in random directions, as shown in Figure 1.

Figure 2: Elliptical galaxies in a sky with dark matter

Figure 1: Elliptical galaxies in a sky with no dark matter

However, in the presence of dark matter, the major axis of each galaxy will have a tendency to be tangential to the halo, as shown in Figure 2.

The signal is a unit-less measurement of how tangential all the galaxies are to a given point in the sky. It is an average of the tangential ellipticity of each galaxy relative to that given point. The ellipticity of a point (x,y) tangential to (x' y') is as follows (equations from here)

In pseudocode, this algorithm is as follows.

for each x point:
for each y point:
sig = average([e_tangential for every galaxy in this sky])

if sig > max_sig:
predicted_x = this x
predicted_y = this y
max_sig = sig

add this signal to a matrix of (X, Y, Signal) #used for plotting
predicted_halo = (predicted_x, predicted_y)

To gain a better intuition about signals, take another look at Figures 1 and 2. In Figure 2, the point at the center of the figure has a large, positive signal. In Figure 1, there is no net alignment of the galaxies. Therefore the point at the center of the galaxy would have a signal number near 0.

Figure 3: Elliptical galaxies in a sky with no halos (again)

If it appears that galaxies are aligning themselves away from a point (as they are in Figure 3, then it is very unlikely that there's a halo at that point. The center of this figure would have a negative signal number. If a sky has a small area that looked like Figure 3, it is likely there is a halo(s) elsewhere in the sky stretching the galaxies beyond their natural ellipticity.

The larger the signal is at a point, the higher the probability is that a halo exists there. If we calculate the signal at every point in a sky, we get a pretty good estimate of where one of the halos is located. For efficiency reasons, we calculated the signal at every 100 points, not at every single point. Figure 4 contains an example sky plotted on the same grid as its signal.

Figure 4: Signal for a sky with one halo.

Figure 5: Signal map of a sky with 3 halos

This sky only has one halo, marked by the star. The galaxies are represented by the black ellipses on the figure, and the color map represents the signal. As you can see, the strongest signal occurs right where the halo is. So this measure seems to work very well!

Here's the problem: we don't always only have one halo in the sky. A lot of times there's two or three halos. The signal metric doesn't account for this, since it chooses the single point with the highest signal. Take a look at Figure 5 - it clearly indicates the location of one halo, but without the blue and pink stars, it would tell you nothing about the other two halos' existence. Almost all the skies we examined looked like this, regardless of how many halos were actually in the sky. Since one halo is generally larger than the others, it overshadows the signal at the other halos.

We hoped to expand this method to work for predicting more than 1 halo. In addition we needed a solid way of predicting how many halos are in a given sky. We had a few ideas, but some turned out to be more successful than others...

Blinded by the Light, Or Lack Thereof

We didn't want our ability to compete to be hindered by our complete lack of conceptual understanding. So, our first entry to the competition was based off of algorithms provided to us by the competition.

In particular, we looked at two provided benchmark methods: gridded signal, and maximum likelihood. Gridded signal uses a measure known as the signal -- which we will talk about in the next post. (For this first attempt, we really didn't understand what the signal metric was measuring. It was only later that we understood what it was actually doing.)

Maximum likelihood is best defined by the competition itself, on this introductory blog post:

Assume a profile of Dark Matter halo and then try to fit this model to the data. From this find the most likely position of the halo. So one such model could be that the distortion caused by a Dark Matter halo has a 1/r drop off, where r is the distance from the center of the halo. This code finds the likelihood of a halo at a particular position and then assumes that the position with the maximum likelihood is the position of the halo.

Most notably (for our purposes) each of these methods only predicts the position of 1 halo. Therefore, these benchmarks needed to be expanded upon to detect multiple halos.

As a first attempt, we decided to combine both of these metrics to find multiple halos. We made a very uninformed assumption: if we use both these methods on a single sky, and their predictions are within 200 units of each other, they're probably detecting the same halo. If the two predictions are greater than 200 units apart, they're probably different halos.

Using this assumption, we made our first entry to the competition. For skies whose 2 predictions were within 200 units of each other, we determined that their was only one halo in the sky, whose position was the result of the gridded signal metric. For skies whose 2 predictions were greater than 200 units apart, we determined that there were two halos, with each prediction telling us the position of one of the halos.

With this naive method, we achieved a score of 1.63!! To us, this was a fantastic first step, and an indicator of wonderful things to come. It scored higher than the maximum likelihood benchmark*, and narrowly missed the gridded signal benchmark's score of 1.58.

One obvious area of improvement was to add some uniformity between the two methods. As defined by the competion, the max likelihood method divided the sky into a 10 by 10 grid, trying out 100 potential positions for the halo. The gridded signal method used a 15 by 15 grid, evaluating 225 potential positions for the halo. We predicted that by using a uniform grid size, results would be better. We modified the max likelihood method to use a 15 by 15 grid, and made our predictions by the same method as above: lo and behold, this solution achieved a score of 1.47505.

Sadly, this turned out to be our best "legitimate" score of the competition. However, at the time, we had no idea that this would be the case. We thought would be able to make a much better algorithm than this naive approach. And we failed. But more on that later...

* We were really pleased by this, until we realized that the maximum likelihood benchmark was awful. The maximum likelihood methods somehow is less accurate than random guessing. Being excited about beating this method is like being excited about beating a baby in chess... why get so excited about something inevitable?

To Infinity... and Beyond! Observing Dark Worlds

Fresh off our first machine learning competition, we were hungry for more. We decide to enter another Kaggle competition called Observing Dark Worlds. This competition was way more advanced than the digit recognition competition, but we were up for the challenge. Because this was an actual competition with real money at stake, we witheld all blog posts to the end of the competition...we also wrote all of them at the end of the competition. So, past tense.

Here's the problem posed by the competition:

Dark matter makes up most of the universe, but -- since it doesn't emit or absorb light -- we can't see it. However, we can detect it because it warps and bends spacetime. Light emitted from starts in a galaxy will be distorted by the time it reaches the earth, causing galaxies to appear more elliptical than they normally would. Dark matter tends to appear in clusters, which are known as halos. Our job is to locate the position of these halos, given a sky of galaxies.

Affect of Dark Matter on Shapes of Galaxies

Dark matter bending light from a background galaxy.

Note that galaxies have some natural amount of ellipticity independent of the ellipticity caused by dark matter. So, the correlation between galaxies' ellipticities and the presence of dark matter is much less obvious than one might expect. The graphic below should help you to visualize this phenomenon:

(Both figures are from the project description at http://www.kaggle.com/c/DarkWorlds)

For this competition, we were given a training data set of 300 skies, each with a number of galaxies and between 1 and 3 dark matter halos. For each galaxy in a sky, we were given its position coordinates and its ellipticity. We were not given any information about the halo(s) in each sky. Our job was to determine 1) how many dark matter halos were in the sky, and 2) the position coordinates of each halo.

And that's the problem we spent the six weeks working on. The competition included 4 benchmark methods, and our personal goal was to beat all of them. The competition had its own scoring method for all entries, and to "beat" all the benchmarks, we needed to achieve a better score than them. Like golf, a lower score was a better score for this competition.

Sunday, November 4, 2012

Our first competition, cont. (Wrapping it all up - actually)

So this is our final post on the handwriting recognition competition! It was swell, but now it’s time to say goodbye...

In summary: out of the four algorithms we tried, k-nearest neighbors was the most accurate (95%), followed by our neural network (93%). Our logistic classifier was the next most accurate (81%), followed closely by our linear average classifier (80%). We tried parameter optimization for the logistic classifier and the neural network, and received 0.2% and 2% accuracy boosts, respectively.

Did we try all we could have tried? No. If we were really serious about the performance of our algorithms, there’s a lot we could have investigated - better optimizations, performance enhancements, and more advanced techniques. For us, the purpose of this competition was to introduce us to machine learning, and this was definitely accomplished. We implemented 4 different algorithms from scratch, learned how to test algorithmic performance, and even tackled some machine learning system design with parameter optimization. I’d say our goal was more than accomplished. For our next project, we’ll be sure to try much more advanced techniques, think more carefully about system design, and try to test out any hypothesis we may have for improving our system.

Here are some major things we learned along the way:

More complicated does not mean better. We saw this with the k-nearest neighbor classifier - which was conceptually simple and easy to implement. It ended up outperforming all our other algorithms, despite other algorithms being more complex (especially neural networks).

The 80-20 rule really applies. We coded our first algorithm in under an hour, and right away that gave us 80% accuracy. The rest of our time was spent trying to improve on this. When we tried optimizing parameters, we didn’t get much improvement either. However, there’s a huge difference between 95% accuracy and 80% accuracy. Consider how the USPS uses handwriting recognition to read addresses off of letters. Imagine if their classifier only had 80% accuracy - that would be a lot of letters delivered to wrong addresses. Even if it had 99% accuracy - if they deliver millions of letters a day that’s tens of thousands of letters that get incorrectly sent. If precision matters, every tenth or hundredth of a percent counts.

Visualizing results is helpful. When dealing with big data, little bugs can go unnoticed, especially if they don’t seem to have much effect on the system. Sometimes the best solution is to slow down and watch as the computer trains or tests each sample individually. With digit recognition, it helped a lot to display the picture of each training or test sample as the algorithm classified it.

If we were to make our recognizer more accurate, here’s some things we would try:

Optimize better. We could try increasing the number of different values we test over for each parameter, or we could try optimizing over different parameters.

Image preprocessing. We tried a bit of this at the end, but without much success. To decrease the variation between samples, we could rotate each sample to be facing the same direction, remove any border on the samples, etc. Less variation means more accurate detection.

Introduce artificial training samples. More training samples means more accuracy. Often this will produce better results than trying to improve the algorithm. Though we used all the Kaggle training samples, we could have made some of our own. One way to do this is to make modified copies of the existing samples. For example, it is possible that some digits had thicker line widths than others. If we take each sample, make 3 copies of it with varying line widths and use all of these samples for training, we could multiply the our training sample size by 4. All of the artificial examples would be plausible as real examples, so this could increase our accuracy.

Well, that’s all we have on handwriting recognition. Next up: identifying dark matter halos!!!
(And in case you’re wondering, we are currently placed 429th for the handwriting recognition competition. It’s a popular one!)

Friday, November 2, 2012

Our first competition, cont. (Part 4: Neural Networks)

At this point we’ve attempted fairly simple machine learning algorithms. Time to bring out the big guns...

Neural networks are one of the most commonly used algorithms for handwriting recognition. The basic idea behind a neural network is actually fairly simple (though the implementation is fairly difficult).

Backtracking for a second: with the logistic regression, we had input variables (one for each pixel) and output variables (one for each digit 0-9). In other words, we had a layer of input variables and a layer of output variables.

Example logistic classifier

Neural networks are similar, but with a bunch of other stuff thrown in the middle. In between the input and output layers are 1 or more hidden layers, which do internal processing of the data. For a given layer, each node is connected with all the nodes in adjacent layers. The network is supposed to be similar to actual neurons in the brain.

Example neural network

Every connection in the neural network has a weight associated with it. Each layer is learned in succession, starting off with the first hidden layer and ending with the output layer. The value of a node is determined by the weighted sum of all the nodes in the previous layer, which is then run through the logistic function. It ends up being like a bunch of logistic classifiers chained together.

Similarly to the logistic regression, training the neural network consists of finding the weights that minimize a cost function. The cost function is minimized with an iterative algorithm that requires knowing the cost function gradient (where each weight is a variable). Finding the gradient requires an algorithm for this is known as back propagation, which is an incredibly complicated and unintuitive process. The implementation of it is very prone to errors, so this was the cause of much stress.

To start off, we created a neural network consisting of three layers. The first layer - the input layer - had 784 neurons, one to represent each of the 28x28 pixels in the image. The final layer - the output layer - had 10 neurons, one to represent each of the 10 digits that were being read in. This leaves the middle layer, or the hidden layer. To our surprise, the number of neurons in this layer don’t actually represent anything. The Coursera course - and other resources from a quick Google search - had some black magic rules for determining the size of this hidden layer. For starters, we used a 20 neuron hidden layer, and later we tried to find an optimal size. (Also, per suggestion of the coursera course, every layer except the output layer included a biasing neuron. This neuron, for each layer, always had a value of 1. It was connected to every neuron in subsequent layers, but there were no connections to it from the previous layer.)

We used all 30k training samples when setting up the neural network. Because the network is so complex, training takes a long time (an hour or two). Overall, our network ended up making predictions on the Kaggle test set with 91% accuracy. Not as high as we were hoping, but not bad either!

Since neural nets are supposed to be very powerful, we knew there had to be a way to improve its accuracy – and again we turned to parameter optimization. In particular, we tried varying the number of nodes in our hidden layer and increasing the number of hidden layers. We tried 100, 150, and 200 hidden layer nodes for networks with one hidden layer and two hidden layers. (In a perfect world we would have liked to try more values of the parameters. However, when it takes up to 2 hours to train a configuration, 6 different configurations is already pushing the limits of our computers). The conventional wisdom with neural nets is that bigger is better – so we predicted that more hidden layers with more nodes would improve our accuracy. However, sometimes this heuristic doesn’t hold, so we wanted to be sure we weren’t overshooting. We trained each of these configurations with 24k samples, leaving 6k samples for cross-validation. Results from cross-validation for each configuration are plotted below.

Cross-validation results from parameter optimization

As far as hidden layer size goes, bigger does mean better (for the most part). However, what is surprising is that the 2 hidden layer network was less accurate than the 1 layer network. One possible explanation for this is that the cost function minimization algorithm did not run enough iterations. Because there are a lot more parameters to optimize for a 2 hidden layer network, it would make sense that it should take longer to arrive at a true minimum. Thus the parameters that the 2 hidden layer network were trained with may not have been optimal.

Using the most accurate network configuration (1 hidden layer with 200 nodes), we obtained 93% accuracy with the Kaggle test set. That’s a 2 percent accuracy boost from our first neural net attempt. It’s a bit sad that the simple k-nearest neighbor algorithm worked better than this more complex one, but perhaps neural networks would be better if we did further optimizations to increase accuracy. There are other parameters we could choose to look at for optimization – such as the regularization parameter. We could also run cost function minimization for more iterations to ensure that our weights have converged on their optimal values (especially useful for the 2 hidden layer network). Nevertheless, the simplest way to improve accuracy would be to have more training samples. 30k is definitely a lot, but neural networks perform even better when there’s on the order of 100k samples.

Check out our neural network implementation here: https://github.com/rachelbobbins/machinelearning/tree/master/neural_network

Well, that covers all the handwriting recognition algorithms we tried. In our next entry we’ll recap and talk about future steps. Then it’s onto our next project!

Thursday, November 1, 2012

Our first competition, cont. (Part 3: k-Nearest Neighbors)

This algorithm was a last-ditch* attempt to break the 90% success rate as we wrapped up our effort on the digit recognition challenge. The kNN benchmark on Kaggle had a 96.667% success rate, and it used an algorithm not covered in Coursera. So, doing my own implementation seemed like a) a way to learn about a new algorithm and b) a likely way of beating the 90% mark.

The k-Nearest Neighbor (kNN) algorithm is surprisingly simple for such an accurate classifier. Given a a test sample "t", it:

1. Determines the euclidean distance between t and every individual training example.
2. Finds the k closest training examples, and their known values/labels (in this case, 0-9)
3. Classifies "t" as the label of its mode of the k nearest neighbors.

So, for sample "t" and k=10, let's say the nearest training examples have labels [1, 1, 2, 1, 1, 7, 1, 1, 0, 2]. The algorithm then labels "t" as 1.

We ran the algorithm with an arbitrarily chosen k=15. It took forever (~16 hours) to run on my MacBook Air, with an Intel i5 processor and 4 GB of RAM. It was 95.8% accurate. This was slightly less accurate than Kaggle's success rate, and probably could have been attributed to the fact that they used a different value of k (10).

Curious about how k's value would impact the algorithm's accuracy, I wrote an optimization function that would try values of k from 1-20, and return the value of k which gave the highest accuracy. It split the training data into two groups - one for training, and one for cross validation. It used 28k training examples to predict the label for 2k training examples, then compared its predictions to the actual labels. Surprisingly, different values of k did not impact the algorithm's accuracy - it generally achieved between 95 and 97% accuracy. Because the kNN algorithm is so time-intensive, I did not run it again using the best k.

(*Chronologically speaking, we did this after working on a neural networks approach... but that post hasn't been published yet)

Sunday, October 28, 2012

Our first competition, cont. (Part 2: Logistic Classifier)

If you missed our last entry - we just finished competing in a digit recognition competition on Kaggle. Our first attempt at digit recognition used a very simplistic algorithm (linear average) which resulted in 80.4% accuracy. If a simple algorithm could achieve this accuracy, imagine what a more complex algorithm could achieve! This bring us to...

Approach #2: Logistic Classifier

The logistic classifier was one of the first algorithms we learned in the coursera course. It is a pretty simple algorithm, but it can be pretty powerful if used in the right application. Logistic classifiers are explained in detail on Coursera. To gain some intuition as to how they work and why they are useful, check out this page.

Turns out logistic classifiers aren’t very great for handwriting recognizers. But before we get to the results, a little bit about how they work...

Logistic classifiers are made of a series of logistic regressions. A logistic regressions takes a series of weighted input variables (which for our purpose are each of the 784 pixels plus a biasing term) and uses the logistic function to classify samples into one of two categories - either yes or no. Since we wanted to classify samples into 10 categories - one for each digit - we ended up using 10 separate regressions. One would determine the probably of the sample being a zero or not, one would determine the probability of being a one or not, etc. The number that produced the highest probability would be the number assigned to that sample. This is known as the “one versus all” approach.

When it comes to assigning weights to input variables, the same technique is used for almost every machine learning algorithm. For any given set of weights, we assign the classifier a “cost” that represents the error. A cost function of 0 would mean that that the weights perfectly classify samples every time, whereas a cost of infinity would mean that the weights classify samples incorrectly every time. (If you think about it, both cases are pretty impressive).

For the logistic classifier, the cost is determined by the following function:
Without getting into too much detail, x represents the values of each pixel of the image, y represents the digit that the image is supposed to correspond to, h_theta represents weights and the logistic function, and m represents the number of training samples. In essence, what this function is doing is seeing on average how far off the predictions are from the actual values. Therefore, we want to find a set of weights that minimizes this cost function.

(In our final version of the classifier, we used a slightly fancier cost function that included something known as a regularization term. This is used to prevent data overfitting - which is when your weights work perfectly for your training set but then fail with your test set. You can read more about it here)

Here’s the basic idea behind finding the right set of weights: first you randomly guess the weights and calculate the cost function associated with the weights. You then improve this guess by finding the gradient of the cost function (with respect to the weights). The direction of the gradient tells you how you should change the weights in order to lower the cost function. After repeating this process a number of times, you will converge on values that minimizes the cost.

We trained the classifier with all 30k training samples and then used it to predict the digits for all 28k Kaggle test samples. Our algorithm had an accuracy of 81.2%, which, surprisingly, was not much better than we got with the overly-simple linear average algorithm. For all the extra complexity of the logistic classifier, we only gained a 0.8% accuracy boost!

When we did a similar digit recognition exercise on Coursera and got much higher accuracy with a logistic classifier (around 90%), so we looked for areas of improvement. We chose to optimize two parameters in our model that we chose somewhat arbitrary. The first was a parameter associated with the cost function’s regularization term. The second parameter was the number of iterations used for our cost minimization algorithm. (The more iterations we ran, the more accurate the weights became, but only by a marginal amount). We trained 22.5k samples with different combinations of these parameters, and then determined which parameter combination was the most accurate by testing each set of weights on a cross-validation set of 7.5k samples.

After finding the optimum weights, we made predictions for the 28k Kaggle test samples. Our accuracy did not improve much - only by 0.2%. This is not much improvement (especially since it took our computers several hours to find the optimal parameter set!), and we are still far off of the 90% accuracy we achieved with the Coursera exercise.

One hypothesis we have for this weaker performance is that the images in the Coursera dataset were smaller than the Kaggle images. The Coursera images were 20x20 pixels, which comes out to nearly half of the pixels in the Kaggle images. Though this means that there are fewer variables to use (recall that each pixel is a variable), there will also be less variation between images that represent the same number. So if the Kaggle dataset was made of smaller images, we probably would have much better accuracy.

Check out our implementation here on our Github repo: https://github.com/rachelbobbins/machinelearning/tree/master/logistic_classifier