Rachel + Geoff Study ML: A New Hope

At this point in the competition, we had a pretty good method for predicting the position of halos. However, we had no accurate way of predicting the number of halos.

We realized that what we needed was an algorithm that can tell us more about each predicted halo so that we can determine if each predicted halo is real or not. So we switched gears and tried a new position predicting algorithm -- one that came to be known as the "better subtraction" algorithm, since it was a revision of the "subtracting the effect of known halos" method

Here's the basic idea behind "better subtraction": it predicts the first position of the halo using the signal method. It then "subtracts away" the effect of the first halo on every galaxy. The resulting sky is a good approximation of what the sky would look like without the first halo. The cycle of predict-and-subtract is repeated 4 more times, for a total of 5 predicted halo positions. Afterwards, we predict the number of halos (n) that exist in the sky, and the first n (out of 5) halos stand as our prediction

An in-depth look at "better subtraction"

So what does it mean to "subtract away" the effect of the first halo? Well, we know that halos cause galaxies to be more elliptical. If we can determine how much the first halo makes a galaxy elliptical, then we can modify the ellipticity of the galaxies to "remove" this halo!

To figure out how much to subtract away, we first create a copy of the sky in which all galaxies are perfect circles. We then figure out how elliptical the galaxies would be in this sky if the first halo was inserted into it. We'll define the amount of ellipticity due to the first halo with the components e1_h1 and e2_h2 (For a reminder of what these two coordinates mean, go here). Here's some facts that we know:

The amount that each galaxy is affected is proportional to 1/r, where r is the euclidean distance between the galaxy and the halo.
If each galaxy was naturally circular and there was only the one halo in the sky, than each galaxy would appear tangential to the halo.
We know that the tangential component of ellipticity can be calculated by the following equation:

-(e1_h1 * cos(2*phi) + e2_h1 * sin(2*phi))

where phi is the angle between the halo and the galaxy with respect to the x-axis.

Based on these two facts, we can predict e1_h1 and e2_h2 are with the following equations:

e1_h1 = (m / r) * -(tan(2*theta) sin(2*phi)) + cos(2*phi))

e2_h1 = e1_h1 * tan(2 * theta)

where theta is the angle of the galaxy's major axis with respect to the x-axis, and m is the relative strength of the halo.

There's just one problem: we don't know what m -- the relative strength of the halo -- is... We only know its location and the signal at its position in the sky.

However, we can get a really good guess for what m should by guessing and checking. Let's say the coordinates of the first halo are (x_h1, y_h1). If we remove the effects of the halo, then we can expect the signal at (x_h1, y_h1) to be 0.* If there's nothing at (x_h1, y_h1) that's affecting the ellipticity of the galaxies, then there should be no net orientation of galaxies about that point!

Alright, with all that information combined, we now have a new algorithm, which is written out in pseudocode below:

make an initial guess of m
calculate signal at (x_h1, y_h1)
while (signal at (x_h1, y_h1) != 0):
if the signal is > 0:
make m greater (halo is stronger)
if the signal is < 0:
make m smaller (halo is weaker)
for each galaxy in the sky
calculate (e1_h1, e2_h2) given m
subtract (e1_h1, e2_h2) from (e1, e2)
calculate signal at (x_h1, y_h1)

What this code is doing, in essence, is guessing what m is until the signal at x_h1, y_h1 is 0. Once we find m and "subtract away" the first halo, we can repeat the same process to predict where a second halo would be, and a third halo, etc. This algorithm turns out to be about as accurate as our last one at predicting halo positions. It can predict the positions of the first two halos with high accuracy, but is not so great at predicting the third. Given the position of halos, it scored a 1.08, which is about the same as our last algorithm. So not much of an improvement with predicting halo positions...

Using position to predict quantity

Here's the really cool thing about this algorithm -- for each predicted halo, we get a number -- m -- that represents how strong the halo is. This is a new measurement that we didn't have before! And this statistic provides us with useful data for predicting how many halos actually exist in the sky.

Here's what we can do with m: if m is a large number, we expect it's more likely that the predicted halo actually exists, because it is unlikely that such a strong source would be the product of randomness. If m is small, it is much less likely that the predicted halo actually exists (although it could just be a smaller halo).

Because the halos varied in strength, we chose to look at the ratio between m-values for the predicted halos. Let m₁ be the strength of the first predicted halo, let m₂ be the strength of the second predicted halo, etc. If the sky has only 1 halo, we expect the ratio of m₂:m₁to be low, since m₂represents the strength of a non-existent halo. Furthermore, we expect the ratio of m₃:m₂to be approximately 1, because they represent equally non-existent halos. Generally, in a sky with n halos, we predict that m_n+1:m_n will be much less than 1, and that m_n+2:m_n+1 will be approximately 1.

To see this idea in action, let's take a look at the two CDFs below. The CDFs show the ratios m₂:m₁ and m₃:m₂ for a bunch of skies that only contain one halo:

The m₂:m₁ ratios are all less than 1, and the m3:m2 ratios follow a more normal distribution, with a mean near 1. This is what we would expect given that there's only one halo in the sky. Science works.

Here's a pesudocode implementation of our method for predicting the number of halos in the sky:

n = 1
while True:
calculate m_n and m_n+1 (do this by predicting the position of the nth halo and the n+1th halo)
ratio_n = m_n+1 and m_n
if (ratio_n is approx 1)
guess that there are n-1 halos in the sky
break loop
else
n++

So at this point we had an algorithm for predicting the positions of halos, and an algorithm for predicting the number of halos as well! We combined both of them and submitted another guess to Kaggle...

... and got a score of 1.49. Worse than our first attempt :/ (Remember that our first attempt was just blindly slapping together two benchmark algorithms).

As we continued to focus on predicting the number of halos, we finally figured out a way to apply a traditional machine learning algorithm: k-nearest neighbors! This seemed like the best ML algorithm to use, because the training set wasn't big enough for a neural network or other classifiers. Using the ratios m₂:m₁, m₃:m₂, m₄:m₃, m₅:m₄ as "features", and our training set of 300 halos, we classified each sky as either a one halo, two halo, or three halo sky. Our score on the test data didn't improve at all.

At this point, we tried doing a bunch of optimizations to improve our score. Here are some other things we looked at

Subtracting away only a portion of the galaxies, and not the whole thing. We optimized over the training data to determine how much we should subtract away.
Using probabilistic ratios to inform our predictions. In the training set, and the test set, we knew that 1/3 of the skies had one halo, 1/3 had two, and 1/3 had 3. We incorporated this information into our algorithm.
Detecting anomalies. When we predicted the position of halos, a number of them appeared to be on the edge of the sky. We thought that these predictions might be anomalies, and we threw many of them out.

After trying all of these methods, our score didn't improve at all. And that's when we decided to give up.

Looking back, we think there were two main reasons our algorithm for predicting the number of halos didn't work very well.

Lack of training data. It's hard to machine learn with only 300 training samples.
Shaky position-predicting algorithm. Our algorithm for predicting the number of halos uses data from the position-predicting algorithm. Our position-predicting algorithm -- "better subtraction" -- was highly accurate at predicting the positions of the first and second halos, but not nearly as accurate in predicting the third. Therefore, it was hard to tell whether the predicted third sky was an actual sky or not. We confirmed this by looking at skies with the same number of halos. When there was only one halo in the sky, our algorithm predicted the number of halos correctly about 80% of the time. However, for two or three halos, this accuracy dropped down close to 40% -- not much better than random guessing.

In summary, if we wanted to improve our overall accuracy, our best bet would be improving our position-predicting algorithm. However, by this point we were close to the end of the semester, and we didn't really have time to try any new methods. And thus ended our dark matter quest.

* In reality, the signal at that point won't be zero, because other halos in the sky will be affecting the ellipticity of the galaxies. However, this simplifying assumption proved to be fairly accurate.

Rachel + Geoff Study ML

Wednesday, December 12, 2012

A New Hope

An in-depth look at "better subtraction"

Using position to predict quantity

No comments:

Post a Comment