Wednesday, December 12, 2012

The black hole where code goes to die

Over the next few weeks, we tried a lot of approaches that did not work so well. The purpose of each of these approaches was to determine either a) the number of halos, or b) the positions of the halos, but not both. Our goal was to find a solution to each of these problems independently, then combine them into a single sophisticated algorithm. Below is a list of a few techniques that failed us:

Finding the positions of halos

1. Optimizing grid size in the gridded signal benchmark. Recall the gridded signal benchmark (provided by Kaggle) that contributed to our first submission in the competition. It works by dividing the sky up into a 15x15 grid, where each grid box is a 280x280 unit square. It calculates the signal at the center of the box based only on the galaxies contained by the box, and the box with the highest signal is deemed to be the one containing the halo.

We wanted to apply the techniques we'd learned from Coursera, and this benchmark lent itself well to parameter optimization. We tried several grid sizes (from 5x5 to 30x30, in increments of 5), and evaluated each one on the training data. The 25x25 grid worked best, but not by a significant margin. We applied the gridded signal method to the test data using this parameter, and achieved a score of 2.52128 when we submitted the results to Kaggle.

This shockingly terrible score (using its default parameters, the benchmark had a achieved a 1.58) was an artifact of how Kaggle scores competitors: it only analyzes 25% of a solution while the competition in progress; it analyzes the other 75% after the competition is over. The purpose behind this is to stop people from gaming the system - understandable, but frustrating. Given that the test data only has 120 skies, the score is calculated based on 30 predictions. It's simply not all that informative.

2. Subtracting the effect of known halos. This approach was our most original, and most successful. So successful in fact, that the entire next post will be devoted to it. Our first iteration of this method, however, was not so successful. Here's the basic idea: we calculate the position of the first halo, based off of the signal measurement. To find the position of the second halo, we calculated a modified signal which ignores off the effects of the first halo.

Assume that there's a halo at (x1, y1) and we're trying to calculate the signal at (x2, y2) due to the galaxy G. (Recall that the total signal at a point is based off of the individual signal that each galaxy produces.) Then, let s_g1 be the signal due to G at (x1, y1), and let s_g2 be the signal due to G at (x2, y2). Since we know there's a halo at (x1, y1), we're pretty certain that s_g2 contains some of the affects of the first halo.


Let v1 be the distance vector between (x1, y1) and the galaxy, and let v2 be the distance vector between (x2, y2), as seen in the picture. We can calculate s_g2_g1 -- the amount of s_g2 that was influenced by s_g1 -- with the following formula:
s_g2_g1 = dot(v1,v2) * s_g1
where dot is the dot product. Therefore if we subtract s_g2_g1 away from s_g2, it's like we're subtracting away the effects of the first halo!

This approach was actually fairly accurate at predicting the positions of all three halos. The drawback to it is that it didn't contain any information that quantified the effect of each halo. As a result, we weren't able to predict the number of halos in the sky at all (as we discuss in the next paragraphs). For our second iteration of this algorithm -- which we'll discuss in the next post -- we tried a different subtraction method which gave us more information about each predicted halo.

Finding the number of halos

1. Geometric relationships between halos. We had several hypotheses about the geometric relationships between halos, all of which turned out to be false.
  • The distance between halos in a sky with 2 halos will be significantly less than the distances between halos in a sky with 3 halos. (Nope, these distances turned out to have virtually identical normal distributions)
  • In skies with 3 halos, the triangle created by connecting the halos will have certain minimum area. (Again, a normal distribution -- didn't tell us anything)

2. Relationship between signal strength of halos. Given that we've predicted the position of 1 halo, and we predict the position of a second halo, what is the relationship between their signals? Our hypothesis was that when a second halo actually exists, its signal would be along the same order of magnitude as the first halo. When the second halo does not actually exist, its signal would be at least one order of magnitude less than the first halo. In fact, no such relationship existed.

While this second approach was unsuccessful, it inspired a later approach that was moderately successful, in which we looked at a different measure that quantified the strength of halos.



No comments:

Post a Comment