Tuesday, December 11, 2012

Blinded by the Light, Or Lack Thereof

We didn't want our complete lack of conceptual understanding to keep us from competing, so our first entry was based on algorithms provided by the competition itself.

In particular, we looked at two provided benchmark methods: gridded signal and maximum likelihood. Gridded signal uses a measure known as the signal, which we will talk about in the next post. (For this first attempt, we really didn't understand what the signal metric was measuring; only later did we figure out what it was actually doing.)

Maximum likelihood is best defined by the competition itself, in this introductory blog post:

Assume a profile of Dark Matter halo and then try to fit this model to the data. From this find the most likely position of the halo. So one such model could be that the distortion caused by a Dark Matter halo has a 1/r drop off, where r is the distance from the center of the halo. This code finds the likelihood of a halo at a particular position and then assumes that the position with the maximum likelihood is the position of the halo.
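
To make that concrete, here is a minimal sketch of that kind of grid search in Python. Everything in it is our own illustration, not the competition's actual benchmark code: the 1/r model with a `strength` scale, the Gaussian noise of width `sigma`, the 4200-unit sky size, and the function name are all assumptions. The inputs are arrays of galaxy positions (`x`, `y`) and ellipticity components (`e1`, `e2`) for one sky.

```python
import numpy as np

def max_likelihood_position(x, y, e1, e2, sky_size=4200.0,
                            n_grid=10, strength=50.0, sigma=0.2):
    """Grid search for the single most likely halo position.

    Assumes each galaxy's tangential ellipticity is Gaussian noise
    around a strength / r signal, where r is the galaxy's distance
    to the candidate halo. Illustrative model only.
    """
    centers = (np.arange(n_grid) + 0.5) * sky_size / n_grid
    best_pos, best_loglik = None, -np.inf
    for x0 in centers:
        for y0 in centers:
            dx, dy = x - x0, y - y0
            r = np.maximum(np.hypot(dx, dy), 1.0)  # guard against r = 0
            phi = np.arctan2(dy, dx)
            # Tangential ellipticity of each galaxy about (x0, y0).
            e_tang = -(e1 * np.cos(2 * phi) + e2 * np.sin(2 * phi))
            # Gaussian log-likelihood of the residuals from the 1/r model.
            loglik = -0.5 * np.sum(((e_tang - strength / r) / sigma) ** 2)
            if loglik > best_loglik:
                best_pos, best_loglik = (x0, y0), loglik
    return best_pos
```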

Most notably (for our purposes), each of these methods predicts the position of only one halo. Since a sky can contain more than one halo, these benchmarks needed to be extended to detect multiple halos.

As a first attempt, we decided to combine the two methods to find multiple halos, using a very uninformed assumption: if we run both methods on a single sky and their predictions land within 200 units of each other, they're probably detecting the same halo; if the predictions are more than 200 units apart, they're probably detecting different halos.

Using this assumption, we made our first entry to the competition. For skies whose two predictions were within 200 units of each other, we decided there was only one halo, placed at the gridded signal prediction. For skies whose two predictions were more than 200 units apart, we decided there were two halos, one at each prediction.
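
In code, the whole decision rule is just a distance check. A minimal sketch, with the 200-unit threshold from above and a hypothetical function name of our own:

```python
import numpy as np

def predict_halos(signal_pos, likelihood_pos, threshold=200.0):
    """Combine the two benchmark predictions for one sky.

    If the two predictions are within `threshold` units, assume they
    found the same halo and keep the gridded-signal position;
    otherwise report both positions as separate halos.
    """
    dist = np.hypot(signal_pos[0] - likelihood_pos[0],
                    signal_pos[1] - likelihood_pos[1])
    if dist <= threshold:
        return [signal_pos]              # one halo
    return [signal_pos, likelihood_pos]  # two halos
```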

With this naive method, we achieved a score of 1.63!! To us, this was a fantastic first step and an indicator of wonderful things to come. It beat the maximum likelihood benchmark* (lower scores are better here) and narrowly missed the gridded signal benchmark's score of 1.58.

One obvious area of improvement was to make the two methods uniform. As defined by the competition, the maximum likelihood method divided the sky into a 10 by 10 grid, trying 100 potential positions for the halo, while the gridded signal method used a 15 by 15 grid, evaluating 225 potential positions. We predicted that using a uniform grid size would improve results, so we modified the maximum likelihood method to use a 15 by 15 grid and made our predictions by the same rule as above. Lo and behold, this solution achieved a score of 1.47505.
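
In terms of the earlier sketch (again assuming our hypothetical `max_likelihood_position` function), the change is a one-argument tweak:

```python
# Before: the benchmark's 10 x 10 grid (100 candidate positions).
pos_coarse = max_likelihood_position(x, y, e1, e2, n_grid=10)

# After: the same 15 x 15 grid the gridded signal method uses
# (225 candidate positions).
pos_fine = max_likelihood_position(x, y, e1, e2, n_grid=15)
```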

Sadly, this turned out to be our best "legitimate" score of the competition. At the time, however, we had no idea that this would be the case. We thought we would be able to build a much better algorithm than this naive approach. And we failed. But more on that later...



* We were really pleased by this, until we realized that the maximum likelihood benchmark was awful. The maximum likelihood method is somehow less accurate than random guessing. Being excited about beating it is like being excited about beating a baby at chess... why get so excited about something inevitable?
