Archive | November 2012

Five software suggestions for Machine Learning

I talked about why we need machine learning in the previous post, but it sounds a bit too tough to solve in an afternoon with pencil and paper, don’t you think?

In this post I’d like to talk about the software available to help solve machine learning problems. There are solutions that cater to all kinds of needs, so I will go through them briefly so that you can familiarize yourself with the one that suits you best as soon as possible!

There are several programs and languages that can handle the different algorithms we’ll review in the next posts:

– Matlab: This proprietary software is a standard in universities and businesses due to its versatility and power. It is easy to use and learn, and the code in this blog will be written almost entirely for it. There are free alternatives to Matlab, the most compatible and powerful of which is Octave.

– R: The real open source alternative to Matlab in statistics. It is not compatible with Matlab, but R is widely used in academia with very successful results. It lacks a good GUI, but it is a masterpiece.

– Python: Powerful, reliable (and free) libraries have been released specifically for scientific computing and machine learning (NumPy and scikit-learn are good examples). Efficient memory usage, competitive results and competitive computation times make Python very appealing for serious work that may later be deployed in commercial products.

– SAS-based: I have to admit I’m not very familiar with this software family myself. But it is certainly used mostly in corporate environments, due to its simplicity and visualization capabilities. Most of the visual results produced nowadays are generated with this software (and some variants).

– Julia: It shouldn’t be here yet, as it is not as powerful, well-known or even well-suited to machine learning at this moment. Nevertheless, it’s a very promising language, and its versatility and its growth over the past months make me want to suggest it. A really worthwhile idea.

In short, there is no perfect software for everyone, so you will have to choose. I work with Matlab because of the simplicity of its language, but I am actually considering moving partially to Python to avoid memory usage restrictions. Now it’s your turn to try them and choose one to start programming!

We lose many things simply out of our fear of losing them.

Paulo Coelho, Brida


How ML won the US Presidential Elections

The 2008 US presidential election was won by bringing technology into the campaign (first in the Democratic primaries and afterwards in the presidential election itself). The winner exploited Twitter and many other social media tools to reach all potential voters. A good communication policy and the effective use of these powerful mass tools did the rest.

The 2012 election went much further in exploiting this technology. A Time Magazine story, published yesterday, describes how everything was taken into account, from selectively choosing a celebrity to host an event to detecting people’s mood about nearly anything in their lives. One could even say the candidate could actually hear everything we said out loud. It scares me, I have to admit. But it’s also the extreme case in which the politician actually listens to the people.

I’ll talk about Big Data in several later posts, but for now, if you haven’t heard about it, it’s time you learned something about it. The meaning of the term Big Data changes with time and computational capabilities, but let’s say it’s the use of massive amounts of data in machine learning as the way to solve a problem. Both the algorithms and the resources needed for the task have to change, but the efficient use of this information gave Obama’s team a real-time map of the state of the country on nearly any matter. They could even obtain this map for a specific demographic, geographic or social segment. Can you imagine the potential of that?

Consider the fact that even the last-day polls failed almost completely to predict the final result (in terms of the number of electors, who ultimately elect the president). The image displayed in this post was taken from a well-known poll aggregator, RCP, the day before the election (you can click here to go to the actual page). The final result gave the Democrats a much wider advantage.

The question is, why? I can only guess, but the first reason that comes to mind is that all polls are somehow biased by the interests of the polling company’s owners, or of the primary client (usually a party or one of its ‘white brands’). A more technical explanation is that people sometimes lie in these situations, which affects the final accuracy. But if you ask me seriously what happened here, I would answer that the polls didn’t reflect what people thought at that moment, only what they answered when they were asked about politics.

On the other hand, I am not saying regular polls carry no information. Nate Silver has shown twice in a row (2008 and 2012) that, by using polling history (and each pollster’s past accuracy), one can predict the final result with near certainty. In fact, he succeeded in 49/50 states analysed in 2008 (the so-called toss-up states), and in 50/50 states this week (Florida hasn’t finished its vote count, but so far, with 98% of votes counted, it’s also a success for Mr. Silver). So polls are not entirely wrong, are they?
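To make the idea of weighting polls by past accuracy a bit more concrete, here is a minimal Python sketch. It is not Nate Silver’s actual model, which is far more elaborate; it only shows a standard inverse-variance weighted average, where a pollster with a smaller historical error counts for more. The poll figures, the error values and the function names are all made up for illustration.

```python
# Hypothetical polls for one state. Each entry is:
#   (candidate A's lead in points, pollster's typical past error in points)
# All numbers are invented for illustration only.
polls = [
    (2.0, 2.5),
    (-1.0, 4.0),
    (3.5, 3.0),
    (0.5, 5.0),
]


def simple_lead(polls):
    """Plain average of the leads, treating every pollster equally."""
    return sum(lead for lead, _ in polls) / len(polls)


def weighted_lead(polls):
    """Average of the leads, weighting each poll by 1 / (past error squared).

    This is inverse-variance weighting: historically accurate pollsters
    pull the estimate towards their own number more strongly.
    """
    weights = [1.0 / (err ** 2) for _, err in polls]
    total = sum(weights)
    return sum(w * lead for w, (lead, _) in zip(weights, polls)) / total


if __name__ == "__main__":
    print("Simple average lead:   %.2f" % simple_lead(polls))
    print("Accuracy-weighted lead: %.2f" % weighted_lead(polls))
```

With these invented numbers the weighted estimate leans towards the first and third pollsters, because their assumed past errors are the smallest. If all pollsters had the same past error, both functions would return the same value.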

The conclusion here is not about politics. It is about machine learning, and how it can be used on a real problem in an environment as complex as an entire society. And about how it succeeds when used correctly and competently.


Be careful what you wish for because it might come true

Aesop, Fables