What you will learn – and not learn – from Kaggle-style competitions

Competitions around AI and machine learning have become increasingly common over the past five years. In these competitions you are generally given two sets of data – a training set and a test set – and you are asked to submit your predictions on the test set. In this post I want to talk about how these competitions work, what you can learn from them and, most importantly, what you will not be able to learn from them.


These competitions are generally created by industry participants who want to improve their ability to carry out some particular task. For example, Netflix created a competition to improve its ability to predict whether a user would like a certain piece of media; other examples include predicting kinetic constants in drug development, fraud in credit card transactions or daily returns in financial markets.

You can participate in these competitions for free, and they are a great way to practice the basics of machine learning on real – yet already curated and assembled – data sets. With these data sources you can learn to tackle basic data-cleaning problems, like dealing with missing values and outliers, and you can learn a lot about the technical side of model building, including cross-validation, training and prediction. They also give you practical experience with the most commonly used and powerful machine learning algorithms and libraries, and with how to actually use and tune them in practice.
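To make that concrete, here is a minimal sketch of the basic workflow these competitions exercise – cleaning, then cross-validating a model. The file name train.csv and the target column are hypothetical placeholders, not taken from any specific competition:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")   # hypothetical competition training file
y = df.pop("target")            # hypothetical label column

# Basic cleaning: fill missing numeric values with the column median
# and clip extreme outliers to each column's 1st/99th percentile.
num = df.select_dtypes("number")
num = num.fillna(num.median())
num = num.clip(lower=num.quantile(0.01), upper=num.quantile(0.99), axis=1)

# Five-fold cross-validation estimates out-of-sample accuracy – the same
# kind of score a competition leaderboard would report.
scores = cross_val_score(GradientBoostingClassifier(), num, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```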

The problem with these competitions is their focus on the “last mile” of machine learning problems. Most contestants will reach the 95th percentile of the scoring metric very quickly, but there might be 1,000 places between the 95th and the 99th. Imagine you are evaluated on prediction accuracy and the best contestant reaches 75%: an experienced machine learning practitioner will probably get to 71% very quickly, but getting to that 75% will be a struggle.

In real life, the “last mile” is very rarely the problem that gets the most attention. An algorithm with 71% accuracy might be just as good as one with 75% accuracy, especially if reaching the higher figure means spending five times the resources. In most cases extracting this last increase in the goal metric is not cost-efficient unless the problem is extremely valuable from an economic perspective; the dollar cost of having a senior machine learning engineer or data scientist tackle the “last mile” is simply not worth it for most companies.

Under real-life circumstances the most important factors often have less to do with how “good” the algorithm is and more to do with how practical it is: how easy it is to get the data to run it, how easy it is to retrain and – very commonly – how easy it is to deploy. It might be more desirable to use an algorithm that is simple to deploy and retrain – and that delivers 95% of the prediction score of the best state-of-the-art solution – than a far more complicated one that gives you a little bit more at the expense of a lot of additional time and complexity.

[Figure: A snapshot of the leaderboard for a Kaggle fraud-detection competition. Position 1,000 is already at 98% of the value of the top score.]

Another key aspect of real-life model building is data gathering and curation. In a Kaggle-style competition the decisions about what data to gather and how to curate it have mostly been made already, while in real life deciding exactly what data to measure, use and take into account is a critical part of the model development process. It is usually far easier to improve results by choosing better variables to measure than by squeezing more information out of a sub-optimal set of values. Feature selection – in the sense of what information is extracted from the real world to make predictions – is perhaps the most critical aspect of building models.
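The decision of what to measure happens before any data exists, so it cannot really be captured in code, but even within a fixed dataset you can get a feel for how unevenly informative variables are. A minimal sketch, reusing the hypothetical num features and y label from the example above, scores each feature by its mutual information with the target:

```python
from sklearn.feature_selection import mutual_info_classif

# Higher scores suggest a variable carries more signal about the label;
# in real projects the bigger question is which variables to collect at all.
mi = mutual_info_classif(num, y, random_state=0)
for name, score in sorted(zip(num.columns, mi), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```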

In the end, these competitions are a great tool for someone who is new to data science – or for anyone who enjoys the “last mile” problem – but they will not, by themselves, make anyone a real-world data scientist. They are also a great hobby, especially if you want to stay in touch with the latest Python/R libraries for machine learning.
