If there is one big difference between theoretical and real-world exercises in data science – especially as it pertains to machine learning – it's the presence of a much larger and more pervasive array of bias sources. The effects of these biases range from making predictions slightly worse to making them extremely discriminatory in real-life scenarios.
Take, for example, this credit card fraud dataset available from Kaggle. The set provides us with unnamed variables and a label for whether each transaction was fraudulent or not. We have no idea what these variables represent, and all we are asked to do is maximize how accurately we can identify fraudulent transactions. A data scientist focused solely on machine learning metrics would have absolutely no problem with this: they just need inputs and outputs, and they will tell you to use variable X to predict credit card fraud because that is what gives the best [insert performance metric here].
Now what if this variable happens to be the distance between the origin of a transaction and the closest low-income neighborhood? You've just told the credit card company to discriminate against poorer people. I tend to refuse to work on data that has no description for this exact reason. There are both practical and moral decisions that must go into data mining when solving real-world problems. Not all variables will be good at describing credit card fraud, but not all variables that are mathematically good at describing credit card fraud should be used for that purpose. The reasons to use or discard data can and should go beyond the math.
There are also cases where the moral implications are smaller but the practical implications are bigger. For example, if a company wants to build a machine learning model to predict whether a machine will fail, it might be tempted to add as much measuring equipment as it can. It might take two million measurements per hour to attack the problem of a machine failing once every three months (after all, the more the merrier, right?). The company has just potentially added a huge amount of noise to the data, which can make truly useful predictions much harder to obtain than they should be. More data is not always better; it can be much worse. If any addition of data improved a model, we would just use random number generators to add new inputs all day long.
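You can see this effect in a toy experiment. The sketch below is an illustrative assumption, not anything from the original post: it compares cross-validated accuracy on a small dataset with genuinely informative features against the same dataset with hundreds of pure-noise columns bolted on, mimicking "measure everything" instrumentation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# A small dataset where all 10 features carry real signal.
X, y = make_classification(n_samples=200, n_features=10, n_informative=10,
                           n_redundant=0, random_state=0)

# The same dataset with 500 extra columns of pure random noise,
# standing in for indiscriminate extra sensors.
X_noisy = np.hstack([X, rng.normal(size=(200, 500))])

model = LogisticRegression(max_iter=1000)
clean_acc = cross_val_score(model, X, y, cv=5).mean()
noisy_acc = cross_val_score(model, X_noisy, y, cv=5).mean()
print(f"clean: {clean_acc:.3f}  noisy: {noisy_acc:.3f}")
```

With far more columns than training rows, the model starts fitting the noise, and the cross-validated score on the noisy version drops below the clean one.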
The problem of what to measure is just as important as building the predictive models. A data scientist must be able to address the question of what data needs to be gathered and used, not only the question of how to turn the data received into a predictive model. This is a much harder problem that often requires domain expertise, which is probably why people with broad modeling backgrounds – such as physicists – tend to make better industry data scientists than people with purely IT or mathematical skills. In the end, predictive models are models of the world, and deciding what aspects of the world to measure is perhaps more important than deciding how to use those measurements to create models.
The nature of the problem can also create substantial sources of bias. For example, a naive data scientist might treat time-dependent data the same way as time-independent data. With time-dependent data it is usually a bad idea to employ traditional cross-validation techniques – because shuffled folds let the model use information about the future that was never available in the past – which leads to models that look great in validation but are terrible at predicting new events in the series. Sports betting is a clear example: try building a model with traditional cross-validation and see how you do (just don't bet anything!).
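The leakage is easy to demonstrate. In this sketch (an assumed toy setup, not from the post), a drifting series is scored two ways: shuffled K-fold, which lets the model interpolate between past and future points, and scikit-learn's forward-chaining `TimeSeriesSplit`, which only ever validates on the future.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
t = np.arange(300)
# A drifting series: the target level keeps moving upward over time.
y = 0.05 * t + rng.normal(scale=0.5, size=300)
X = t.reshape(-1, 1)

model = KNeighborsRegressor(n_neighbors=5)
# Shuffled folds scatter future points into every training set.
shuffled = cross_val_score(model, X, y,
                           cv=KFold(5, shuffle=True, random_state=0)).mean()
# Forward-chaining folds train on the past, test on the future only.
forward = cross_val_score(model, X, y, cv=TimeSeriesSplit(5)).mean()
print(f"shuffled K-fold R^2: {shuffled:.2f}  forward-chaining R^2: {forward:.2f}")
```

The shuffled score is wildly optimistic because neighbors in time leak across folds; the forward-chaining score reflects the model's actual (poor) ability to extrapolate, which is the question a sports bettor cares about.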
The above are only a handful of cases, but I hope they highlight how real-world problems are often more complicated than the traditional (input|output -> model -> result) pipeline that is common among people newer to the field. If you want to get better at real-world data science, it is important to become much better at identifying sources of bias. I certainly still have a lot to learn here, but just being aware of these problems is a huge step forward.