Intelligently dealing with missing values: Just ask why

If one thing is common to all data scientists, it’s the need to deal with missing values. Regardless of the field you work in or the problems you tackle day to day, sooner or later you will encounter that “NaN” in your database. Dealing with missing values is one of the most important problems in data science, so it is up to us to find the best way to handle it. In today’s post I want to talk about a few ways to deal with missing values and, specifically, how missing values themselves carry information and how we can capture it.


Missing values, those empty slots in your database, can happen for a wide variety of reasons. Understanding these reasons is critical to solving the problem, since the missing values might contain information themselves. Is the data point missing because it could not possibly exist, or simply because the evidence was never collected? Is it missing at random, or not?

Imagine a sensor that registers temperature every minute but can fail to record a reading for two different reasons. The first is a glitch in the program code that causes the recording process to fail randomly 5% of the time; the second is the temperature going above the largest reading that the sensor can register. In the “random” mode of failure, the missing record can be safely interpolated; in the second mode of failure, an interpolation could be dangerous, since the true value lies above anything the sensor recorded. The two modes are different and cannot be treated the same way, yet all we have is NaN values in our database.

In real life, understanding the nature of the variables is critical to filling missing values for precisely this reason. In many cases the problem can be solved by filling NaN values with interpolations, averages, medians, modes, etc. But it’s not about trying different filling methods and picking whichever one leads to the “better predictive model” after the data has been filled; it’s a matter of understanding why a certain filling mechanism might work better than another.
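As a minimal sketch of the simple fills mentioned above, assuming a hypothetical series of minute-by-minute temperature readings (the values and the name `temps` are made up for illustration), pandas can produce each of them in one line:

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings, one per minute, with two gaps.
temps = pd.Series(
    [20.0, 21.0, np.nan, 23.0, np.nan, 25.0],
    index=pd.date_range("2024-01-01 00:00", periods=6, freq="min"),
    name="temperature",
)

# Three simple ways to fill the same gaps, side by side.
filled = pd.DataFrame({
    "interpolated": temps.interpolate(method="linear"),  # uses neighbouring readings
    "mean": temps.fillna(temps.mean()),                  # one global value for every gap
    "median": temps.fillna(temps.median()),              # one global value, robust to outliers
})
print(filled)
```

Which column is appropriate depends on why the readings are missing. Interpolation only makes sense if the gap is unrelated to the value that would have been recorded.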

The presence of missing data might itself contain information. In our temperature example we can start understanding the problem by one-hot encoding a variable that represents whether data is missing or not (a new binary vector). We could then try to build a predictive model of whether or not we’re missing data using a lagged version of our initial data series. If we use a model like a decision tree, we should see a split that predicts failure once the temperature has reached high values, which would hint that a predictable mode of failure, rather than a random one, is dominant.
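The idea can be sketched with simulated data. Everything below is an assumption made for illustration: the sine-wave temperature, the sensor limit `SENSOR_MAX`, and the choice of "last value actually recorded" as the lagged feature (a plain lag would itself be NaN during a run of missing readings).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical sensor: a slow temperature cycle plus noise, one reading per minute.
n = 5000
t = np.arange(n)
true_temp = 30 + 15 * np.sin(2 * np.pi * t / 500) + rng.normal(0, 0.5, n)

SENSOR_MAX = 40.0                    # assumed upper limit of the sensor
saturated = true_temp > SENSOR_MAX   # non-random failure: reading is off the scale
glitch = rng.random(n) < 0.05        # random failure: 5% of readings are lost

recorded = pd.Series(np.where(saturated | glitch, np.nan, true_temp), name="temp")

# Binary indicator: 1 where the reading is missing, 0 otherwise.
is_missing = recorded.isna().astype(int)

# Lagged feature: the last value actually recorded before each time step.
last_seen = recorded.ffill().shift(1)

data = pd.DataFrame({"last_seen": last_seen, "is_missing": is_missing}).dropna()

# A shallow tree is enough to expose a threshold-like failure mode.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data[["last_seen"]], data["is_missing"])
print(export_text(tree, feature_names=["last_seen"]))
```

In this simulation the first split of the tree lands just below the sensor limit, which is the signature of a failure that depends on the temperature itself. If only the random glitch were present, no value of `last_seen` would separate missing from non-missing readings.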

The first question you should ask is not “how can I fill the missing data?” but “why am I missing the data?” The fact that you’re missing the data might tell you important things. Analyzing one-hot encoded indicators of missing data points can be very enlightening, not only in time series analysis but in all sorts of data.

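import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("inputs/train_identity.csv").set_index("TransactionID")

# Keep the target column ("id_03") plus every other column that has no missing values
df0 = pd.concat([df["id_03"], df.drop(["id_03"], axis=1).dropna(axis=1)], axis=1)

# Encode the categorical column "id_12" as integers so the KNN model can use it
le = LabelEncoder()
c = "id_12"
df0[c] = df0[c].fillna("NaN")
le.fit(df0[c].unique())
df0[c] = le.transform(df0[c])

# Rows where "id_03" is known are used to train the model
train = df0.dropna()
x_train = train.drop(["id_03"], axis=1)
y_train = train["id_03"]

# Rows where "id_03" is missing are the ones we want to predict
x_test = df0[np.isnan(df0["id_03"])].drop(["id_03"], axis=1)

m = KNeighborsRegressor().fit(x_train, y_train)

# Replace the missing "id_03" values with the KNN predictions
df0.loc[np.isnan(df0["id_03"]), "id_03"] = m.predict(x_test)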

If you have other clean inputs to rely on, think about filling the missing data intelligently: the literature is full of techniques for predicting missing data with machine learning. The above code shows a (very) simplified example of how to do this using a KNN regressor in sklearn. For this exercise I used a data column (“id_03”) from the fraud dataset available on Kaggle. This dataset looks like Swiss cheese, so it is a good example to practice filling data using ML methods.

A predictive approach to data filling is useful because it allows us to fill data in a non-linear manner. These methods often work better than naive approaches, like using a mean or a median, if there are enough additional data columns to support the prediction. However, adding one-hot encoded columns that record where values were missing in the first place can also be very valuable, since the mere fact that a data point is missing might point to something useful. For example, a missing timestamp in a log entry database might be strongly indicative of tampering, while filling it, in any way whatsoever, completely removes that information.
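One way to keep that information is to record where values were missing before filling anything. The following is a minimal sketch, assuming a hypothetical log table (the column names `response_ms` and `bytes_sent` are made up) and a simple median fill standing in for whatever filling method you prefer:

```python
import numpy as np
import pandas as pd

# Hypothetical log table with gaps; the column names are invented for illustration.
logs = pd.DataFrame({
    "response_ms": [120.0, np.nan, 95.0, 110.0, np.nan],
    "bytes_sent": [512.0, 2048.0, np.nan, 1024.0, 256.0],
})

# Record where values were missing BEFORE filling anything.
indicators = logs.isna().astype(int).add_suffix("_was_missing")

# Fill the gaps (a median here; a predictive model could be used instead).
filled = logs.fillna(logs.median())

# The model sees both the filled values and the fact that they were filled.
features = pd.concat([filled, indicators], axis=1)
print(features)
```

A downstream model can then learn from the `_was_missing` columns directly. scikit-learn's `SimpleImputer` offers a similar option through its `add_indicator` parameter.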

So, WHY are you missing that data?
