If one thing is common to all data scientists, it’s the need to deal with missing values. Regardless of the field you work in or the problems you tackle day to day, there will always be a time when you encounter that “NaN” in your database. Dealing with missing values is a very important problem in data science, and it is therefore up to us to figure out the best way to handle them. In today’s post I want to talk about a few ways to deal with missing values and, specifically, about how missing values themselves represent information and how we can capture it.


Missing values, those empty slots in your database, can happen for a wide variety of reasons. Understanding these reasons is critical to solving the problem successfully, since the missing values might contain information themselves. Is the data point missing because its existence was impossible, or simply because the evidence was never collected? Is it random, or is it not?

Imagine a sensor that registers temperature every minute but can fail to record a reading for two different reasons. The first is a glitch in the program code that causes the recording process to fail at random 5% of the time; the second is the temperature going above the largest reading the sensor can register. Under the random mode of failure the missing record could be safely interpolated, while under the second mode an interpolation could be dangerous. The two are different and cannot be treated the same way, yet all we see in our database is NaN values.
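To make the difference concrete, here is a small simulated sketch (my own toy example, not real sensor data): a smooth temperature cycle where readings are dropped both at random and whenever the sensor saturates, followed by a naive linear interpolation.

import numpy as np
import pandas as pd

# Toy illustration: a smooth daily temperature cycle sampled once per minute,
# with the two failure modes described above.
rng = np.random.default_rng(0)
minutes = pd.date_range("2020-01-01", periods=24 * 60, freq="min")
true_temp = 25 + 10 * np.sin(np.arange(len(minutes)) * 2 * np.pi / (24 * 60))

recorded = pd.Series(true_temp, index=minutes)

# Failure mode 1: the recorder randomly drops ~5% of readings.
random_fail = rng.random(len(recorded)) < 0.05
# Failure mode 2: the sensor saturates and records nothing above 32 degrees.
saturated = recorded > 32

recorded[saturated | random_fail] = np.nan

# Linear interpolation looks harmless for the random gaps but systematically
# underestimates the saturated peaks, because the missing values there are by
# construction higher than anything the sensor could register.
filled = recorded.interpolate()
print("True mean of saturated minutes:    ", true_temp[saturated.values].mean())
print("Interpolated mean of those minutes:", filled[saturated].mean())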

In real life, understanding the nature of the variables is critical to filling missing values for precisely this reason. In many cases the problem can be solved by filling NaN values with interpolations, averages, medians, modes, etc. It’s not about jumping between different filling methods and picking whichever one leads to the “better predictive model” after the data has been filled; it’s a matter of understanding why a certain filling mechanism might work better than another.

The presence of the missing data itself might contain information. In our temperature example we can start understanding the problem by one-hot encoding whether each reading is missing or not (a new binary vector); we can then try to build a predictive model of whether or not we’re missing data using a lagged version of our initial data series. If we use a model like a decision tree, we will clearly see a split that indicates failure when temperature has reached high values, which hints that a predictable mode of failure, rather than a random one, is dominant.
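As a minimal sketch of this idea – reusing the simulated `recorded` series from the toy example above – we can encode the missingness indicator and fit a shallow decision tree on a lagged copy of the temperature:

from sklearn.tree import DecisionTreeClassifier, export_text

# Encode "is this reading missing?" as a binary target and try to predict it
# from the last available temperature reading.
is_missing = recorded.isna().astype(int)           # the new binary indicator
lagged = recorded.shift(1).ffill().bfill()         # lagged, gap-filled copy

X = lagged.to_frame("lagged_temp")
y = is_missing

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# If the saturation failure dominates, the tree splits near the sensor's
# upper limit; a purely random failure mode would give no useful split.
print(export_text(tree, feature_names=["lagged_temp"]))

The split threshold the tree prints is the tell: a cut close to the sensor’s maximum reading points to a predictable failure mode rather than a random one.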

The first question you should ask is not “how can I fill the missing data?” but “why am I missing the data?”. The fact that you’re missing the data might tell you important things, and one-hot encoding the missing data points and analyzing them can be very enlightening, not only in time series analysis but across all sorts of data.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsRegressor

# Load the identity table of the Kaggle fraud dataset
df = pd.read_csv("inputs/train_identity.csv").set_index("TransactionID")

# Keep the target column ("id_03") plus only the fully populated feature columns
df0 = pd.concat([df["id_03"], df.drop(["id_03"], axis=1).dropna(axis=1)], axis=1)

# Label-encode the categorical column so the KNN regressor can use it
le = LabelEncoder()
c = "id_12"
df0[c] = df0[c].fillna("NaN")
le.fit(df0[c].unique())
df0[c] = le.transform(df0[c])

# Rows where "id_03" is known become the training set
train = df0.dropna()
x_train = train.drop(["id_03"], axis=1)
y_train = train["id_03"]

# Rows where "id_03" is missing are the ones we want to fill
x_test = df0[df0["id_03"].isna()].drop(["id_03"], axis=1)

m = KNeighborsRegressor()
m.fit(x_train, y_train)

# Replace the missing values with the KNN predictions
df0.loc[df0["id_03"].isna(), "id_03"] = m.predict(x_test)
print(df0)

Think about intelligently filling the missing data if you have other clean inputs to rely on; the literature is filled with techniques to help you predict missing data using machine learning. The above code shows a (very) simplified example of how to do this using a KNN regressor in sklearn. For this exercise I used a data column (“id_03”) from the fraud database available from kaggle here. This dataset looks like swiss cheese, so it is a good example for practicing filling data using ML methods.

A predictive approach to data filling is useful, as it allows us to do smart data filling in a non-linear manner. These methods usually work far better than naive approaches – like using a mean or a median – provided there are enough additional data columns to support the prediction. However, adding one-hot encoded columns that flag which values were missing in the first place can also be very valuable, since the mere fact that a data point is missing might point to something useful. For example, a missing timestamp in a log entry database might be extremely indicative of tampering, while filling it – in any way whatsoever – completely removes that information.
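Keeping that information is cheap. A minimal sketch, assuming the same identity table used above, would be something like:

import pandas as pd

# Before filling anything, keep the missingness pattern itself as extra
# one-hot style indicator columns.
df = pd.read_csv("inputs/train_identity.csv")
missing_flags = df.isna().astype(int).add_prefix("was_missing_")
df = pd.concat([df, missing_flags], axis=1)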

so, WHY are you missing that data?

In part 4 of my series of posts about finding where it would be best to open up a pizza place in the US we looked at how we could use google trends data to measure interest for pizza across all the different states in the union. Today I want to address some of the potential biases in the above search by gathering more data related to large cities in the US. This will give us an idea about how state/city data compare and allow us to get a more complete picture of how the relative search interest of people in the US has evolved over time.

Raw google trends data (blue) and yearly moving average (orange) for the top 100 US cities by population (2018)

To build our database I used the pytrends library in the same way that I did in the previous post. However, instead of states, I used data for the top 100 US cities by population as given by wikipedia. I then made a single large plot of the evolution of the relative search interest and its yearly moving average for all 100 cities (arranged by population above), showing the general trends in pizza searches (in this case the google trends term was “Pizza in city”).

Search trends in cities match the trends we’ve observed for states. Almost all US cities show a decline in search interest for the term “Pizza in city” when we consider the yearly moving average; for 89% of these cities, the one year moving average is lower than it was a year ago. The image below – which shows the change in the yearly moving average from one year ago to the present – makes this clear as well.
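For reference, the 89% figure can be computed along these lines, assuming a DataFrame (here called `cities_df`, a name I’m using purely for illustration) that holds one weekly google trends column per city:

# One year moving average per city, then its change versus 52 weeks ago
ma = cities_df.rolling(52).mean()
change = ma.iloc[-1] - ma.iloc[-53]

# Fraction of cities whose yearly moving average declined over the past year
share_declining = (change < 0).mean()
print("Share of cities with a declining yearly MA: {:.0%}".format(share_declining))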

Among the cities with a positive change in interest, the most positive is Saint Paul, Minnesota, with an increase of around 20% in the one year average over the past year. This is in stark contrast with the worst city – St. Louis, Missouri – which dropped by more than 25% over the same period. Increases in interest decay sharply after Saint Paul, with the second and third cities being Norfolk, Virginia and Boise, Idaho (where I currently live!).

Even though the above might appear somewhat positive for these cities, a more detailed look at their charts shows that the evolution of their interest is not that positive overall; their recent increases mostly correspond to mean reversions after falls in interest during the preceding years.

2016 to present change in the one year moving average

Looking at the all-time historical picture, the overall dynamic appears even darker, with only 4 cities showing increases in their yearly moving average from 2016 to the present, as shown in the last image above. Those 4 cities are Detroit, New York, Nashville and Chicago – all of them significantly touristic and two of them extremely well known for their pizza styles.

After looking at this google trends data on relative search strength for all US states and their largest cities, it’s becoming clear that the general trend in very broad pizza searches is to the downside. It makes me wonder whether the question should be “where should I open my pizza place?” or “should I even open up a pizza place?”. This might be one of those instances where the data is asking us to reformulate the question, since it might be showing that our initial assumption – that there is an obvious and strong interest in pizza somewhere – might not actually be true.

There is of course the possibility that we’re reading too much into the data. Real interest in pizza might not be directly related to google searches – people might go to pizza places they know about via word of mouth or that they see opening up – so we might want to look for data that is unrelated to what people search for and instead related to what people experience. Social networks like twitter might offer us some data to help test that hypothesis. I’ll look into this in my next post!

In the third part of this series of posts we talked about biases within the data and why a change of dataset was needed to tackle the “pizza restaurant” problem. Today I’m going to talk about how I solved this bias problem by building a new dataset from an entirely different angle, using google trends. I will walk you through how I built the data as well as some of the information I found within it.

from pytrends.request import TrendReq
import pandas as pd
from tqdm import tnrange

states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
  "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
  "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

all_df = []

for i in tnrange(len(states)):
    state = states[i]
    pytrends = TrendReq(hl='en-US', tz=360)
    kw_list = ["Pizza in {}".format(state)]
    pytrends.build_payload(kw_list, cat=0, timeframe='today 5-y', geo='', gprop='')
    df1 = pytrends.interest_over_time()
    df1.columns = ["T_{}".format(state), "isPartial"]
    # Drop the last, still-partial week of data
    df1 = df1[df1["isPartial"] == False]
    # One year (52 week) moving average of the weekly interest values
    df1["MA_{}".format(state)] = df1["T_{}".format(state)].rolling(52).mean()
    df1 = df1.drop(["isPartial"], axis=1)
    all_df.append(df1)

all_df = pd.concat(all_df, axis=1)

The original data I wanted to use had two main problems. First, it was not representative enough of US pizza restaurants, and second, it did not contain information about restaurant age. These two problems introduced an insurmountable amount of bias that prevented any inferences at reasonable confidence levels. Given that all the freely available restaurant datasets I could find had the same problems, I decided to change how I tackled the problem.

I thought about going for a more indirect answer to the question. Given that I could not find free data to answer “where should I open my new pizza place?” using data from pizza restaurants themselves, I thought I could look for data representing how much people in different areas want pizza. The simplest way to do this seemed to be the google trends tool, which I used to create a database of how people have searched for pizza in their area as a function of time. Google trends basically provides the relative search frequency of a term as a function of time.

Using the code above, which relies on the pytrends library, I obtained the relative search frequency of the term “Pizza in state” for all the states in the US. Note that this means the search must contain the terms “Pizza”, “in” and “state”; it does not exclude other terms, nor does it require them to appear in that order. The image above shows the time series for each state from 2014 to the present (in blue) plus the one year moving average of these values. As the raw values are naturally noisy, the one year moving average more easily shows us the long term trends within the data.

Since these values are relative (each goes from 0-100), it is not very useful to compare them to each other in absolute terms (a 50 in Nebraska might be equivalent to just a 5 in New York), but it is very useful to look at whether the rolling average values have tended to increase or decrease during the past year. If we calculate the one year difference of the one year moving average and plot it, we get the chart shown below.
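Something along these lines produces that chart from the `all_df` frame built by the code above (the plotting details here are a sketch, not my exact figure code):

import matplotlib.pyplot as plt

# Take the moving-average columns, compute their change versus 52 weeks ago
# and plot the latest value of that change per state, sorted.
ma_cols = [c for c in all_df.columns if c.startswith("MA_")]
yearly_change = all_df[ma_cols].diff(52).iloc[-1].sort_values()
yearly_change.index = [c.replace("MA_", "") for c in yearly_change.index]

yearly_change.plot(kind="barh", figsize=(8, 12))
plt.xlabel("Change in 1y moving average vs. one year ago")
plt.tight_layout()
plt.show()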

The graph above is very interesting, as it shows that searches for “Pizza in state” have mostly been declining across the board, with only 14 states showing a steady increase in their rolling yearly average during the past year. This is also clear in the “per state” plots, where it’s easy to see that the majority of lines are slowly declining.

Interestingly, the states at the top of our graph overlap significantly with the ten states that have the most pizza restaurants in the US (see here). This is especially true of West Virginia, which has the most restaurants per capita and appears second in our plot. If we normalized by population, the difference between them would be even larger, as South Dakota has a population just shy of 900 thousand while West Virginia is closer to 1.8 million.

The above does not necessarily mean that South Dakota is a far better place to open up a pizza restaurant, but it does show that interest in pizza in this state has been increasing at the fastest pace – in relative terms – of all the states in the union, at least during the past year, as measured by relative internet search frequency. Even though this google trends dataset cannot offer a definitive quantitative answer to the question, it does help us discard a lot of states where interest has been declining heavily in recent times (such as California and Utah).

It is also worth noting that this dataset – like all others – is not free of bias. One important problem, even if we only care about interest in pizza, is that the search “Pizza in state” might be skewed when very large cities are present in the state (someone in LA is probably much more likely to google “Pizza in LA” than “Pizza in California”). There is also a problem when state and city names overlap (“Pizza in Washington”, for example). What we are interpreting as a lack of interest might therefore just be a shift towards more specific ways of searching for pizza places.

I will try to expand this google trends technique using the names of the biggest cities and capitals in the US. This will allow us to better understand the city vs state bias and clarify whether West Virginia and South Dakota really are great places to open up pizza restaurants.

In part 2 of this post series we talked about how we could use web crawling to populate a pizza restaurant database with a proxy for “restaurant success”. Today I want to expand this series by discussing some important issues with the current dataset that we need to consider very carefully when trying to use this data for real world predictions of restaurant success.

Pizza restaurants in the database by province

The free dataset I have used for this small project is this dataset available from kaggle, with menu and location information for lots of US pizza restaurants. Although the data appears to be rather homogeneously distributed across the US – as we saw in part 1 of the series – the amount of data per state or zip code is actually quite small, with only a handful of examples available for many states and, in some cases, only a single restaurant for a given state. The graph above shows the number of restaurants per province for provinces that contain at least 5 restaurants.

The above problem makes inferences difficult, since picking just 5 restaurants from a given state could skew any conclusions very heavily. For example, only restaurants from the capital might be sampled in one province, while in another we might have restaurants from the capital plus some additional cities. Since each state is bound to have a different distribution of success, having such a small number of samples is no recipe for success. We can already see that our data does not even represent the national frequencies properly: we have a lot of data for Texas but almost none for West Virginia, even though that is the state with the most pizza places per capita in the US (see here).

Another problem, which is just as serious, is that failure probability is not homogeneous over the life of a pizza parlor. Most pizza places fail during their first 2 years, while pizza places that have been open for longer have a dramatically smaller probability of failure. This is true of many businesses: a business that manages to survive the short term has basically found a market in which it is sustainable and can therefore be expected to survive for far longer.

Probability of a restaurant being open, by province (per the dataset, labeled via web crawling of google’s open/closed status for each restaurant)

The problem, then, is that our proxy for success (whether a restaurant is open or not) is not useful if we don’t know the age of the pizza place, since we could just be looking at places where more restaurants have closed simply because more have opened recently. This is likely why the probability of a restaurant being closed is higher in Washington and Nevada, places where restaurants have been opening at a higher rate over the past couple of years. Places where few new pizza places are opening are then likely to appear more successful, when in reality it’s just that no one is taking the risk of opening a new place.

I would definitely not expect North Carolina or Indiana to be the best places to open pizza restaurants (which is what this dataset is telling me); this is most likely a consequence of the age bias and the lack of proper sampling described above. If they really are, then we should be able to confirm it with an analysis that accounts for the sources of bias mentioned.

How could we then get a better and more realistic picture of where it’s best to open up a pizza restaurant? I have several ideas right now about how to construct a database that is more reliable and that can actually use data that goes beyond the two above bias problems. Since we’re keeping this entirely free – as per my friend’s request – it’s not very easy, but whenever things are difficult, I’ve found it’s best to be more creative.

In the first part of this series of posts we explored how we could find data to attempt to predict the best zip code to open up a pizza restaurant in the US. We ended up with a database containing thousands of pizza places, with location, name and menu information. However this data is incomplete, as it basically describes restaurants but says nothing about their “success”. Today I will show you how we can actually do some web crawling to generate a variable that we can use as a proxy for restaurant success.

import requests
import numpy as np
import pandas as pd
from time import sleep
from tqdm import tnrange

df1 = pd.read_csv("pizza_db.csv")

# A browser-like header so google does not immediately reject the requests
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) '
                  'Gecko/20100101 Firefox/61.0'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

df1["is_open"] = 1

for i in tnrange(len(df1)):
    # Search google for "<restaurant name> <zip code>"
    q = "{} {}".format(df1["name"].iat[i], df1["postalCode"].iat[i])
    q = q.replace(" ", "+")
    url = "https://www.google.com/search?q={}".format(q)
    html = requests.get(url, headers=headers)
    # A restaurant counts as open unless google flags it as permanently closed
    df1["is_open"].iat[i] = int("Permanently closed" not in html.text)
    print(df1["is_open"].iat[i])
    # Random pause between requests to avoid being blocked
    sleep(np.random.randint(1, 6))

After building a database that is decently descriptive and representative of pizza restaurants, our next challenge is to figure out which variables are related to “restaurant success”. But what does “success” actually mean? We must first establish some measure of what we consider success so that we can determine whether pizza restaurants are more or less likely to be successful in one location compared to another.

Ideally we would want something very tangible, like net pretax income for the pizza places, their number of orders or similar information. However, this information is all private – and hence impossible to access at any scale – so we need to settle for a much simpler metric. The easiest one – as it is publicly available – is whether the restaurant is open or closed. At the very least, a successful restaurant must be open.

The problem with this information is that it’s not in our database, so we need to construct it from scratch. To do this I devised a very simple web crawler – the code above – that performs systematic google searches in order to populate the “is_open” variable in the dataframe. The crawler searches for the name of the restaurant plus the zip code and then checks whether the phrase “Permanently closed” appears in the search results. This is all we need, since google will display these words in the review snippet whenever the restaurant has been reported closed.


Of course, it’s not that simple. Google does not like systematic usage of their search engine for this type of database construction, so they will block you if you crawl their search engine massively. To avoid this we must use a proper header (this is why I specified a “User-Agent” line) and introduce a random lag between requests. This works for a small case like ours – just 3510 searches – but for any more complicated crawling you’ll need to do some additional work to prevent being blocked. This is a nice article explaining how to avoid blocking while doing this type of crawl.

In the end we obtain a database with a lot of descriptive variables and a single label indicating success (whether the restaurant is open or closed). In the next post in this series I’m going to share an initial analysis of how location relates to success and whether or not we can construct simple location-based models to estimate our probability of having a successful pizza restaurant using only location data. We will also see how we can expand the meaningfulness of our location data using information from other databases (such as the US census).

Practical business questions are very important for data science, since they represent the ultimate test of what a data scientist’s job will be at private companies. Taking up my free consulting offer, a friend asked me to try to answer a simple question: what is the best zip code to open up a pizza restaurant in the United States? He provided no data, but wanted to know if I could give him any information using freely available data.

Location of pizza restaurants from the Datafinit dataset (some are even in Alaska or Hawaii)

I decided to take on the challenge and find out if I could give him any actionable data insights regarding pizza restaurants in the US. Today I want to talk about the first part of this project, finding the data, to show you some of the problems when searching for data and some of the ways you can find useful information.

The first potentially useful information I came across was a dataset released by yelp containing information about the restaurants reviewed on their site. The yelp dataset is a subset of all yelp data, and it contains a wealth of information about restaurants, including their ratings, whether they are open or closed, how many reviews they have and where they are currently located.

The yelp database contains information about 1135 US pizza restaurants (make sure you filter by US zip codes), and since we know there were 76,993 registered pizza restaurants in the US in 2018, this sample represents around 1.5% of all such restaurants. Assuming it was drawn randomly from the entire population, we should be able to make inferences within a reasonable margin of error, especially if we avoid making inferences about areas that are not broadly represented in the data set.
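As a quick back-of-the-envelope check (my own, not part of the dataset documentation), the worst-case 95% margin of error for a proportion estimated from 1135 samples is around 3%:

import math

# Worst-case (p = 0.5) 95% margin of error for a proportion with n samples
n = 1135
margin = 1.96 * math.sqrt(0.5 * 0.5 / n)
print("Worst-case 95% margin of error: +/- {:.1%}".format(margin))  # ~ +/- 2.9%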

However, looking more carefully at the pizza restaurant data, we see that only five states are present (‘PA’, ‘NV’, ‘OH’, ‘NC’, ‘WI’), which makes it obvious that we don’t have a random draw over all US pizza restaurants but a fairly specific sampling of some states. This means the yelp database can probably tell us a lot about pizza restaurants in these states, but definitely not about the US as a whole. This is a shame, as the yelp database contains information about restaurants closing, which we could use as a proxy for failure.

Google also shows review information for restaurants along with open/close status

We also have other databases, like this one by Datafinit, which provides a much more detailed and specific description of pizza restaurants in the US. This database contains a much better random draw of US restaurants – as evidenced by the map in the first image of the post – so we can make inferences from it with more comfort. The data is also tagged with longitude and latitude, so we can use this information to derive US zip codes and then cross-reference them with other zip-coded databases (such as the US census) to obtain information that is very relevant but not obtainable from this data alone (like the number of pizza restaurants normalized by the number of people who live there). Of course there are biases present in this database – for example, restaurants that had closed before it was built were not included – so we need to keep that in mind when we use it for analysis.
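As a rough sketch of that cross-referencing step – with made-up file and column names, since the census extract would have to be prepared separately – the per-capita normalization is just a merge on zip code:

import pandas as pd

# Hypothetical inputs: the restaurant table with derived zip codes and a
# zip-coded census table with a population column.
restaurants = pd.read_csv("pizza_db_with_zips.csv")        # assumed to exist
census = pd.read_csv("census_population_by_zip.csv")       # columns: zip, population

# Count restaurants per zip code and normalize by population
counts = restaurants.groupby("postalCode").size().rename("n_restaurants").reset_index()
merged = counts.merge(census, left_on="postalCode", right_on="zip", how="left")
merged["restaurants_per_100k"] = 1e5 * merged["n_restaurants"] / merged["population"]
print(merged.sort_values("restaurants_per_100k", ascending=False).head())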

The main problem with the above data is that it contains no proxy for success – we just have pricing and menu information – so it does not allow us to create a model that might predict the success of a pizza place; it merely allows us to learn more about pizza places (what is generally known as unsupervised learning).

However, the useful thing about the above database is that we now have the zip code and name of a bunch of pizza restaurants. We could use google to obtain information about these places by performing web crawling and looking at the output of the search engine. The second image in this post shows that google gives us reviews as well as open/closed status for a restaurant if we search for both its name and zip code. If we wanted to spend a small amount of money (around 50 USD), we could also buy a database containing names and locations for a lot of pizza restaurants, which would make the crawling exercise much more likely to give useful results (note that I have no affiliation whatsoever with this data seller).


Although the above wasn’t an exhaustive data search by any means – and I look forward to any other sources or suggestions in the comments – it did give us an initial idea of where we can find information about pizza places and how we might use it to better understand the problem and eventually create a model to predict pizza place success. Since most restaurants close very early in their lifetimes, the ability to predict whether a business will remain open – especially with a very low false positive rate – will go a long way in helping us open a restaurant with higher chances of success.

I hope you liked this exercise in data searching. When solving real world data problems you want to be creative: stitch together data sources, find new information and use the information you have creatively. In the next part we will look at how our data looks after we do some web crawling, and get some initial insights about pizza and population data.

If there is one big difference between theoretical and real world exercises in data science – especially as it pertains to machine learning – it’s the presence of a much larger and often more pervasive array of bias sources. The effect of these biases can range from making predictions slightly worse to making them extremely discriminatory under real life scenarios.

Take for example this credit card fraud dataset available from kaggle. This set basically provides us with unnamed variables and how they relate to a transaction being fraudulent or not. We have no idea what these variables represent, and all we care about is maximizing how correctly we can identify whether a transaction is fraudulent. A data scientist focused solely on the metrics of machine learning success would have absolutely no problem with this: they just need inputs and outputs, and they’ll tell you to use variable X to predict credit card fraud because that’s what gives the best [insert performance metric here].

Now what if this variable happens to be the distance between the origin of a transaction and the closest low income neighborhood? You’ve just told the credit card company to discriminate against poorer people. I tend to refuse work on data that has no description for this exact reason. There is both a practical and a moral decision involved in how data gets used when solving real world problems. Not all variables will be good at describing credit card fraud, but not all variables that are mathematically good at describing credit card fraud should be used for this purpose. The reasons to use or discard data can and should go beyond math.

There are also cases where the moral implications are smaller but the practical implications are bigger. For example, if a company wants to create a machine learning algorithm to predict whether a machine will fail, they might be tempted to add as much measuring equipment as they can. They might take 2 million measurements per hour to try to predict a failure that happens once every 3 months (after all, the more the merrier, right?). The company has just potentially added a huge amount of noise that can make building truly useful predictions much harder than it should be. More data is not always better; it can be much worse. If any addition of data were better, we would just use random number generators to add new inputs to our models all day long.
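A quick toy demonstration of this point (synthetic data, nothing to do with any real machine): append pure-noise columns to a dataset and watch the cross-validated score stay flat or degrade.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A small synthetic classification problem with only 5 truly informative features
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)

for n_noise in [0, 50, 500]:
    # Append n_noise columns of pure random noise to the real features
    noise = rng.normal(size=(X.shape[0], n_noise))
    X_aug = np.hstack([X, noise])
    score = cross_val_score(LogisticRegression(max_iter=2000), X_aug, y, cv=5).mean()
    print("{} noise features -> mean CV accuracy {:.3f}".format(n_noise, score))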

The problem of what to measure is just as important as building the predictive models. A data scientist must be able to address the question of what data needs to be gathered and used, not only how to turn the data received into a predictive model. This is a much harder problem that often requires some domain expertise, which is probably why people with broad understanding abilities – such as physicists – tend to make better data scientists for industry than people with purely IT or mathematical skills. In the end, predictive models are models about the world, and deciding what aspects of the world need to be measured is perhaps more important than deciding how to use those measurements to create models.

The nature of the problem can also create substantial sources of bias. For example, time-dependent data might be treated by a naive data scientist in the same way as time-independent data. With time-dependent data it is usually a bad idea to employ traditional cross-validation techniques – because you might be using information about the future that was never available in the past – which can lead to models that are terrible at predicting new events in the data series. Sports betting is a clear example where this happens: try building a model with traditional cross-validation and see how you do (just don’t bet anything!).
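If you do want to cross-validate time-dependent data, something like sklearn’s TimeSeriesSplit keeps the training folds strictly in the past. A small sketch of the difference:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit, KFold

# 20 "time-ordered" samples: index position stands in for time
X = np.arange(20).reshape(-1, 1)

print("KFold (shuffled folds can leak the future):")
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    print("  train max index {} | test indices {}".format(train_idx.max(), test_idx))

print("TimeSeriesSplit (training always precedes validation):")
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("  train max index {} | test indices {}".format(train_idx.max(), test_idx))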

The above are only a handful of cases, but I hope they highlight how real world problems are often more complicated than the traditional (input|output -> model -> result) pipeline that is common among people who are newer to the field. If you want to get better at real world data science, it is important to become much better at identifying sources of bias. I certainly still have a lot to learn about this, but just being aware of these problems is a huge step forward.

I got into data science for the results. I approached data science not because I found it enticing or interesting; I approached it because I had a problem to solve. For me data science was the chisel and the hammer: the tool I needed to carve the sculpture I wanted to make. I did not particularly like hammers or chisels, but I loved the idea of making a beautiful sculpture.

My case is probably not typical. It is not very common for people to have a problem and go into data science to solve it, but it is now common for people to go into data science the way they go into calculus in high school. To go in with the idea of learning the toolkit without knowing what they’ll be making with it. To learn how to hammer and to chisel with no particular aim but to hammer and to chisel.


This can create an obsession with improving the tools and with the metrics that tell you how good they are. In data science it creates obsessions with scoring metrics on standardized data sets. It ends up confusing the goal of making better tools with the goal of making sculptures. In the real world, however, you’re usually not faced with the problem of improving tools; you’re given a job to do with whatever tools you have available.

The shock of changing from academia to real world problems in data science can be quite big. In real life people don’t care too much about how fancy your model is, or how much you can improve your cross-validation scores. In the real world what counts is whether your models can accurately predict the future: future ad revenue, future sales numbers, future fires, future crime rates, future whatever – it’s all about the future. The elusive ability to “add real value” is what the people who hire data scientists want more than anything else.

And as the proverb goes, it’s very hard to make predictions, especially about the future. The future carries uncertainty, and being able to successfully predict it requires you to be very wary of all the sources of bias you might carry into model building and data analysis – biases that can be mostly immaterial when you’re doing academic research or just learning from curated, vanilla problems.

To those of you who want to eventually use data science skills for real world applications, my invitation – which is what I’m trying to do with this blog – is to leave the iris data set behind and, even if you know nothing about data science, start with a real life question that you want to answer. Think about the sculpture you want to make, then pick up the hammer and the chisel and go for it. It will take time, but learning to use tools with a clear objective is – at least in my opinion – far more gratifying than just trying to learn how hammers hammer and chisels chisel.

For those of you who don’t know me, my name is Daniel Fernandez and I have worked in data science – more precisely, in finance – for the better part of the last 10 years (you can read more about me here). I have also enjoyed writing through the years, mainly in my blog about currency trading and my blog about hydroponics.

Sadly my blogging hasn’t been very frequent during the past two years, mainly because my two blogging ventures became heavily restricted by agreements related to consulting work. In particular, I now work a full time job at an asset management firm, which makes writing in my currency blog difficult given the nature of my work.

However, my love for both data science and blogging is still alive and well, and I am therefore starting a completely new venture from scratch, with the hope that it will fill my desire both to explore new realms of knowledge and to share my findings with everyone out there willing to read my content.

Data science is a very broad and tough field though – with many great blogs out there! – so I want to differentiate my take by giving you insights into real world problems and real world scenarios. For this reason, I am offering my skills to those of you who have real life, business data science problems and want to get some work done, at absolutely no charge (sadly, no finance related problems). My only condition is that you allow me to share the experience and results on this blog (without disclosing any specific names or private information, of course). My free time is limited though, so I’ll only take a very limited number of requests at a time.

It’s a win-win: you get high quality work at no cost and I get interesting and unique material to share on my blog. If you’re interested, feel free to contact me using the contact form link, or leave a comment on this post. It’s a new start and I hope it will also be a great one!