In part 2 of this series of posts I looked at how, using Seattle insiderairbnb data, we could predict whether a given property was going to get above average reviews per month with almost 80% accuracy. In this post we are going to build a regression model on the same data to see how well we can predict the actual number of reviews per month an airbnb gets. We will also look at a simple example of how a single variable can affect our prediction target and how this can help us determine pricing for airbnb properties.

Real reviews per month as a function of the predicted value for the training and testing sets

Using an LGBM Regressor with the same input variables shared in my last post, we can build a model that attempts to predict the actual reviews per month for our properties. The image above shows the results of making predictions on training and testing sets created from a simple 70/30 randomized split of the data. The testing set curve in this example has an R^2 of 0.48 which, although far from perfect, shows a decent ability to make predictions.
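For reference, a score like the one quoted above can be computed from the real and predicted values along these lines. This is a minimal sketch in plain Python that assumes the score is the coefficient of determination (the post does not say which definition was used); the function name `r_squared` and the toy numbers are illustrative and are not taken from the Seattle model.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between real and predicted values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the real values
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not the Seattle data
real = [0.5, 1.0, 2.0, 4.0]
predicted = [1.0, 1.0, 2.5, 3.5]
print(round(r_squared(real, predicted), 3))
```

A perfect model gives 1, while a model that always predicts the mean of the real values gives 0.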

Our problems are mainly in the low reviews-per-month range, while we do a better job at predicting values further up the spectrum. For low values we tend to predict far more reviews per month than are actually observed in the data. This deviation exists even in our training data, so it is not surprising that we show the same difficulty with these predictions in the testing set. However, it is clear that we are far from perfect across the whole curve, as our current testing score is still very far from the ideal.

Variable importances for the LGBM Regressor model

Variable importance values show a very similar behavior when compared to our classifier model, with latitude and longitude being the most important variables, followed by price. Since price and cleaning fees are still very important variables, we can study them and see the effect that they have on our testing set when we change them in different ways. This can help us build a VERY primitive pricing model where we can change the price for an airbnb to determine how we can maximize the reviews per month it can get.

The graph below shows the ratio of the new prediction to the original prediction after a 20% reduction in price. As you can see, for properties that have high reviews per month a 20% reduction in price forecasts little change, while for the lower part of the spectrum we sometimes predict increases in the number of reviews of almost 10x. This is definitely not real (it would be a wonderful discovery that someone had deeply mispriced their rental) but rather a fluke related to how terrible the model is at predicting low reviews-per-month values. It can be even worse with the cleaning_fee, where one property is predicted to increase its reviews per month by 1600x with a 20% reduction!

New prediction/Original prediction as a function of the original prediction after a price reduction of 20% for 200 cases in the testing set

While we cannot judge any single prediction using plots like the one above, given a regression model with dispersion this terrible, we can look at general trends and make some very simple inferences. For example, the data does show that, in very general terms, a 20% reduction in price will tend to increase your reviews per month, especially if you are on the lower end of the reviews per month. By how much is not something we can say with this model, but at least we can see a general trend in that direction.
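The what-if analysis described above (change one variable, predict again, divide by the original prediction) can be sketched as follows. This is only an illustration: `prediction_ratios` is a hypothetical helper, and `toy_predict` is a made-up stand-in for a fitted regressor, not the LGBM model from this post.

```python
def prediction_ratios(predict, rows, column, factor=0.8):
    """Return new/original prediction for each row after scaling `column` by `factor`."""
    ratios = []
    for row in rows:
        original = predict(row)
        changed = dict(row)  # copy so the input row is not modified
        changed[column] = row[column] * factor
        # Assumes the original prediction is non-zero
        ratios.append(predict(changed) / original)
    return ratios


# Hypothetical stand-in for a fitted regressor: reviews fall as total cost rises
def toy_predict(row):
    return 100.0 / (row["price"] + row["cleaning_fee"])


listings = [
    {"price": 100.0, "cleaning_fee": 25.0},
    {"price": 250.0, "cleaning_fee": 50.0},
]
print(prediction_ratios(toy_predict, listings, "price"))
```

Passing `"cleaning_fee"` instead of `"price"` runs the same experiment on the cleaning fee.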

In the next post of this series we will look at improving this regression model, to hopefully get it to a point where we can carry out a deeper analysis of pricing and airbnb properties in Seattle.

In the first part of this series of posts, I looked at some initial data from insiderairbnb in Seattle and used a simple decision tree model to develop a basic understanding of the data and the relationships within it. Today I am going to talk about using a dataset with some expanded data, how we learn as a function of the number of examples within the dataset, how a more complicated model can help us greatly increase the accuracy of our predictions in this particular case and how variable importance changes when we move to a more robust model.

Learning curve using a LightGBM model to predict reviews_per_month > average

The initial dataset I used for this project was the “listings.csv” under the Seattle header. I thought that its contents were the same as the “listings.csv.gz” dataset, as they share the same name, but it turns out that the latter contains a lot of additional variables that do not exist in the first set. This expanded dataset contains information such as the number of bedrooms, bathrooms and the 30/60/90 availability information. It does require significant additional parsing though, as many numerical variables use “,” as a thousands separator and price values contain “$” symbols. Below is the list of variables I ended up using for making predictions:

'neighbourhood', 'neighbourhood_group_cleansed', 'latitude', 'longitude',
        'is_location_exact', 'property_type', 'room_type', 'accommodates',
        'bathrooms', 'bedrooms', 'beds', 'bed_type', 'square_feet',
        'price', 'weekly_price', 'monthly_price', 'security_deposit',
        'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
        'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
        'minimum_maximum_nights', 'maximum_maximum_nights',
        'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
        'has_availability', 'availability_30', 'availability_60',
        'availability_90', 'availability_365'
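The parsing mentioned above (thousands separators and “$” symbols in the price-like columns) can be handled with a small helper like this one. It is a sketch under the assumption that the values arrive as strings such as "$1,250.00"; the name `parse_money` is mine and not part of the dataset or any library.

```python
def parse_money(value):
    """Convert strings such as '$1,250.00' to a float; empty values become None."""
    if value is None:
        return None
    # Drop the currency symbol and the thousands separators
    cleaned = value.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


print(parse_money("$1,250.00"))  # 1250.0
print(parse_money("$85.00"))     # 85.0
print(parse_money(""))           # None
```

The same function would apply to columns like 'price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee' and 'extra_people'.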

The overall data composition is the same though, so the baseline to beat is still 60%, the share of listings with a below average “reviews_per_month”. Using the decision tree model discussed in the previous post I achieve an accuracy of 72% in 10-fold stratified cross-validation, better than before thanks to the additional variables now included. However, I thought I might try a more complicated model to see how both variable importance and prediction accuracy change.

The first image above shows you the learning curve for a LightGBM model trained on this same data. We clearly see that our learning in cross-validation continuously improves as we get more data, only reaching a plateau towards the end of the sample size. The convergence of our training and testing curves plus this tendency to learn more as a function of the amount of data does imply that our model is generalizing well, without significantly over-fitting. This model is also better at making predictions – as it uses boosting to improve predictive ability – taking our accuracy from 72 to 79% in stratified cross validation.

Importance for variables with importance > 0 for the LightGBM model
Probability as a function of price for an example in the testing set for the LightGBM model. Current price is shown as a green dot.

The importance of variables, in the second image in this post, also changes significantly when we go from a simple decision tree to LightGBM. The latitude and longitude suddenly become the most important variables (location, location, location), followed by the price of the rental. The cleaning fee, a variable that wasn’t present in the first dataset I looked at, also becomes important. However, it is worth noting that these relationships are still heavily non-linear and changes in the variables can have unexpected effects on the probability of having higher than average reviews per month. The third image in this post shows you an example of this, where lowering the price actually lowers the probability of having above average reviews, while increasing the price increases this probability up to a point.

Given that a LightGBM model can be so successful as a classifier for “above average reviews per month”, with an accuracy of almost 80%, I wonder if we could actually build a successful regressor to tackle this problem. A regressor would be very useful since we would be able to see the specific predicted reviews per month as a function of the variation of any number of parameters we desire. With this in mind we could potentially evaluate how optimal the pricing of an airbnb is, whether we would obtain substantial gains from lowering the cleaning fee, etc. We will look at building a regressor and evaluating its results in my third post in this series.

Rental markets provide large amounts of data where value extraction can help all players involved. A better constructed and priced market can benefit the people who rent, the people who own real estate and the management companies that aggregate and manage these properties. Although free data sources for rental markets are not very common we do have insiderairbnb which provides free-to-use data from airbnb that can help us gain insights into how these rental markets work. This is my first dive into rental markets and the insiderairbnb data so feel free to comment if you think I have made any mistakes processing or using this information!

Distribution of the number of reviews per month

As a small initial experiment I started to look into data for Airbnb in Seattle. Given the structure of the insiderairbnb data, I wanted to see if I could successfully predict the number of reviews per month a given airbnb received. This is a useful data point to predict since it is expected to be significantly correlated with the number of stays and therefore with the money earned by the person who owns the property and whoever manages it. It is also useful since, as a landlord for example, we might be interested in owning a property where the number of reviews can increase as quickly as possible, rather than being solely focused on the occupancy rate (which sadly is not available in the dataset).

The first step I carried out was to remove variables that are only going to be spuriously related with the reviews per month, or that are other expressions of it. I therefore removed the following variables from my dataset: “number_of_reviews”, “name”, “host_name”, “last_review”, “host_id” and “id”. With the data now cleaner for prediction I then proceeded to one-hot encode all of the remaining string variables (“neighbourhood”, “neighbourhood_group” and “room_type”). This basically means that instead of having one “room_type” column, I now have one column for each potential value of “room_type”, holding either a 0 or a 1 depending on whether that value applies to that particular row. I also removed all rows in the dataset where “reviews_per_month” was zero, since this just means no valid data is available for evaluation.
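The encoding step described above can be illustrated with a few lines of plain Python. This is a sketch of the idea rather than the code used for the post; `one_hot` is a hypothetical helper and the two rows are made-up examples.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 column per distinct value found in `rows`."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column untouched
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            new_row["{}_{}".format(column, value)] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"room_type": "Entire home/apt", "price": 120},
    {"room_type": "Private room", "price": 60},
]
for row in one_hot(rows, "room_type"):
    print(row)
```

In practice a library helper such as pandas' `get_dummies` does the same job on a whole DataFrame.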

Feature importance from simple single decision tree classification (max_depth=5, criterion=”entropy”, class_weight=”balanced”)

After the dataset was clean I then needed to decide exactly what to predict. A regression approach to predict the exact reviews_per_month was not likely to be an easy problem given the characteristics of the data at hand (7795 examples with 114 data columns), so I was more likely to succeed by simplifying the problem to make my inferences stronger. To do this I decided to instead predict whether the reviews per month were going to be higher than the average value of the dataset. The first image in this post shows you the distribution of reviews per month and the average (vertical blue line). In this particular distribution 60% of the values are below the blue line, so I need an accuracy of more than 60% to have an edge over simply always guessing “below average”.
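The baseline described above is just the share of the larger class once the values are split at the mean. A minimal sketch, using a made-up right-skewed sample rather than the Seattle data (`majority_baseline` is an illustrative name):

```python
def majority_baseline(values):
    """Accuracy obtained by always predicting the most common side of the mean."""
    mean = sum(values) / len(values)
    above = sum(1 for v in values if v > mean)
    below_or_equal = len(values) - above
    return max(above, below_or_equal) / len(values)


# Toy right-skewed sample, not the Seattle data
reviews_per_month = [0.2, 0.4, 0.5, 0.8, 1.0, 1.1, 3.0, 4.5, 5.0, 6.5]
print(majority_baseline(reviews_per_month))
```

Any classifier has to beat this number, not 50%, to be considered useful, because a skewed distribution puts more than half of the listings on one side of the mean.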

With this now in mind I started with a decision tree model so that I could understand which features were important overall and get an idea about what data interactions might lead to better predictions. I tuned a decision tree classifier using a stratified 10-fold cross validation, obtaining an average accuracy of around 70%, a significant edge over the 60% baseline. However, more importantly, the decision tree classifier offers us significant insights into things like variable importance and, even better, we can get a direct look into how the tree is constructed, something far less tractable when we move to more complex tree-based methods.

Price as a function of the reviews per month, the relationship is clearly non-linear by a large degree

The second image in this post shows you the importance for variables with an importance metric greater than zero. These are the variables that are most effective in generating a good classification split. The amount of availability plays a key role, since we clearly expect properties that are available more to get more reviews, and so do the minimum nights, the calculated number of listings for that same host, the price and the room type of the rental property. These variables all make intuitive sense, since we would expect rental properties that are more expensive to be rented less and properties that belong to more experienced owners (people with more properties on airbnb) to do better overall, as they are likely going to be better at handling the properties themselves.

Note though that these relationships are far from linear in nature and require a lot of additional insight to be useful (see the third plot in this post, for an example using price). Just because someone rents a property at 100 USD instead of 200 USD that does not mean that the person will automatically get way more reviews per month. The type of room, the availability, the minimum nights, etc, all play a big role in determining the role that price is playing in the mix.

The decision tree that was created from my model building.

One of the key advantages of building a simple decision tree model first is the level of learning we can do just by looking at how the actual tree is built. The image directly above shows you the graphical representation of the decision tree that was constructed from this data. The first split is actually an availability based split, followed by a price split to the left and a split on the calculated number of listings for the same host to the right. This is telling us that for places with no availability restrictions (0 value) the price is the next most important thing, while if there are availability restrictions you need to look at a proxy of the host’s experience. As the tree goes deeper we then get into the other variables, eventually asking questions as specific as whether the airbnb is located in a given neighbourhood or not.

As you can see the topic is very interesting and a ton of potential insights can be made just from looking at one city’s data and understanding the outputs of a simple model, such as a decision tree. With this information we could already start helping someone make a better decision about what the ideal property to have the highest number of reviews per month would look like in Seattle. In the next part I’ll look into more complex models and the learning curves we get for this problem.


Several people have approached me since I started this blog asking about projects involving predictions in sports, either in real sports or in fantasy sports leagues. Sadly I do not follow any sports and lack some of the critical domain knowledge necessary to carry out this task from scratch. However, my friend and colleague Waco has been doing sports betting as a hobby for some time and was kind enough to share some of his hockey data and knowledge so that we could work together on improving his current model. Note that he should take most of the credit for the insights leading to these results. This article contains my first impressions about using data science to build sports models and some of the results we have been able to obtain.

Learning curve of a classifier model trying to predict if the home team will win. Our model reaches an impressive 77% accuracy in cross-validation (blue line is expectation from chance)

Sports data has several characteristics that make the prediction of outcomes difficult. The events have strong statistical drivers – such as how good the players are – yet they have significant contributions from randomness, such as whether players get injured during a game or whether players are having internal conflicts. Things like the weather, game attendance or even the family lives of the players can all impact games irrespective of how good or bad the players are on a leveled statistical ground. Whether the sport being played is a truer team sport or a more individual centered sport – like baseball – also plays a huge role in determining which variables are useful for prediction and what sources of randomness can be more important.

In sports, domain knowledge plays a huge role. The dataset I started with has more than 40 variables that correspond to statistics known before the beginning of each game at each point in time for both teams playing, with data for more than 10K games spanning more than 10 years of data. The chance of winning by picking a team randomly is close to 50% but increases to 54% if you pick the home team! That’s a very simple edge that comes from “very basic” domain knowledge.

The home/away team is a very important discriminator. If we try to predict games without discriminating which set of statistics belongs to the home or the away team we get scores that are close to what we expect from randomness, regardless of what fancy model we try to use, while aligning our statistics to have separate variables for the home and away teams ensures that we get a very substantial statistical edge. This is because some statistics matter way more when you’re playing as the home team while others matter more when you’re the away team. Knowing if the team is playing as a home or away team allows us to pay attention to a specific set of features instead of another, depending on each team’s home/away status.
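To make the idea of aligning statistics by home/away status concrete, here is a small sketch. Since the real statistics are not disclosed in this post, every name in it (`align_game`, `win_rate`, `goals_per_game`, the teams and the scores) is hypothetical and only illustrates the layout of the feature row.

```python
def align_game(game, team_stats):
    """Build one feature row with separate home_* and away_* columns."""
    row = {}
    for side in ("home", "away"):
        team = game[side]
        for name, value in team_stats[team].items():
            row["{}_{}".format(side, name)] = value
    # Target: 1 if the home team won, 0 otherwise
    row["home_win"] = 1 if game["home_goals"] > game["away_goals"] else 0
    return row


# Hypothetical pre-game statistics and one finished game
team_stats = {
    "A": {"win_rate": 0.60, "goals_per_game": 3.1},
    "B": {"win_rate": 0.45, "goals_per_game": 2.7},
}
game = {"home": "A", "away": "B", "home_goals": 4, "away_goals": 2}
print(align_game(game, team_stats))
```

With this layout the same statistic appears twice, once per role, so a model can weigh it differently depending on whether the team plays at home or away.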

Importance as measured by an extra tree classifier for 25 randomly sampled variables from our total set of features. Note that this model was just used to measure importances.

Interestingly, the predictive ability of the variables is fairly homogeneous and the variable importance, as shown in the image above, is fairly well distributed across the variables we have. The above shows only 25 randomly drawn variables from all those in our set, but it does retain the overall shape of the complete curve. Although there are around 10 features with significantly higher importance, using only those ten restricts the accuracy of the generated models very substantially.

After properly grouping our variables we then proceeded to build models, with our best shots achieving a mean accuracy of around 77% in cross-validation, significantly higher than the expectation from chance (54%). The learning curve shown in the first image in this post represents how our model learns as a function of the number of examples. This curve shows that we gain most of our insights from the first 2K games and then only get very marginal improvements from the rest of the data. This is a good example of irreducible error: we are unlikely to get better results with more examples or more complicated models, and would instead need data that gives insights not present within our current statistics. Even that is expected to stop helping at some point, mainly due to the random component inherent to sports in general.

Distribution of accuracy scores in 50-fold cross-validation

The image above also shows that our model is unlikely to be curve-fitted as its dispersion in scores in 50-fold cross validation is small, showing that the insights it has gained can provide good generalization over the problem of hockey games, at least in the historical period we have studied. We will see if it is able to make predictions that are this great during the next hockey season!

Of course, you might have noticed that the data source and actual statistics/models we used have been purposefully obscured through this post – as they are a part of Waco’s IP – but if you would like to gain more information about this model or gain additional insights into hockey or other sports I would encourage you to contact Waco either on his twitter or linkedin accounts.

The amount of knowledge we have produced as a species grows every year. With so much content, it’s becoming impossible for us to completely absorb and digest the possibilities within it, making us limited in our ability to make as many inferences as would be possible if we could actually fit everything we produced inside our heads. Systematically studying the scientific literature with machine learning tools is going to be extremely important if we are to exploit our current scientific findings to their full potential.

import urllib.parse
from time import sleep

from bs4 import BeautifulSoup
from requests import get

# Base search URL. The URL was stripped from the original listing, so this
# value is an assumption based on Google Scholar's public search page.
SCHOLAR_URL = "https://scholar.google.com/scholar"


class GoogleScholarCrawler:
    """Searches Google Scholar and collects the links of the results."""

    def __init__(self):
        """Empty constructor."""

    def crawl(self, terms, limit):
        """Search Google Scholar for `terms` and return up to `limit` result URLs."""
        headers = {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-US,en;q=0.9',
            'Pragma': 'no-cache',
        }
        i = 0
        results = []
        while i < limit:
            # urlencode escapes the query and turns spaces between terms into "+"
            params = urllib.parse.urlencode({'q': " ".join(terms), 'start': i, 'num': 10})
            url = "{}?{}".format(SCHOLAR_URL, params)
            resp = get(url, headers=headers, allow_redirects=False)
            if resp.status_code != 200:
                # Most likely blocked or rate limited: return what we have so far
                print("ERROR: ")
                print(resp.status_code)
                return results
            soup = BeautifulSoup(resp.content, "html.parser")
            # Each result title lives in an element with class "gs_rt"
            for record in soup.find_all(attrs={'class': 'gs_rt'}):
                links = record.find_all('a', href=True)
                if links:  # citation-only entries have no link
                    results.append(links[0]["href"])
            i += 10
            sleep(1)  # short pause between result pages
        return results[:limit]


# example code
crawler = GoogleScholarCrawler()
results = crawler.crawl(['asteroid mining'], 10)
for r in results:
    print(r)

Mining the scientific literature for insights is not a new concept. However, we now have the computational learning tools that are necessary to start making really meaningful insights. Scientific articles like this recent one show how we can actually use papers to make predictions that are outside of what scientists have imagined up until now. Using only abstracts – not even the full papers – these scientists managed to make meaningful forecasts about potentially new thermoelectric materials. Even more impressive is the fact that all these inferences were done from a purely linguistic basis, without ever inserting actual knowledge about physics or chemistry into the machine learning process.

There are many problems where we could use a tool like this, but even the construction of a database to perform these searches is incredibly difficult and restrictive. Journal and indexing websites that have this information restrict its usage to paid subscribers, sometimes even charging more money for API access to facilitate the above process. This makes any effort resembling the above limited to those who have the access and resources to carry them out, leaving a ton of potential advances out of the question, simply because only a few privileged people have access to the data.

To try to solve this problem, I have thought about the tools that we might use to create an actual database of scientific abstracts for a given topic, in a manner that is completely free. The first free tool that came to mind was google scholar, which allows us to perform rather extensive searches of the scientific literature and obtain the urls of the relevant papers. Taking some inspiration from a couple of older google scholar python scripts available online, I created the new code you can see above.

Cartoon of the discovery of the inverse impact law exhibit in the pre-modern hall of the science history museum

This very simple crawler queries the google scholar website searching for some specific keywords and then returns the number of results requested by the user. The above code also includes an example that returns the first 10 results for “asteroid mining”. In this first step the crawler returns only the links to the pages and the next step will be to build a second crawler to then use these urls to extract the abstract information, from at least the most popular journal websites. With the url information we could then filter by publisher or indexer, which should make the task a lot easier.
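The filtering by publisher mentioned above could start from something as simple as matching the host name of each url. The sketch below uses only the standard library; `filter_by_domain` is a hypothetical helper and the example domains are placeholders, not real journal websites.

```python
from urllib.parse import urlparse


def filter_by_domain(urls, domains):
    """Keep URLs whose host equals, or is a subdomain of, one of `domains`."""
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if any(host == d or host.endswith("." + d) for d in domains):
            kept.append(url)
    return kept


# Placeholder urls standing in for crawler results
urls = [
    "https://www.example-journal.org/article/123",
    "https://repository.example.edu/paper.pdf",
]
print(filter_by_domain(urls, ["example-journal.org"]))
```

Grouping the results this way means one abstract parser per publisher can be written and applied only to the urls it understands.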

The above crawler might also be too simple for practical use, since google may easily ban your attempts due to its rather obvious crawling activity. The use of proxies and other tactics might be necessary to build a database with 10K+ results for a given topic. However, it is a first step and hopefully it will allow me to start constructing databases to tackle some scientific topics that I find very interesting.

In the first post in my series about predicting ion concentrations using a cheap method, we looked at the AfSIS database and how it can be used to create models for the prediction of several ion concentrations in soil using cheap-to-measure spectroscopic data. However, it was clear from these results that not all ions were equally easy to predict, with Zn and P giving us the worst prediction results overall. In today’s post I want to look at why this is the case and what we may be able to do to improve our results.

Distribution plots of concentration for the different ions across all the experiments
Spectra for max (blue) and min (orange) concentrations of the different ions in the AfSIS database

To understand why you’re bad at predicting a variable with a set of data, it is first important to understand how the data you have relates to your target variable and potentially to others you might be better at predicting. Since we are trying to relate spectra with concentrations, it makes sense to look at the spectra we may expect at the extremes of the concentration distributions. Since spectroscopic signals are in general expected to grow with concentration (in the simplest case proportionally), this gives us an idea of our potential signal-to-noise ratio. Besides this we can also look at how the concentrations of the different ions compare, since ions that are more concentrated could rationally be expected to be easier to predict, as their signal-to-noise ratio should also be higher (provided the response for all ions is fairly similar).

The first plot in this post shows you the distribution of concentrations for the different ions. From here we can already see that predicting Al, Mg, Ca and Mn should be easier, given that we have higher absolute concentrations and a better distribution of examples from the lowest to the highest concentrations in the database. The cases for P and Zn don’t look too strong, as most of the samples have very low concentrations relative to the above elements.

The second plot above shows you a min/max analysis, with the spectra corresponding to the highest and lowest concentrations in the database for each ion. Here you can see that the variables we’re good at predicting – such as B and Ca – have significantly large variations from minimum to maximum concentrations. We can also see that in the case of Zn and P the variations are poor, where in the case of Zn the spectra of the minimum and maximum Zn concentrations are almost identical, which means our chances of predicting Zn from these spectra are way worse when compared to any of the other ions.

Square of the Pearson correlation (R^2) of the ions vs each wavelength of the spectra provided across the entire database.

Another useful analysis is to look at how the concentration of our ions relates linearly to each wavelength across the entire database. We can do this by calculating the R^2 of all the concentration vs individual wavelength plots, which leads to the graph above. In this case we can see that ions that are easier to predict have relatively linear relationships with some wavelengths, while the ions that are hard to predict basically never go above 0.15, meaning that there are no simple linear relationships within the database between any wavelength and their concentrations. However, there are ions that are easier to predict than P or Zn, like Fe for example, whose maximum linear correlation also stays in that region. This means that we predict Fe mainly thanks to non-linear relationships within the data, and these relationships seem to be far weaker for P and Zn.
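The per-wavelength analysis described above boils down to computing the square of the Pearson correlation between each spectral column and the concentration. A minimal plain Python sketch follows; the function names and the tiny two-wavelength dataset are illustrative assumptions, not the AfSIS data.

```python
def pearson_r2(x, y):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov * cov / (var_x * var_y)


def r2_per_wavelength(spectra, concentrations):
    """`spectra` is a list of samples, each a list with one value per wavelength."""
    columns = zip(*spectra)  # transpose: one tuple per wavelength
    return [pearson_r2(list(col), concentrations) for col in columns]


# Toy data: 4 samples, 2 wavelengths
spectra = [[0.1, 0.9], [0.2, 0.4], [0.3, 0.7], [0.4, 0.5]]
concentrations = [1.0, 2.0, 3.0, 4.0]
print(r2_per_wavelength(spectra, concentrations))
```

In this toy example the first wavelength tracks the concentration linearly and scores close to 1, while the second one scores much lower.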

A plot like the above also hints at how we might be able to simplify models for the different ions. For example, if we create a model to predict Mg concentration that just uses wavelengths between 4500 and 6000 we can basically obtain the same level of accuracy as a model using the entire spectrum, simply because this region holds most of the information. An LGBM model created using data from this region for Mg achieves an R^2 of 0.68, just as high as the model shown in the previous post.

With the above information it is clear why P and Zn are hard to predict. Low signal-to-noise ratio, no strong linear relationships with any wavelength and weak non-linear relationships overall compared with other ions, like Fe. Is there any hope then? I want to try a few other things, so stay tuned for the next post in this series!

For growers it is generally very important to know the composition of their soil. A soil that lacks certain nutrients will commonly need some form of amendment while a soil that contains very high concentrations of certain elements – like aluminium for example – might not be suitable for growing certain crops. However measuring the concentration of metal ions in soil is not trivial – since direct lab measurements of ion concentrations are expensive – so a cheap way to do this using a “field friendly” method would be invaluable. Can we create accurate models to predict metal concentrations based on a cheap measurement? We’ll see!

The output of an LGBM regression algorithm across 5-fold cross-validation for testing sets for 12 different elemental concentrations (log(M+1)).

This is where the AfSIS database comes into play. This set of data, created by the Africa Soil Information Service project, contains the output of a cheap spectroscopic technique together with the actual lab analysis results of hundreds of different samples, distributed across the African continent. With this spectral and analysis data we can then construct statistical models to attempt to predict actual ion concentrations from these simple and cheap IR spectroscopy measurements. Better yet, the AfSIS database is freely available and hosted on AWS. The database files even contain some simple examples of how to carry out basic machine learning and statistical analysis using the project files.

Several years ago, a kaggle competition was actually run on this problem and the winner, not surprisingly, used a complex ensemble of support vector machines and neural networks with different levels of complexity (view the winning solution here). The scores of the final solution were also not very high and can now be surpassed with boosting algorithms developed in recent years (like LGBM) plus the use of additional data that has been collected by the project since that time. The kaggle challenge also addressed only 5 measurements (Ca, P, pH, SOC and Sand), while I find it way more interesting, from a practical standpoint, to look at all 12 elemental concentrations that have been measured.

Average square of the Pearson correlation of the real vs predicted log(1+M) plots for test sets in 5-fold cross-validation

Using a simple boosting regression approach (LGBM) with no optimization, with the gradient of the spectral information as the input and the log(M+1) of the ion concentration as the target, I was able to obtain the results shown in this post. The first image shows the results for the 5 testing sets obtained with random shuffling of the data for each ion, while the second image shows the mean squared Pearson correlation coefficient of these graphs. We can see that pretty good predictions can be achieved for Ca, Mg, Al and B, while the problem becomes way harder for elements like P and Zn.
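To make this setup a bit more concrete, here is a minimal sketch of the pipeline in plain Python. It is not the code behind the figures: the data is synthetic, the helper names (`spectral_gradient`, `squared_pearson`, `cross_validated_r2`) are made up for illustration, and a tiny nearest-neighbour model stands in for LGBM so that the example runs without extra libraries. Any model object with `fit` and `predict` methods, such as lightgbm's `LGBMRegressor`, should be usable in its place.

```python
import math
import random


def spectral_gradient(spectrum):
    """Finite-difference gradient: change between neighbouring spectral points."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def squared_pearson(x, y):
    """Square of the Pearson correlation coefficient between two sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


class NearestNeighbour:
    """Stand-in model (NOT LGBM): predicts the target of the closest training row."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            nearest = min(range(len(self.X)), key=lambda i: math.dist(row, self.X[i]))
            predictions.append(self.y[nearest])
        return predictions


def cross_validated_r2(spectra, concentrations, make_model, k=5, seed=0):
    """Mean squared Pearson correlation (real vs predicted) over k shuffled test folds."""
    X = [spectral_gradient(s) for s in spectra]   # gradient of each spectrum as input
    y = [math.log1p(c) for c in concentrations]   # log(M + 1) as the target
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    scores = []
    for fold in range(k):
        test = order[fold::k]
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        predicted = model.predict([X[i] for i in test])
        scores.append(squared_pearson([y[i] for i in test], predicted))
    return sum(scores) / k


# Synthetic example: the slope of each fake "spectrum" grows with the concentration.
rng = random.Random(1)
concentrations = [rng.uniform(0, 50) for _ in range(60)]
spectra = [[0.01 * c * j + rng.gauss(0, 0.005) for j in range(20)] for c in concentrations]
score = cross_validated_r2(spectra, concentrations, NearestNeighbour)
print(f"mean squared Pearson correlation over 5 folds: {score:.2f}")
```

The model is passed in as a factory so that each fold trains a fresh copy, which keeps information from one test fold from leaking into the next.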

Notice that I have made no extensive effort to improve the model, since at this point I’m just interested in understanding the problem: what’s hard to predict and what’s easier to predict, so that I can better understand where to focus. I want to know where the spectral measurements hold the most value and where they are weak.

The questions we need to ask now are, what makes some elements easier to predict than others? Why are Zn and P so hard to predict from this data? Can we use the relatively accurate predictions we have for Mg and Ca to enhance our predictions for other ions? Is there a more intelligent way to preprocess this data to extract information? Stay tuned for part two in this series, where I will try to answer some of these questions.

There are many different kinds of data problems in this world; some are tackled very easily and some require very complicated solutions. However, some problems might be very difficult to solve at all, independently of the level of complexity of the solutions attempted. Today I want to talk about a key aspect of prediction problems: the degree of determinism of the problem at hand. I’ll start with some examples that align more with reinforcement learning, mainly games, and then move into examples that are more related to traditional regression/classification problems.

Was beating the world experts in the game of Go a difficult problem? Yes it was, and it has been one of the crowning achievements of deep learning so far. It was also a very deterministic problem, a problem with known rules and known agents, a problem that requires a very complicated solution – yes – but coming from a very simple framework. You know what you have to do, you know the consequences of every action and you can practice as much as you want. Games like this one are as deterministic as can be – solving them might not be trivial – but the difficulty comes from the complexity of learning the potential outcomes of the game, not from not knowing how the game is played or the consequences that actions have.

Now, imagine that a game of Go contained a random component, where there was some probability that moving a piece to one place would cause a completely different action to happen. Can we learn how to play a game like this? If the game is now less deterministic – there’s noise in it – then being good at it becomes way harder, and the more random it becomes the less likely it is that a player will be able to become skilled at it. If the game were fully random – every action led to an unknown consequence – then it would simply be impossible to learn, regardless of how much we played.

Some problems are like a deterministic game: they are more clearly determined in nature and their solution is more fixed (whether that solution is simple or complex), while others are much more random and the degree to which we can learn the real “game” vs the randomness around it is not very well understood. A problem like classifying a picture of a cat or dog is much more deterministic – all we need is enough examples to do a good job – while a problem like sports betting or predicting the weather is much more difficult.

The perceived non-deterministic nature of problems usually relates to the number of potential variables that affect their outcomes and our ability to create or observe examples for them. The variables that affect the outcome of something like a baseball game may go from someone yelling at the game, to the weather, to what each player ate for breakfast that day, while whether a picture shows a dog or a cat will just depend on whether it fits a known canon of what we all know dogs and cats to be. I know I can learn that canon from looking at pictures of dogs and cats, but I don’t know what information I need to gather to successfully predict baseball games. In one case my world is constrained, in the other it’s way less determined. Both processes are predictable in the end – if we possessed all the information – but in one case acquiring enough information to make an accurate prediction is way harder than in the other.

In terms of making predictions, easier problems are usually defined by very strong relationships between the variable we want to predict and the observed variables, while harder problems are associated with complicated relationships that may be hard to distinguish from the relationships expected from random chance. If we don’t know what to measure, we might measure many things that have nothing to do with our problem and end up with confounding results. That’s why the solution to not knowing what to measure is never to measure everything.

If you have a non-deterministic problem, especially if the number of examples you have is heavily constrained, then you have to be very intelligent about the observable variables you will be measuring and about how you will account for spuriously finding variables that predict your desired outcome just because you’ve searched for long enough.
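A small simulation shows how easy it is to fool yourself this way. The sketch below is purely illustrative, with made-up names and sizes: the target and every candidate "variable" are independent coin flips, we keep whichever candidate agrees best with the target on a few observed examples, and then we check how that same candidate does on data it was not selected on.

```python
import random


def agreement(a, b):
    """Fraction of positions where two equally long sequences match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def search_for_predictor(n_seen=30, n_future=1000, n_candidates=2000, seed=0):
    """Pick the best of many pure-noise predictors, then test it on unseen data."""
    rng = random.Random(seed)
    length = n_seen + n_future
    target = [rng.random() < 0.5 for _ in range(length)]
    best_seen, best_candidate = -1.0, None
    for _ in range(n_candidates):
        candidate = [rng.random() < 0.5 for _ in range(length)]
        # Selection only ever looks at the first n_seen examples.
        score = agreement(target[:n_seen], candidate[:n_seen])
        if score > best_seen:
            best_seen, best_candidate = score, candidate
    # The remaining examples play the role of the future.
    future = agreement(target[n_seen:], best_candidate[n_seen:])
    return best_seen, future


seen, future = search_for_predictor()
print(f"best agreement on the examples we searched over: {seen:.0%}")
print(f"agreement of that same variable on new examples: {future:.0%}")
```

With only 30 examples and thousands of candidates, some noise variable will almost always look like a strong predictor on the data used for the search, while on new examples it falls back to roughly a coin flip.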

Imagine if you were tasked with predicting the outcome of a machine using the input from as many other similar machines as you wanted. All of these machines are true random number generators (but you don’t know that). How long would it take for you to give up? Would you ever?

In the last couple of articles in this series we discussed Google search trends, looking at the relative search importance of “Pizza in (state)” and “Pizza in (city)” for all US states and the 100 largest US cities. This led us to believe that search trends have been going down across the country, which suggested that, perhaps, pizza was not the best choice of restaurant. Today I want to talk about some bias issues with these previous search results and how I addressed them using geo-tags in search results.

Five year change in the one year moving average for the term “Pizza” for searches done from each particular state

Both of my previous posts clearly showed that the relative search importance of “Pizza in (state or city)” was decreasing overall, which I – perhaps too quickly – interpreted as a general lack of interest in pizza. However, a friend made me realize that I was forgetting a huge part of potential pizza searches, namely those like “Pizza near me” or “Pizza near (insert landmark)”. Maybe it is people’s search habits that are changing, not their love for pizza.

This means that what I was observing before was not a general decrease in interest for pizza but – most likely – a general change in the way that people search for pizza places. Since search engines are now extremely location-aware, it has become unnecessary to be so specific, and much simpler queries, like “pizza near work” or “pizza close by”, have probably become way more common.

To solve this problem I have changed my pytrends code to instead look for any searches that contain “Pizza”, but only take into account searches done from within each particular state. That way I am more truly measuring actual search interest in pizza that is not biased by a particular way of searching. This also solves the problem of accounting for searches by people who are unlikely to purchase pizza – since I might google “Pizza in Chicago” but might never actually go there – so it is a win-win scenario in terms of bias reduction.
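As a rough idea of what such a query can look like with pytrends, here is a minimal sketch. This is not my exact script: the helper names are hypothetical and the five year `timeframe` is an assumption chosen to match the plots in this post. The restriction to one state comes from the `geo` argument, which takes codes of the form `US-OR`.

```python
def state_geo(state_code):
    """Build a Google Trends geo code for a US state, e.g. 'OR' -> 'US-OR'."""
    return "US-" + state_code.strip().upper()


def fetch_state_interest(state_code, keyword="Pizza", timeframe="today 5-y"):
    """Relative search interest over time for `keyword`, restricted to one state."""
    # Imported here so the helper above can be used without pytrends installed.
    from pytrends.request import TrendReq

    trends = TrendReq(hl="en-US", tz=360)
    trends.build_payload([keyword], timeframe=timeframe, geo=state_geo(state_code))
    frame = trends.interest_over_time()  # pandas DataFrame indexed by date
    if frame.empty:
        return None  # Google returned no data for this query
    return frame[keyword]


# Example (needs network access and the pytrends package):
# oregon = fetch_state_interest("OR")
```

Because each state is requested on its own, the 0 to 100 scale of every series is relative to that state's own peak, so the values describe how interest evolved within a state rather than how states compare in absolute volume.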

Google trends relative search interest evolution for searches including “Pizza” carried out from each US state

The first image in this post shows you the change in the one year moving average from the start to the end of the data and in this case, to my surprise, we see that for almost all states interest has been increasing. The states with the highest interest increases are Oregon, Virginia, Delaware, Connecticut and West Virginia, all states that rank high in their current number of pizza restaurants and their growth during the past several years.
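For reference, the change in the one year moving average can be computed along these lines. This is a sketch with hypothetical helper names, and it assumes weekly data points, which is why the default window is 52; the series at the end is made up just to show the call.

```python
def moving_average(values, window):
    """Trailing moving average, one output value per full window."""
    if len(values) < window:
        raise ValueError("need at least one full window of data")
    total = sum(values[:window])
    averages = [total / window]
    # Slide the window forward: add the newest value, drop the oldest one.
    for oldest, newest in zip(values, values[window:]):
        total += newest - oldest
        averages.append(total / window)
    return averages


def moving_average_change(values, window=52):
    """Difference between the last and the first moving-average value."""
    smoothed = moving_average(values, window)
    return smoothed[-1] - smoothed[0]


# Made-up series: five years of weekly relative interest drifting upwards.
weekly_interest = [40 + 0.1 * week for week in range(260)]
print(moving_average_change(weekly_interest))
```

Comparing moving averages instead of raw first and last values keeps seasonal swings or a single unusual week from dominating the result.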

The second figure shows you the Google Trends evolution for all the different states. Here we can see that West Virginia has the steadiest growth, and since this is relative interest, a steady move towards the upside implies that absolute interest is increasing (since the 100 measure has been reached repeatedly in the past). We should therefore prefer states where relative interest makes consistent new highs over states like Oregon, where the net increase in interest has been higher but touches of the 100 value are less frequent.

Squared Pearson correlation coefficients for the yearly moving average vs time plots

The square of the Pearson correlation coefficient can give us an idea of which states have had the most stable increase in interest through time. Here we can see that Wyoming – which has only had a modest rise in relative interest – has done so at the steadiest pace among all the states. Idaho and Kansas have also had similarly steady increases, while the states with the largest increases in interest – like Oregon and West Virginia – have actually been around the middle in terms of their stability.
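To see why this number separates a steady rise from a jumpy one, here is a minimal sketch that correlates a smoothed series with its own time index. The function name `steadiness` and the two tiny series are made up for illustration.

```python
def steadiness(smoothed):
    """Squared Pearson correlation between a series and its time index."""
    n = len(smoothed)
    mean_t = (n - 1) / 2
    mean_v = sum(smoothed) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(smoothed))
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    var_v = sum((v - mean_v) ** 2 for v in smoothed)
    if var_v == 0:
        return 0.0  # a flat series has no trend to measure
    return cov * cov / (var_t * var_v)


steady = [10, 11, 12, 13, 14]  # rises by the same amount every step
jumpy = [10, 10, 10, 10, 14]   # same total rise, all of it in the last step
print(steadiness(steady), steadiness(jumpy))
```

Both made-up series end 4 points higher than they started, but the first one scores 1.0 while the second only gets about 0.5. That is the same distinction as the one between a state with a modest but constant rise and a state with a larger but more erratic one.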

With this information I am now more confident that pizza interest in the US is alive and well – which matches my actual real life experience way better – and I think these results do point us to some key states where demand for pizza is expected to be greater or more stable. Using only free data, we have therefore narrowed down our possibilities and can now say a lot more about interest for pizza in the US. If past trends are to continue – big if – then the safest bets for a new pizza place are probably Wyoming and Idaho, given the high steadiness of their trends.

The above is, however, only observation and speculation. Since we don’t have a proxy for restaurant success, it’s hard to know if the above affirmations actually hold and if they do correlate with restaurant success. However, we have taken free data a long way and learned a lot about pizza places through the process! I hope you enjoyed this first journey!

Competitions around AI and machine learning in general have become more and more common during the past five years. In these competitions you are generally given two sets of data – a training set and a testing set – and you are asked to submit your predictions on the testing set data. In this post I want to talk about how these competitions work, what you can learn from them and, most importantly, what you will not be able to learn from them.

These competitions are generally created by industry participants and are meant to increase their ability to carry out some particular task. For example, Netflix created a competition to enhance their ability to predict whether a user would like a certain piece of media. Other examples include predicting kinetic constants in drug development, fraud in credit card transactions or daily returns in financial markets.

You can participate in these competitions for free and they are great for practicing the basics of machine learning problems with some real – yet already curated and built – data sets. With these data sources you can generally learn to tackle basic problems in data cleaning, like dealing with missing values and outliers, and you can learn a lot about the basic technical aspects of model creation, including how to do things like cross-validation, training, predicting, etc. They are also great for getting practical experience with the most commonly used and powerful machine learning algorithms/libraries and how to actually use/tune them in practice.

The problem with these competitions is their focus on the “last mile” part of machine learning problems. Generally most people in these competitions will reach 95% of the top score very quickly, but there might be 1000 places between 95% and 99% of that score. Imagine you’re evaluated on the accuracy of your predictions and the best contestant is making predictions with 75% accuracy. An experienced person in machine learning will probably get to 71% very quickly, but it will be a struggle to get to that 75%.

In real life, the “last mile” problem is very rarely the problem that gets the most attention. An algorithm that makes 71% accurate predictions might be just as good as one that makes 75% accurate predictions, especially if getting to that higher accuracy would imply spending 5x more resources. In most cases extracting this last increase in the goal metric is not going to be cost-efficient, unless the problem is extremely valuable from an economic perspective. The dollar cost of having a senior machine learning engineer or data scientist tackle the “last mile” problem is just not worth it for most companies.

Under real life circumstances the most important factors are often not related to “how good” the algorithm is but instead to how practical it actually is. This means thinking about how easy it is to get the data to run it, how easy it is to retrain and – very commonly – how easy it is to deploy. It might be more desirable to use an algorithm that is simple to deploy and retrain – that delivers 95% of the prediction score of the best state-of-the-art solution – than to use a far more complicated algorithm that gives you a little bit more at the expense of a lot of additional time and complexity.

A snapshot of the leaderboard for a kaggle fraud detection competition. Position 1000 is already at 98% of the value of the top score.

Another key aspect of real life model building is data gathering and curating. In a Kaggle-style competition the decision of what data has been gathered and how it is curated has already been made for the most part, while in real life deciding exactly what data to measure, use and take into account is a critical part of the model development process. It is usually way easier to get better gains by choosing better variables to measure than it is to attempt to extract more information from a set of sub-optimal ones. Feature selection – in terms of what information is extracted from the real world to make predictions – is perhaps the most critical aspect when building models.

In the end, these competitions are a great tool for someone who is new to data science – or for anyone who enjoys the “last mile” problem – but they are not going to make anyone, by themselves, a real-world data scientist. They are also great to have as a hobby, especially if you want to stay in touch with the latest python/R libraries for machine learning.