Practical business questions matter a great deal in data science, since answering them is ultimately what a data scientist’s job at a private company comes down to. Taking me up on my offer of free consulting, a friend asked me to answer a simple question: what is the best zip code in which to open a pizza restaurant in the United States? He provided no data, but wanted to know whether I could tell him anything useful using freely available data.
I decided to take on the challenge and find out if I could give him any actionable data insights regarding pizza restaurants in the US. Today I want to talk about the first part of this project, finding the data, to show you some of the problems when searching for data and some of the ways you can find useful information.
The first potentially useful source I came upon was a dataset released by Yelp containing information about the restaurants reviewed on their site. The Yelp dataset is a subset of all the Yelp data, and it contains a wealth of information about restaurants, including their ratings, whether they are open or closed, how many reviews they have and where they are located.
The Yelp dataset contains information on 1,135 US pizza restaurants (make sure you filter by US zip codes). Since we know that there were 76,993 registered pizza restaurants in the US in 2018, this sample represents around 1.5% of all of them. Assuming that the restaurants were drawn randomly from the entire population, we should be able to make inferences within a reasonable margin of error, especially if we avoid making any inferences about areas that are not broadly represented in the dataset.
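As a rough check on those numbers, here is a minimal sketch of the arithmetic. It assumes a simple random sample and uses the textbook worst-case margin of error for a proportion (p = 0.5, 95% confidence) with a finite population correction. That formula is my choice for illustration and is only meaningful if the random-draw assumption actually holds.

```python
import math

# Figures quoted in the post.
sample_size = 1135         # US pizza restaurants in the Yelp subset
population_size = 76_993   # registered US pizza restaurants in 2018

sample_fraction = sample_size / population_size


def margin_of_error(n, N, p=0.5, z=1.96):
    """Approximate margin of error for a proportion under simple random
    sampling, with the finite population correction applied.

    p=0.5 is the worst case; z=1.96 corresponds to ~95% confidence.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    finite_population_correction = math.sqrt((N - n) / (N - 1))
    return z * standard_error * finite_population_correction


moe = margin_of_error(sample_size, population_size)
print(f"sample fraction: {sample_fraction:.2%}")
print(f"worst-case margin of error: {moe:.2%}")
```

Under these assumptions the margin of error for a nationwide proportion comes out at just under three percentage points. It would be much wider for any single state or zip code, because far fewer restaurants fall into each of those.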
However, looking more carefully at the pizza restaurant data, we see that only five states are present (‘PA’, ‘NV’, ‘OH’, ‘NC’, ‘WI’), which makes it obvious that we don’t have a random draw over all US pizza restaurants but a fairly specific sampling of some states. This means that the Yelp dataset can probably tell us a lot about pizza restaurants in these states but definitely not in the US as a whole. This is a shame, as the Yelp dataset contains information about restaurants closing, which we could use as a proxy for failure.
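A coverage check like this is cheap to run before trusting any sample. The sketch below counts pizza restaurants per state and lists which expected states have none. The rows and the field names ("state", "categories") are made up for illustration and are not taken from the actual Yelp schema.

```python
from collections import Counter

# Hypothetical rows standing in for the Yelp business table; the field
# names ("state", "categories") are assumptions for illustration.
businesses = [
    {"name": "Pizzeria A", "state": "PA", "categories": "Pizza, Restaurants"},
    {"name": "Pizzeria B", "state": "PA", "categories": "Restaurants, Pizza"},
    {"name": "Pizzeria C", "state": "NV", "categories": "Pizza"},
    {"name": "Diner D", "state": "OH", "categories": "Diners, Restaurants"},
]


def pizza_counts_by_state(rows):
    """Count rows whose categories mention pizza, grouped by state."""
    return Counter(row["state"] for row in rows if "Pizza" in row["categories"])


def uncovered_states(counts, expected_states):
    """Return the expected states that have no pizza restaurants in the sample."""
    return sorted(set(expected_states) - set(counts))


counts = pizza_counts_by_state(businesses)
missing = uncovered_states(counts, ["PA", "NV", "OH", "NY", "CA"])
print(counts)
print(missing)
```

If the list of uncovered states is long, the sample cannot support statements about the country as a whole, no matter how many rows it has.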
We also have other databases, like this one by Datafiniti, which provides a much more detailed and specific description of pizza restaurants in the US. This database is much closer to a random draw of US restaurants – as evidenced by the map shown in the first image in the post – so we can make inferences from this data with more comfort. The data is also tagged with longitude and latitude, so we can use this information to get US zip codes and then cross-reference it with other zip-coded databases (such as the US census). That gives us information that is very relevant but not obtainable from this data alone, like the number of pizza restaurants normalized by the number of people who live in each zip code. Of course there are biases present in this database (for example, it does not include restaurants that had closed before it was built), so we need to consider that when we use it for analysis.
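To make the normalization concrete, here is a minimal sketch that assumes the coordinates have already been converted to zip codes (that reverse-geocoding step is not shown). The zip codes and population figures are made-up placeholders, not census values. Zip codes are kept as strings so that leading zeros are not lost.

```python
from collections import Counter

# Placeholder inputs for illustration only: one zip code per restaurant,
# and a made-up population lookup standing in for census data.
restaurant_zips = ["10001", "10001", "60614", "94110"]
population_by_zip = {"10001": 25_000, "60614": 70_000, "94110": 0}


def restaurants_per_10k(zips, population):
    """Pizza restaurants per 10,000 residents for each zip code.

    Zip codes with missing or zero population are skipped rather than
    guessed at.
    """
    counts = Counter(zips)
    rates = {}
    for zip_code, n_restaurants in counts.items():
        people = population.get(zip_code)
        if not people:
            continue
        rates[zip_code] = n_restaurants / people * 10_000
    return rates


rates = restaurants_per_10k(restaurant_zips, population_by_zip)
for zip_code, rate in sorted(rates.items()):
    print(zip_code, round(rate, 2))
```

A low rate could point to an underserved area, but it could just as easily reflect gaps in the restaurant data, so it is a starting point for questions rather than an answer.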
The main problem with this second dataset is that it contains no proxy for success, as we just have pricing and menu information. Without a success label we cannot train a model that predicts whether a pizza place will succeed. We can only explore the data to learn more about pizza places (what is generally known as unsupervised learning).
However, the useful thing about this second database is that we now have the zip code and name of a bunch of pizza restaurants. We could now use Google to obtain information about these places by crawling the output the search engine gives for each one. The second image in this post shows you that Google gives us reviews as well as open/closed status for a restaurant if we ask for it with both its name and zip code. If we wanted to spend a small amount of money (like 50 USD), we could also buy a database containing names and locations for a lot of pizza restaurants. This would make the above crawling exercise much more likely to give useful results (note that I have no affiliation whatsoever with this data selling place).
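The first step of such a crawl is just turning each name and zip code pair into a search query. The sketch below only builds the query URL. The restaurant name is a made-up example, and the fetching and parsing of results is deliberately left out, since that depends on the tool you use and on what the search engine's terms allow.

```python
from urllib.parse import urlencode


def build_search_url(name, zip_code, base="https://www.google.com/search"):
    """Build a search URL that asks for a restaurant by name and zip code."""
    query = f"{name} {zip_code}"
    return f"{base}?{urlencode({'q': query})}"


# Hypothetical restaurant, for illustration only.
url = build_search_url("Example Pizza Co", "15213")
print(url)
```

Passing the zip code as a string keeps leading zeros intact, and `urlencode` takes care of escaping spaces and punctuation in restaurant names.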
Although the above wasn’t an exhaustive data search by any means – and I look forward to any other sources or suggestions within the comments – it did provide us with an initial idea of where we can find information about pizza places. It also showed how we might use this information to begin to better understand the problem and eventually create a model that attempts to predict pizza place success. Since most restaurants close very early in their lifetimes, the ability to predict whether a business will remain open or not – especially with a very low false positive rate – is going to go a long way in helping us open a restaurant with higher chances of success.
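To pin down what a false positive means here, the sketch below treats "stays open" as the positive class, so a false positive is a restaurant the model expected to stay open that actually closed. That labeling is my assumption, and the lists are invented for illustration.

```python
def false_positive_rate(actual_open, predicted_open):
    """Share of restaurants that actually closed but were predicted to stay open."""
    pairs = list(zip(actual_open, predicted_open))
    false_positives = sum(1 for actual, pred in pairs if pred and not actual)
    true_negatives = sum(1 for actual, pred in pairs if not pred and not actual)
    actual_negatives = false_positives + true_negatives
    if actual_negatives == 0:
        raise ValueError("no closed restaurants in the data, rate is undefined")
    return false_positives / actual_negatives


# Invented outcomes: True means the restaurant stayed open.
actual = [True, False, False, True, False, False]
predicted = [True, True, False, True, False, False]

fpr = false_positive_rate(actual, predicted)
print(fpr)
```

A low value matters more than overall accuracy in this setting, because each false positive corresponds to recommending a location where the restaurant goes on to fail.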
I hope you liked this exercise in data searching. When solving real world data problems you want to be creative: stitch together data sources, find new information and use the information you already have in new ways. In the next part we will look at how our data looks after we do some web crawling and get some initial insights about pizza and population data.