In part 2 of this post series we talked about how we could use web crawling to populate a pizza restaurant database with a proxy for “restaurant success”. Today I want to expand this series with some important issues with the current dataset that we need to very carefully consider when trying to use this data for real world predictions of restaurant success.
The free dataset I have used for this small project is this dataset available from Kaggle, with menu and location information for lots of US pizza restaurants. Although the data appears to be rather evenly distributed across the US, as we saw in part 1 of the series, the amount of data per state or zip code is in reality quite small: many states have only a handful of examples, and some have just a single one. The graph above shows the number of restaurants per state for states that contain at least 5 restaurants.
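To make the counting step concrete, here is a minimal sketch of how the per-state tally could be done. The field names `id` and `province` and the toy rows are assumptions for illustration, not the actual dataset, and I am assuming a restaurant can appear on more than one row, so it is counted once per id.

```python
from collections import Counter

def restaurants_per_state(rows, min_count=5, id_key="id", state_key="province"):
    """Count unique restaurants per state, keeping states with at least min_count."""
    state_by_id = {}
    for row in rows:
        # A restaurant may appear on several rows; keep one entry per id.
        state_by_id[row[id_key]] = row[state_key]
    counts = Counter(state_by_id.values())
    return {state: n for state, n in counts.most_common() if n >= min_count}

# Hypothetical toy rows standing in for the real data.
rows = (
    [{"id": f"tx{i}", "province": "TX"} for i in range(6)]
    + [{"id": "tx0", "province": "TX"}]  # duplicate row for the same restaurant
    + [{"id": f"ny{i}", "province": "NY"} for i in range(5)]
    + [{"id": "wv0", "province": "WV"}]
)

kept = restaurants_per_state(rows)
print(kept)  # prints {'TX': 6, 'NY': 5}
```

In this toy example the state with a single restaurant drops out entirely, which is exactly the kind of gap the real data has.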
The small sample size makes inference difficult: picking just 5 restaurants from a given state can heavily skew any conclusion we draw. For example, one state’s sample might contain only restaurants from the capital, while another’s might contain restaurants from the capital plus some additional cities. Since each state is bound to have a different distribution of success, such a small number of samples is no recipe for success. We can already see that our data does not even reflect the national frequencies properly: we have a lot of data for Texas but almost no data for West Virginia, even though this is the state where there are the most pizza places in the US (see here).
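To get a feel for how little 5 restaurants can tell us, here is a small sketch that puts a confidence interval around an observed closure rate. It uses the Wilson score interval, which is my choice for illustration and not something used earlier in the series; the counts are made up.

```python
import math

def wilson_interval(closed, total, z=1.96):
    """Approximate 95% Wilson score interval for a closure proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = closed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Same observed closure rate (20%), very different sample sizes (made-up counts).
small_low, small_high = wilson_interval(closed=1, total=5)
large_low, large_high = wilson_interval(closed=20, total=100)

print(f"5 restaurants:   {small_low:.2f} to {small_high:.2f}")
print(f"100 restaurants: {large_low:.2f} to {large_high:.2f}")
```

With 1 closure out of 5, the interval runs from roughly 4% to 62%, so the same observation is compatible with both a very safe and a very risky state. With 20 out of 100 it narrows to roughly 13% to 29%.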
Another problem, just as serious, is that failure probability is not constant over the life of a pizza parlor. Most pizza places fail during their first 2 years, while pizza places that have been open for longer have a dramatically smaller probability of failure. This is true of many businesses: a business that is able to succeed in the short term has basically found a market it can be sustainable in, and can therefore be expected to survive for far longer.
The problem then becomes that our proxy for success (whether a restaurant is open or not) is not useful if we don’t know the age of the pizza place: a region could show more closures simply because more restaurants opened there recently. This is likely why the probability of a restaurant being closed is higher in Washington and Nevada, places where restaurants have opened at a higher rate during the past couple of years. Places where few new pizza places open are then likely to appear more successful, but only because no one is taking the risk of opening a new place.
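The age bias can be illustrated with a toy calculation. The sketch below assumes every restaurant was open when the dataset was compiled and that its status was checked some years later; the hazard rates, the 2-year cutoff and the age lists are all invented numbers, chosen only so that both regions share exactly the same underlying risk.

```python
import math

def survival(age_years, early_hazard=0.35, late_hazard=0.05, cutoff=2.0):
    """Probability of surviving to a given age under a two-phase hazard (made-up rates)."""
    early = min(age_years, cutoff)        # years spent in the risky early phase
    late = max(age_years - cutoff, 0.0)   # years spent in the safer late phase
    return math.exp(-(early_hazard * early + late_hazard * late))

def closure_probability(age_at_listing, years_elapsed):
    """Chance that a restaurant open at listing time has closed years_elapsed later."""
    return 1 - survival(age_at_listing + years_elapsed) / survival(age_at_listing)

def expected_closed_fraction(ages_at_listing, years_elapsed=2.0):
    probs = [closure_probability(age, years_elapsed) for age in ages_at_listing]
    return sum(probs) / len(probs)

# Hypothetical ages (in years) of the listed restaurants in two regions.
many_new_places = [0.5, 0.5, 1, 1, 1.5, 3, 6]
few_new_places = [5, 8, 10, 12, 15, 20, 25]

young_rate = expected_closed_fraction(many_new_places)
old_rate = expected_closed_fraction(few_new_places)
print(f"region with many recent openings: {young_rate:.2f} closed")
print(f"region with few recent openings:  {old_rate:.2f} closed")
```

Both regions use the same survival curve, yet the one with many recent openings shows a much higher closed fraction. Without restaurant ages, the open/closed proxy cannot separate "bad place for pizza" from "lots of young restaurants".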
I would definitely not imagine that North Carolina or Indiana are the best places to open pizza restaurants (which is what this dataset is telling me), but this is just a consequence of the age bias and the lack of proper sampling of restaurants, as described above. If they really are, then we should be able to confirm this with an analysis that accounts for the sources of bias mentioned.
How could we then get a better and more realistic picture of where it’s best to open up a pizza restaurant? I have several ideas right now about how to construct a database that is more reliable and that gets past the two bias problems described above. Since we’re keeping this entirely free, as per my friend’s request, it’s not very easy, but whenever things are difficult, I’ve found it’s best to be more creative.