Where should I open my new pizza place? Part 4: A Google Trends approach

In the third part of this series of posts we talked about biases within the data and why a change of dataset was needed to tackle the “pizza restaurant” problem. Today I’m going to talk about how I addressed this bias problem by building a new dataset from an entirely different angle, using Google Trends. I will walk you through how I built the data as well as some of the information I found within it.

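from pytrends.request import TrendReq
import pandas as pd
from tqdm.auto import tqdm

states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
  "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho",
  "Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana",
  "Maine","Maryland","Massachusetts","Michigan","Minnesota","Mississippi",
  "Missouri","Montana",
  "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
  "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
  "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
  "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

all_df = []

# A single client can be reused for every request
pytrends = TrendReq(hl='en-US', tz=360)

for state in tqdm(states):
    kw_list = ["Pizza in {}".format(state)]
    pytrends.build_payload(kw_list, cat=0, timeframe='today 5-y', geo='', gprop='')
    df1 = pytrends.interest_over_time()
    df1.columns = ["T_{}".format(state), "isPartial"]
    # Drop the incomplete current week; the string comparison works whether
    # isPartial comes back as a boolean or as text
    df1 = df1[df1["isPartial"].astype(str) == "False"].copy()
    # 52-week (one-year) moving average of the weekly values
    df1["MA_{}".format(state)] = df1["T_{}".format(state)].rolling(52).mean()
    df1 = df1.drop(["isPartial"], axis=1)
    # Keep this state's frame so it can be merged with the others below
    all_df.append(df1)

# One wide table: a T_ and an MA_ column per state, indexed by date
all_df = pd.concat(all_df, axis=1)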

The original data I wanted to use had two main problems. First, it was not representative enough of US pizza restaurants and second, it did not contain information about restaurant age. These two problems introduced an insurmountable amount of bias that prevented any inferences at reasonable confidence levels. Given that all the freely available datasets I could find about restaurants had the same problems, I decided to change how I tackled the question.

I thought about going for a more indirect answer to the question. Given that I could not find free data to answer “where should I open my new pizza place?” using data from pizza restaurants themselves, I thought I could look for data that represents how much people in different areas want pizza. The simplest way to do this seemed to be to use the Google Trends tool to build a dataset of how people have searched for pizza in their area over time. Google Trends provides the relative search frequency of a term as a function of time.
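To make "relative" a bit more concrete, here is a minimal sketch of the kind of scaling involved. It assumes, as a simplification, that each series is divided by its own peak and multiplied by 100. The helper name `to_relative_scale` and the raw counts are made up for illustration, and the real Google Trends pipeline (sampling, rounding) is more involved than this.

```python
def to_relative_scale(raw_counts):
    """Scale a list of counts so its largest value becomes 100.

    Hypothetical helper: a simplified model of a 0-100 relative index,
    not the actual Google Trends computation.
    """
    peak = max(raw_counts)
    if peak == 0:
        return [0 for _ in raw_counts]
    return [round(100 * value / peak) for value in raw_counts]


# Made-up weekly search counts for two regions of very different size
small_region = [20, 35, 50, 40]
large_region = [2000, 3500, 5000, 4000]

small_scaled = to_relative_scale(small_region)
large_scaled = to_relative_scale(large_region)
```

Under this simplified model both regions end up with the same scaled series even though one has a hundred times the raw volume, so only the shape of each curve over time carries information.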

Using the code above, which relies on the pytrends library, I found the relative search frequency of the term “Pizza in state” for every state in the US. Note that this means the search must contain the terms “Pizza”, “in” and the state name; it does not exclude other terms or require these to appear in that order. The image above shows the time series for each state from 2014 to the present (in blue) plus the one-year moving average of these values. Because the values are naturally noisy, the one-year moving average shows the long-term trends in the data more clearly.

Since these values are relative (each series goes from 0 to 100), it is not very useful to compare them to each other in absolute terms (a 50 in Nebraska might be equivalent to just a 5 in New York), but it is very useful to look at whether the rolling average values have tended to increase or decrease during the past year. If we calculate the one-year difference of the one-year moving average and then plot the results, we get the chart shown below.
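The one-year difference of the one-year moving average can be sketched as follows. This is one reasonable reading of the calculation rather than the exact code behind the chart: it assumes weekly data, a 52-week window, and that only the most recent value of the difference is kept per column. The function name and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def yearly_change_of_moving_average(trends, window=52):
    """Latest `window`-week change of the `window`-week moving average.

    Returns one value per column, sorted from largest increase to
    largest decrease.
    """
    moving_avg = trends.rolling(window).mean()
    yearly_diff = moving_avg.diff(window)
    return yearly_diff.iloc[-1].sort_values(ascending=False)


# Synthetic weekly series (not real Google Trends data): one rising, one falling
weeks = pd.date_range("2014-01-05", periods=156, freq="W")
synthetic = pd.DataFrame(
    {
        "T_Rising": np.linspace(20, 80, 156),
        "T_Falling": np.linspace(80, 20, 156),
    },
    index=weeks,
)

change = yearly_change_of_moving_average(synthetic)
n_rising = int((change > 0).sum())  # number of columns with a positive change
```

Applied to the `T_` columns of the table built earlier, the sorted result gives the ordering of the states, and counting the positive entries gives the number of states whose yearly average is still rising.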

The graph above is very interesting as it shows us that the search for “Pizza in state” has been mostly declining in importance across the board, with only 14 states showing a steady increase during the past year in their rolling yearly average. This is also clear in the “per state” plots as it’s easy to see that the majority of lines are slowly declining.

Interestingly, the states at the top of our graph overlap significantly with the ten states that have the most restaurants in the US (see here). This is especially true of West Virginia, which has the most restaurants per capita and appears second in our plot. If we normalized by population, the gap between it and South Dakota, which leads our plot, would be even larger, as South Dakota has a population of just shy of 900 thousand while West Virginia is closer to 1.8 million.
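The effect of that population normalization can be illustrated with a short piece of arithmetic. The populations below are the rounded figures quoted above. The change values are placeholders set to 1.0 for both states (they are not the values from the chart), chosen only to isolate what the division by population does.

```python
# Rounded populations quoted in the text
populations = {"South Dakota": 900_000, "West Virginia": 1_800_000}

# Placeholder values, NOT the real results: identical on purpose so that
# only the effect of the population adjustment is visible
placeholder_change = {"South Dakota": 1.0, "West Virginia": 1.0}

# Change per million inhabitants
change_per_million = {
    state: placeholder_change[state] / (populations[state] / 1_000_000)
    for state in populations
}

ratio = change_per_million["South Dakota"] / change_per_million["West Virginia"]
```

With equal raw values, South Dakota comes out at roughly twice West Virginia's figure after the adjustment, so any lead it already has in the plot would be about doubled.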

The above does not necessarily mean that South Dakota is a much better place to open a pizza restaurant, but it does show that interest in pizza in this state has been increasing at the fastest pace, in relative terms, compared to all the other states in the union, at least during the past year, as measured by relative internet search frequency. Even though this dataset obtained using Google Trends cannot offer a definitive quantitative answer to the question, it does help us discard a lot of states where interest has been heavily declining in recent times (for example California and Utah).

It is also worth noting that this dataset, like all others, is not free of bias. One important problem, even if we’re only talking about interest in pizza, is that the search “Pizza in state” might be skewed if the state contains very large cities (someone in LA might be much more prone to google “Pizza in LA” than “Pizza in California”). There is also a problem when state and city names overlap (like “Pizza in Washington”, for example). What we are interpreting as a lack of interest might therefore just be a shift towards more specific ways of searching for pizza places.

I will try to expand this “Google Trends” technique using the names of the biggest cities and capitals in the US. This will allow us to better understand the city vs. state bias and clarify whether West Virginia and South Dakota are really great places to open up pizza restaurants.
