Where should I open my new pizza place? Part 2: Crawling for success.

In the first part of this series we explored how to find data for predicting the best US zip code in which to open a pizza restaurant. We ended up with a database containing thousands of pizza places, with location, name and menu information. However, this data is incomplete: it describes the restaurants but says nothing about their “success”. Today I will show you how we can use some web crawling to generate a variable that serves as a proxy for restaurant success.

import requests
import numpy as np
import pandas as pd
from time import sleep
from tqdm import trange

df1 = pd.read_csv("pizza_db.csv")

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) '
                   'Gecko/20100101 Firefox/61.0'),
    'Accept': ('text/html,application/xhtml+xml,application/xml;q=0.9,'
               'image/webp,*/*;q=0.8'),
}

df1["is_open"] = 1

for i in trange(len(df1)):
    # Query: restaurant name plus zip code
    q = "{} {}".format(df1["name"].iat[i], df1["postalCode"].iat[i])
    q = q.replace(" ", "+")
    url = "https://www.google.com/search?q={}".format(q)
    html = requests.get(url, headers=headers)
    # Google shows "Permanently closed" in the snippet for closed restaurants
    df1["is_open"].iat[i] = int("Permanently closed" not in html.text)
    # Random lag between requests to avoid getting blocked
    sleep(np.random.uniform(1, 5))

Now that we have a database that decently describes and represents pizza restaurants, our next challenge is to figure out which variables are related to “restaurant success”. But what does “success” actually mean? We must first establish some measure of success so that we can determine whether pizza restaurants are more or less likely to succeed in one location compared to another.

Ideally we would want something very tangible, like the pizza places’ net pretax income, their order volumes or similar information. However, this information is all private – and hence impossible to access at any scale – so we need to settle for a much simpler metric. The easiest metric – as it is publicly available – is whether the restaurant is open or closed. At the very least, a successful restaurant must be open.

The problem is that this information is not in our database, so we need to construct it from scratch. To do this I devised a very simple web crawler – shown in the code above – that performs systematic Google searches to populate the “is_open” variable in the dataframe. The crawler searches for the name of the restaurant plus its zip code and then checks whether the phrase “Permanently closed” appears in the search results. This is all we need, since Google always displays this phrase in the review snippet when a restaurant has been reported closed.
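To make the query construction concrete, here is the same step for a single restaurant (the name and zip code below are invented for illustration, not taken from the actual database):

```python
# Hypothetical example restaurant (not from the real database)
name, zipcode = "Tony's Pizza", "10001"

q = "{} {}".format(name, zipcode)  # "Tony's Pizza 10001"
q = q.replace(" ", "+")            # spaces become '+' for the URL
url = "https://www.google.com/search?q={}".format(q)
# url is now "https://www.google.com/search?q=Tony's+Pizza+10001"
```

The crawler then only has to test `"Permanently closed" in response.text` on the page returned for this URL.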


Of course, it’s not that simple. Google does not like systematic use of its search engine for this type of database construction, and it will block you if you crawl it heavily. To avoid this we must send a proper header (this is why I specified a “User-Agent” line) and introduce a random lag between requests. This works for our small case – just 3510 searches – but for any heavier crawling you will need to do some additional work to avoid being blocked. This is a nice article explaining how to avoid blocking while doing this type of crawl.
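To illustrate the throttling idea, here is a minimal sketch of a “polite” fetch helper with a random pre-request delay and exponential backoff on failures. The function name, delay bounds and retry count are my own illustrative choices, not part of the original crawler; the `fetch` argument stands in for `requests.get` so the logic can be exercised without hitting the network:

```python
import random
import time

def polite_get(url, fetch, min_delay=1.0, max_delay=3.0, max_retries=3):
    """Fetch `url` via `fetch`, with a random lag and exponential backoff.

    `fetch` is any callable returning an object with a `status_code`
    attribute; in the real crawler it would wrap `requests.get`.
    All delay bounds here are illustrative choices.
    """
    for attempt in range(max_retries):
        # Random lag so requests don't arrive at a fixed, bot-like rate
        time.sleep(random.uniform(min_delay, max_delay))
        resp = fetch(url)
        if resp.status_code == 200:
            return resp
        # Non-200 (e.g. HTTP 429) usually means throttling: back off longer
        time.sleep(max_delay * (2 ** attempt))
    return None  # gave up after max_retries attempts
```

In the crawler above you would call it as something like `polite_get(url, lambda u: requests.get(u, headers=headers))`, keeping the same headers as before.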

In the end we obtain a database with many descriptive variables and a single label indicating success (whether the restaurant is open or closed). In the next post in this series I will share an initial analysis of how location relates to success and whether we can build simple location-based models to estimate the probability of a successful pizza restaurant using only location data. We will also see how we can expand the meaningfulness of our location data using information from other databases (such as the US census).
