Understanding rental markets using freely available Airbnb data: Part 1

Rental markets generate large amounts of data, and extracting value from it can help everyone involved. A better constructed and better priced market can benefit the people who rent, the people who own real estate and the management companies that aggregate and manage these properties. Free data sources for rental markets are not very common, but we do have Inside Airbnb, which provides free-to-use data from Airbnb that can help us gain insights into how these rental markets work. This is my first dive into rental markets and the Inside Airbnb data, so feel free to comment if you think I have made any mistakes processing or using this information!

Distribution of the number of reviews per month

As a small initial experiment I looked at the Inside Airbnb data for Seattle. Given the structure of the data, I wanted to see if I could successfully predict the number of reviews per month a given listing received. This is a useful quantity to predict since it is expected to be significantly correlated with the number of stays, and therefore with the money earned by the person who owns the property and whoever manages it. It is also useful because, as a landlord for example, we might be interested in owning a property where the number of reviews can increase as quickly as possible, rather than being solely focused on the occupancy rate (which sadly is not available in the dataset).

The first step was to remove variables that are only going to be spuriously related to the reviews per month, or that are just other expressions of it. I therefore removed the following variables from my dataset: “number_of_reviews”, “name”, “host_name”, “last_review”, “host_id” and “id”. With the data now cleaner for prediction, I proceeded to one-hot encode all of the remaining string variables (“neighborhood”, “neighborhood_group” and “room_type”). This basically means that instead of having one “room_type” column, I now have one column for each possible value of “room_type”, holding a 1 if the row has that value and a 0 otherwise. I also removed all rows in the dataset where “reviews_per_month” was zero, since this just means no valid data is available for evaluation.
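A minimal sketch of these cleaning steps with pandas follows. The tiny table is made up for illustration (it is not the Seattle data), the `prepare` helper is a hypothetical name, and the column names are the ones quoted above, which may be spelled differently in the actual download.

```python
import pandas as pd

# Made-up stand-in for the listings table; values are illustrative only.
listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Cozy loft", "Big house", "Shared bunk", "Studio"],
    "host_id": [10, 11, 12, 10],
    "host_name": ["Ann", "Bob", "Cat", "Ann"],
    "neighborhood_group": ["Downtown", "Ballard", "Downtown", "Capitol Hill"],
    "neighborhood": ["Belltown", "Adams", "Pioneer Square", "Broadway"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Shared room", "Private room"],
    "price": [120, 250, 40, 80],
    "minimum_nights": [2, 3, 1, 1],
    "number_of_reviews": [30, 0, 12, 55],
    "last_review": ["2019-08-01", None, "2019-07-15", "2019-09-01"],
    "reviews_per_month": [1.5, 0.0, 0.4, 3.2],
    "calculated_host_listings_count": [2, 1, 1, 2],
    "availability_365": [200, 365, 90, 30],
})


def prepare(df):
    """Drop leaky/identifier columns, filter empty rows, one-hot encode strings."""
    drop_cols = ["number_of_reviews", "name", "host_name",
                 "last_review", "host_id", "id"]
    out = df.drop(columns=drop_cols)
    # Keep only listings with a positive review rate; the comparison is also
    # False for missing values, so those rows are dropped too.
    out = out[out["reviews_per_month"] > 0]
    # One 0/1 column per distinct value of each string variable.
    out = pd.get_dummies(
        out,
        columns=["neighborhood", "neighborhood_group", "room_type"],
        dtype=int,
    )
    return out.reset_index(drop=True)


features = prepare(listings)
print(features.columns.tolist())
```

Encoding after filtering means that a category appearing only in dropped rows never gets a column of its own.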

Feature importance from a simple single decision tree classification (max_depth=5, criterion="entropy", class_weight="balanced")

After the dataset was clean I needed to decide exactly what to predict. A regression approach to predict the exact reviews_per_month was not likely to be an easy problem given the characteristics of the data at hand (7795 examples with 114 data columns), so I was more likely to succeed by simplifying the problem to make my inferences stronger. I therefore decided to predict whether the reviews per month were going to be higher than the average value of the dataset. The first image in this post shows the distribution of reviews per month and the average (vertical blue line). In this particular distribution 60% of the values are below the blue line, so simply guessing "below average" every time would already be right 60% of the time. I need an accuracy of more than 60% to have any real edge over that naive guess.
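The arithmetic behind that threshold can be sketched in a few lines of NumPy. The ten values below are made up and chosen so that 60% of them fall below the mean, mirroring the proportion quoted above; the variable names are illustrative.

```python
import numpy as np

# Made-up reviews-per-month values, not taken from the Seattle data.
reviews_per_month = np.array([0.2, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 3.5, 5.0, 7.5])

# Binary target: 1 if a listing is above the dataset average, 0 otherwise.
mean_value = reviews_per_month.mean()
above_average = (reviews_per_month > mean_value).astype(int)

# Accuracy of always predicting the more common class.
share_above = above_average.mean()
baseline_accuracy = max(share_above, 1 - share_above)

print(f"mean={mean_value:.2f}, baseline accuracy={baseline_accuracy:.0%}")
```

Because the distribution is right-skewed, the mean sits above the median, which is why the "below average" class ends up being the larger one.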

With this in mind I started with a decision tree model, so that I could understand which features were important overall and get an idea of what data interactions might lead to better predictions. I tuned a decision tree classifier using stratified 10-fold cross validation, obtaining an average accuracy of around 70%, a clear edge over the 60% baseline. More importantly, the decision tree classifier offers significant insight into things like variable importance and, even better, lets us look directly at how the tree is constructed, something far less tractable when we move to more complex tree-based methods.
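A sketch of this evaluation with scikit-learn is shown below. The hyperparameters are the ones given in the feature importance caption; the data is synthetic (generated with `make_classification` at a roughly 60/40 class split), so the printed accuracy says nothing about the Seattle result, and the shuffling and random seeds are my own assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded listings, with about 60% of rows in class 0.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4,
    weights=[0.6, 0.4], random_state=0,
)

tree = DecisionTreeClassifier(
    max_depth=5, criterion="entropy", class_weight="balanced", random_state=0
)

# Stratified folds keep the class proportions the same in every split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

print(f"mean accuracy over {len(scores)} folds: {mean_accuracy:.3f}")
```

Stratification matters here because the classes are unbalanced: without it, individual folds could drift away from the 60/40 split and make the per-fold accuracies harder to compare with the baseline.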

Price as a function of the reviews per month. The relationship is clearly far from linear

The second image in this post shows the importance of every variable with an importance metric greater than zero. These are the variables that are most effective in generating a good classification split. The amount of availability plays a key role, since we clearly expect properties that are available more often to get more reviews. So do the minimum nights, the calculated number of listings for the same host, the price and the room type of the rental property. These variables all make intuitive sense, since we would expect rental properties that are more expensive to be rented less and properties that belong to more experienced owners (people with more properties on Airbnb) to do better overall, as they are likely to be better at handling the properties themselves.
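Extracting the non-zero importances from a fitted tree might look like the sketch below. It uses scikit-learn's `feature_importances_` attribute on synthetic data with placeholder feature names, so the ranking it prints is not the one shown in the plot.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder column names (not the Airbnb variables).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(
    max_depth=5, criterion="entropy", class_weight="balanced", random_state=0
).fit(X, y)

# Importances are normalized to sum to 1; features never used in a split get 0.
importances = pd.Series(tree.feature_importances_, index=names)
nonzero = importances[importances > 0].sort_values(ascending=False)

print(nonzero)
```

With a shallow tree and more than a hundred one-hot columns, most features never appear in any split, which is why filtering on importance greater than zero leaves a short list.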

Note though that these relationships are far from linear in nature and require a lot of additional insight to be useful (see the third plot in this post for an example using price). Just because someone rents a property at 100 USD instead of 200 USD, that does not mean that the person will automatically get way more reviews per month. The type of room, the availability, the minimum nights and so on all have a big influence on the part that price plays in the mix.

The decision tree that was created from my model building.

One of the key advantages of building a simple decision tree model first is how much we can learn just by looking at how the actual tree is built. The image directly above shows the graphical representation of the decision tree that was constructed from this data. The first split is an availability based split, followed by a price split to the left and a split on the calculated number of listings for the same host to the right. This is telling us that for places with no availability restrictions (0 value) the price is the next most important thing, while if there are availability restrictions you need to look at a proxy of the host’s experience. As the tree goes deeper we get into the other variables, eventually asking questions as specific as whether the listing is located in a given neighborhood or not.
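For readers who prefer text to a picture, the same kind of inspection can be sketched with scikit-learn's `export_text`. The data here is synthetic and the four column names are only borrowed as labels, so the splits printed are not the ones in the tree above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the names below are labels only, not real Airbnb values.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["availability_365", "price",
         "calculated_host_listings_count", "minimum_nights"]

tree = DecisionTreeClassifier(
    max_depth=3, criterion="entropy", class_weight="balanced", random_state=0
).fit(X, y)

# Text rendering of the tree, one line per node, indented by depth.
rules = export_text(tree, feature_names=names)

# Node 0 is the root; tree_.feature holds the feature index used at each node.
root_feature = names[tree.tree_.feature[0]]

print(f"first split is on: {root_feature}")
print(rules)
```

Reading the output top to bottom follows the same path as reading the image left to right: the root split first, then the splits on each branch.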

As you can see, the topic is very interesting, and a ton of potential insights can be gained just from looking at one city’s data and understanding the outputs of a simple model such as a decision tree. With this information we could already start helping someone make a better decision about what the ideal property for the highest number of reviews per month would look like in Seattle. In the next part I’ll look into more complex models and the learning curves we get for this problem.

