Understanding rental markets using freely available AirBnB data: Part 3

In part 2 of this series I used Seattle airbnbinsider data to predict whether a given property would get above-average reviews per month, with almost 80% accuracy. In this post we will build a regression model on the same data to see how well we can predict the actual number of reviews per month an Airbnb gets. We will also look at a simple example of how changing one variable affects our prediction target, and how this can help us think about pricing for Airbnb properties.

Real reviews per month as a function of the predicted value for the training and testing sets

Using an LGBM Regressor with the same input variables as in my last post, we can build a model that attempts to predict the actual reviews per month for our properties. The image above shows the predictions for training and testing sets created from a simple 70/30 randomized split of the data. The testing set curve has an R2 of 0.48, which, although far from perfect, shows a decent ability to make predictions.
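To make the evaluation concrete, here is a minimal sketch of a 70/30 randomized split and of the R2 calculation. It is written in plain Python so it does not depend on any particular library. The function names, the seed and the toy numbers are assumptions for illustration, not the code behind the results above; in practice the predictions would come from the fitted LGBM Regressor.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of rows and split it into 70% training / 30% testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy numbers (hypothetical, not taken from the Seattle data).
actual = [0.5, 1.0, 2.0, 4.0]
predicted = [1.5, 1.5, 2.0, 3.5]

train, test = split_70_30(range(10))
print(len(train), len(test))
print(r_squared(actual, predicted))
```

An R2 of 1 means the predictions match the real values exactly, while a model that always predicts the mean of the real values scores 0, which is why 0.48 sits roughly halfway between useless and perfect.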

Our main problem is with low reviews-per-month values, while we do a better job of predicting values further up the spectrum. For lower values we tend to predict far more reviews per month than are actually observed in the data. This deviation exists even in our training data, so it is not surprising that we show the same difficulty with these predictions in the testing set. However, it is clear that we are far from perfect all across the curve, as our current R2 value is still well below the ideal value of 1.
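One way to put a number on this over-prediction at the low end is to group the cases by their actual reviews per month and average the error (prediction minus actual) within each group. The sketch below assumes plain lists of actual and predicted values; the function name, the bin edges and the toy numbers are all hypothetical.

```python
def mean_error_by_bin(actual, predicted, edges):
    """Average (predicted - actual) for cases whose actual value is in each [lo, hi) bin."""
    result = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        errors = [p - a for a, p in zip(actual, predicted) if lo <= a < hi]
        # None marks an empty bin so it is not confused with a zero error.
        result[(lo, hi)] = sum(errors) / len(errors) if errors else None
    return result


# Toy numbers (hypothetical): low values are over-predicted, high ones slightly under.
actual = [0.1, 0.3, 0.8, 2.5, 4.0, 6.0]
predicted = [1.1, 0.9, 1.2, 2.7, 3.8, 5.5]

for bounds, err in mean_error_by_bin(actual, predicted, [0, 1, 3, 10]).items():
    print(bounds, err)
```

A clearly positive average in the lowest bin, for both the training and the testing set, would correspond to the behavior described above.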

Variable importances for the LGBM Regressor model

Variable importance values behave very similarly to those of our classifier model, with latitude and longitude being the most important variables, followed by price. Since price and cleaning fees are still very important variables, we can study the effect they have on our testing set predictions when we change them in different ways. This can help us build a VERY primitive pricing model, where we change the price of an Airbnb to see how we can maximize the reviews per month it gets.

The graph below shows the ratio of the new prediction to the original prediction after a 20% reduction in price. For properties with high reviews per month, a 20% reduction in price forecasts little change, while at the lower end of the spectrum we sometimes predict increases in the number of reviews of almost 10x. This is definitely not real (it would be a wonderful discovery that someone had deeply mispriced their rental) but rather a fluke related to how terrible the model is at predicting low reviews-per-month values. This can be even worse with cleaning_fee, where one property is predicted to increase its reviews per month by 1600x with a 20% reduction in price!

New prediction/Original prediction as a function of the original prediction after a price reduction of 20% for 200 cases in the testing set

With a regression model whose dispersion is this large, we cannot trust the prediction for any single case in plots like the one above, but we can look at some general trends and make some very simple inferences. For example, the data does show that, in very general terms, a 20% reduction in price tends to increase reviews per month, especially for properties at the lower end of reviews per month. By how much is not something we can say with this model, but we can at least see a general trend in that direction.
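The what-if experiment above can be sketched as follows: scale one input column, predict again, and divide the new prediction by the original one. This is only a sketch under stated assumptions. The `prediction_ratios` helper and the `toy_predict` function are hypothetical stand-ins (the real predictions would come from the fitted LGBM Regressor), and the rows are made-up values rather than Seattle listings.

```python
from statistics import median


def prediction_ratios(predict, rows, column, factor):
    """Return new_prediction / original_prediction after scaling one column.

    Assumes the original prediction is never zero.
    """
    ratios = []
    for row in rows:
        original = predict(row)
        changed = dict(row)  # copy so the input row is left untouched
        changed[column] = row[column] * factor
        ratios.append(predict(changed) / original)
    return ratios


def toy_predict(row):
    """Hypothetical stand-in for the fitted model's predict call."""
    return 100.0 / (row["price"] + row["cleaning_fee"])


# Made-up rows; only the two columns the toy model uses are included.
rows = [
    {"price": 80.0, "cleaning_fee": 20.0},
    {"price": 150.0, "cleaning_fee": 50.0},
    {"price": 300.0, "cleaning_fee": 100.0},
]

ratios = prediction_ratios(toy_predict, rows, "price", 0.8)  # 20% price cut
print(ratios)
print(median(ratios))
```

Summarizing the ratios with a median rather than a mean keeps a handful of extreme cases, like the 10x and 1600x flukes mentioned earlier, from dominating the general trend.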

In the next post of this series we will look at improving this regression model, hopefully to the point where we can do some deeper analysis of pricing for Airbnb properties in Seattle.
