Airbnb Price Prediction: Feature Engineering and Unstructured Image Analysis | Making Models (III)
Can we predict Airbnb rental price from images alone?
This is part three of a series documenting the end-to-end process of developing a generalized linear model that predicts Airbnb rental price from a number of features. As a whole, the series covers dataset analysis, advice for additional data collection via web scraping, feature engineering (specifically for unstructured images), model selection, and results. Readers should find this document helpful when developing their own predictive models, or when looking for a framework to organize their thoughts. Most importantly, I hope to demystify some of the process behind “data science” by breaking down a typical workflow into distinct, modular activities that can be reproduced for many types of problems. If you find it helpful, or have a question, please feel free to leave a comment below and I will answer to the best of my ability.
In the last section we supplemented our dataset with additional information from a third party, the United States Census Bureau. Today, we will take a macroscopic approach to our solution design and apply transformations that improve our model’s ability to use these additional data points. We will also explore an approach to the interesting problem of turning an image into features that our model can use.
This part will be broken into two sections:
- Basic image analysis using OpenCV
- Feature engineering &amp; normalization of variables
At the end of this section, we will have transformed our rental thumbnail images into quantifiable metrics, as well as combined variables to create normalized values that will improve our predictive ability.
Image Analysis using OpenCV
As in previous sections, we will be using Python. Today, we have installed the OpenCV package — an open source tool that allows us to deconstruct images and enable computer vision functionality for our model.
Recall from our previous exercise that we created a sample user experience to help determine variables that may be important indicators of log_price. Although we were able to gather information about the general neighborhood, what about the apartment itself? A potential renter relies heavily on photos of the space before deciding to book, so we must try to replicate this judgement with a limited sample of images.
When starting Airbnb, co-founder and CEO Brian Chesky would offer to take professional photos of hosts’ homes to increase click-through and booking rates. He went from house to house with a borrowed camera, improving the user experience with high-resolution photos of homes. Hosts commented on the increase in traffic after this change. That is no coincidence: some information stored in these photos influences a renter’s decision.
Taking a step back, we can think of a few things in images that may impact the price:
- Are the floors carpet or hardwood?
- Are the walls painted or wallpaper? Paintings or posters?
- Does the host have plants? Are they alive and healthy?
- Is the picture well-lit and inviting?
To keep our scope narrow, we will pull four values from the image that may be of value:
1. Brightness
2. # of red pixels
3. # of blue pixels
4. # of green pixels
[Author’s note: I also considered including image resolution; however, because our provided dataset compresses all images to the same size, we effectively lost information about the original image. For datasets where images vary in size, I believe this would be a useful attribute to include.]
These four values should tell a story about our image that can act as a rough approximation of our human judgement. To perform this analysis, we pull in the OpenCV package provided free of charge for both academic and commercial use.
Let’s take a look at what is going on under the hood for each of these functions.
getImage accepts a URL, the format in which our dataset provides the photos. To retrieve each photo we use the requests package to call the webpage and download the image to the current working directory, freeing it from local memory afterwards.
imgDetails uses the image we just downloaded and loads it via OpenCV’s built in reader. This function can tell us information about the size and resolution of each image. Since our dataset provided each image in a standard form, we won’t include these results in our final model.
channelSplit breaks down each image into an array of pixels and lists the B, G, R makeup of each. To summarize these results, we take the mean value of each channel, giving an abbreviated view of the image composition.
getBrightness first converts the image to greyscale, then extracts the brightness of each pixel. These values are averaged to give an idea of how light or dark each image appears.
Let’s look at an example for brightness:
Altogether, this suite of functions allows us to represent a single image as four separate features. Although basic, it is important to include these attributes given the known effect that images have on price, as in the example above.
Feature Engineering &amp; Normalization of Variables
By now we have considerably widened our dataset with external variables and characteristics that we believe are important for predicting log_price. However, these variables vary wildly in magnitude; from percentages to household incomes, we now have columns whose values range from zero to the tens of thousands. To ensure that each column is weighted appropriately, we must critically examine how each variable impacts price and adjust accordingly. Let’s dive into some examples:
tract_median_fam_income seems like it would be a good indicator of the type of neighborhoods the tract encompasses.
median_house_age also may help us understand the types of accommodations we will be renting.
Both of these variables seem important, but their values are totally different. How do we allow our model to compare these features equally, while still preserving our intent?
For the same reason that we work with log_price rather than raw price, we can apply the same transformation to many of these variables. Conveniently for analysis, small changes in the natural log of a variable correspond approximately to percentage changes in the original variable.
Let’s make this transformation to our examples:
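A minimal sketch of the log transformation, assuming a pandas DataFrame with the two columns named above; the numeric values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names follow the article, values are illustrative.
df = pd.DataFrame({
    "tract_median_fam_income": [42000, 85000, 310000],
    "median_house_age": [12, 48, 95],
})

# np.log compresses each column into the same order of magnitude
# while preserving the relative differences between rows.
df["log_income"] = np.log(df["tract_median_fam_income"])
df["log_house_age"] = np.log(df["median_house_age"])
print(df[["log_income", "log_house_age"]])
```

After the transform, an income column that spanned tens of thousands and an age column that spanned decades both sit in the single digits to low teens, so neither dominates the model purely by scale.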
The range of our values has now decreased to within the same order of magnitude, but we have maintained important differences between each value.
Getting to the root of it, what we are looking for is to understand where each feature falls on the spectrum of possible values. That is to say, it is more important to understand how a variable compares to others in its range than to unrelated variables.
Another transformation we could apply is a comparison of each listing’s feature value against the average. For example, knowing that a neighborhood mostly has houses that are 50 years old is a piece of information, but it does not by itself explain price. If we know that this neighborhood is 10 years newer than average, however, we can hypothesize that this would push our price higher. Dividing the listing’s value by the mean, the value for this feature would be less than 1, indicating that the house is younger than average. Let’s try this in practice:
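A minimal sketch of this mean-comparison feature, assuming pandas; the derived column name age_vs_mean and the sample ages are my invention:

```python
import pandas as pd

# Hypothetical neighborhood house ages for five listings.
df = pd.DataFrame({"median_house_age": [30, 40, 50, 60, 70]})

# Divide each listing's value by the column mean: a ratio below 1
# means the neighborhood is newer than average, above 1 older.
df["age_vs_mean"] = df["median_house_age"] / df["median_house_age"].mean()
print(df["age_vs_mean"].tolist())
```

By construction the new column is centered on 1, so the model’s coefficient for it directly captures how deviation from the typical neighborhood moves price.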
As expected, the mean of our calculated variable is 1, and our distribution now sits within ±0.5 of it. For features with major outliers we may see larger deviations that make the natural log a better option, but this approach works well for most cases.
Let’s consider why these approaches work: feature engineering lets us create calculated values that get to the root of the problem. Our null hypothesis is that a change in median_house_age, median_income, or any other variable has no impact on log_price. To test this hypothesis, we consider the best way to represent each variable so that it reflects our intent in including it. For some variables, our trick of comparison against the mean shows how far a listing is from normal, and training will assign a positive or negative coefficient based on the target variable. For others, we want to remove outliers or magnitude effects that detract from the focus of the feature, and we apply a natural logarithm to do so. These transformations are applied at the discretion of the modeler; I suggest trying multiple permutations to determine which version yields the more accurate prediction.
This concludes part 3 of Making Models | Airbnb Price Prediction. We have now designed an approach that allows us to customize our dataset, made sense of unstructured images, and created features that describe them quantitatively. In our final section, we will fit various models and examine how each impacts our prediction of log_price. Thank you for reading, and feel free to clap if you enjoyed.