March 25, 2016
The code for this project can be found on my GitHub account.
Yelp is an online service where users can sign up and submit reviews for businesses in their local area, like restaurants, tailors, and mechanics. Each review rates the business from 1 star to 5 stars and must contain a written explanation; moreover, other users can give "Useful" votes to reviews they find useful.
In this project I take the Yelp dataset from the Yelp Dataset Challenge and attempt to predict the usefulness of the reviews, as measured by user votes, from other features of the reviews. It turns out that most of the potential predictive power can be achieved by a simple linear model based on the length of the review, or on the length of the review together with the number of stars given. It is possible, however, to produce an improved model by including features of the text of the review (à la natural language processing).
I converted the dataset's JSON files to .csv files using a script by Paul Butler. Because of computation time and RAM limitations, I took 5% (111,260 reviews) of the dataset as a training set and a different 5% as a test set.
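As a rough sketch of the sampling step (the file name and the use of a fixed seed are assumptions for illustration, not the exact script used):

```r
# Read the converted reviews and draw two disjoint 5% samples.
reviews <- read.csv("yelp_academic_dataset_review.csv", stringsAsFactors = FALSE)

set.seed(42)
n   <- nrow(reviews)
idx <- sample(n)          # random permutation of row indices
k   <- floor(0.05 * n)    # 5% of the data

train <- reviews[idx[1:k], ]              # first 5% as training set
test  <- reviews[idx[(k + 1):(2 * k)], ]  # a different 5% as test set
```

Taking the two samples from a single shuffled index guarantees they do not overlap.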
After plotting the data, I found that the distributions of the votes and of the number of words in a review are heavily skewed:
After considering various possibilities, I decided to use a log transformation on both:
We can see that applying a log transformation to each variable reduces the skewness significantly, especially for the number of words per review, which becomes nearly symmetric after the transformation. Thereafter, I scaled the review length to have a mean of 0 and a standard deviation of 1 in preparation for further analysis.
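The transformation and scaling steps above might look like the following in R (the column names here are assumptions):

```r
# Log-transform the skewed variables; add 1 so that zero counts stay finite.
train$log_votes <- log(train$votes_useful + 1)
train$log_words <- log(train$n_words + 1)

# Scale the (log) review length to mean 0 and standard deviation 1.
train$log_words <- as.numeric(scale(train$log_words))
```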
If we plot the number of words in a review against usefulness, we can see that generally a greater number of words predicts greater utility, but there seems to be some non-linearity at both ends of the scale: if the review is too long, perhaps people don’t bother to read it, or in any case the marginal benefit of the length diminishes.
I attempted to predict the usefulness of the review from length, using a random forest model to account for the non-linearity. This was able to explain about 11% of the variance in the number of usefulness votes. However, a simpler method is to try to predict the usefulness with a linear model, but with dummy variables for very short or very long reviews (under 20 or over 700 words).
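A sketch of how the design matrix for that linear model could be assembled (the `train` column names are assumptions; the feature names match the regression output below, with three-star reviews as the omitted baseline):

```r
# Star-rating dummies (three stars is the baseline category) and
# indicators for very short / very long reviews.
reviews_length <- data.frame(
  votes_useful = train$log_votes,
  n_words      = train$log_words,
  stars1       = as.numeric(train$stars == 1),
  stars2       = as.numeric(train$stars == 2),
  stars4       = as.numeric(train$stars == 4),
  stars5       = as.numeric(train$stars == 5),
  under20      = as.numeric(train$n_words < 20),
  over700      = as.numeric(train$n_words > 700)
)

fit <- lm(votes_useful ~ ., data = reviews_length)
summary(fit)
```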
We can see the results of the linear regression in R here:
Call:
lm(formula = votes_useful ~ ., data = reviews_length)

Coefficients:
          Estimate Std. Error t value Pr(>|t|)
n_words   0.222471   0.001459 152.473  < 2e-16
stars1    0.040228   0.004493   8.954  < 2e-16
stars2    0.027351   0.005091   5.372 7.78e-08
stars4    0.022813   0.004019   5.677 1.38e-08
stars5   -0.027025   0.003840  -7.037 1.97e-12
under20   0.241593   0.005828  41.453  < 2e-16
over700   0.086196   0.019381   4.447 8.69e-06

Multiple R-squared: 0.1186, Adjusted R-squared: 0.1186
The coefficients of a linear model are very easily interpreted. We see that n_words, the variable corresponding to the length of each review, is assigned a relatively high coefficient; unsurprisingly, a longer review tends on average to be more useful. Moreover, the large coefficients for the under20 and over700 indicator variables (previously described) help to account for the nonlinear relationship between review length and the number of usefulness votes. All of these coefficients are statistically significant, with the highest p-value being on the order of 10^-6, so we can be fairly confident that our results are not spurious.
Note that the adjusted R-squared value is 0.1186, meaning that this linear model explains 11.8% of the variance in the number of usefulness votes. This linear model with dummy variables works about as well as the random forest, so I based further analysis on this model for the sake of simplicity. (If we can explain the same amount of variance with a simpler model, we may as well use the simpler model, which is easier for us to interpret than a complicated random forest.)
The next step was to try to engineer new features to improve on this simple linear model. Accordingly, I constructed a text feature matrix using the quanteda R package, creating a word-frequency variable for every word that shows up at least 800 times in the training and test sets taken together. Without this cutoff, we would have a staggering number of variables and would probably run into overfitting problems.
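The feature matrix construction could be sketched as follows (the `text` columns and the exact preprocessing options are assumptions; `dfm_trim`'s frequency argument is `min_termfreq` in recent quanteda versions):

```r
library(quanteda)

# Build a document-feature matrix over all review texts (train + test),
# then keep only words appearing at least 800 times in total.
all_texts <- c(train$text, test$text)
dfm_all <- dfm(tokens(all_texts, remove_punct = TRUE))
dfm_all <- dfm_trim(dfm_all, min_termfreq = 800)
```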
Next, I used L1-regularized linear regression with these word-frequency features along with the original variables to predict the usefulness of each review.
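With glmnet, an L1-regularized (lasso) fit along these lines is one way to do this; the `word_freqs` matrix and variable names are assumptions for illustration:

```r
library(glmnet)

# Combine the original length/star features with the word-frequency
# columns from the trimmed document-feature matrix.
x <- cbind(as.matrix(reviews_length[, -1]), word_freqs)
y <- reviews_length$votes_useful

# alpha = 1 selects the pure L1 penalty; cv.glmnet chooses the
# regularization strength lambda by cross-validation.
cv_fit <- cv.glmnet(x, y, alpha = 1)
coef(cv_fit, s = "lambda.min")  # inspect which words receive nonzero weight
```

The L1 penalty drives most word coefficients to exactly zero, which keeps the model interpretable despite the large number of candidate features.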
The new model continues to assign a large amount of weight to the number of words, with a generally positive association between length and usefulness, but with a negative coefficient on the indicator for having more than 700 words. This corresponds with the previous plot of review usefulness vs. length, where we observed that returns on review length diminished after a certain point. This does not necessarily signify that long reviews are not useful, but merely that a correction is necessary in order to avoid putting too much weight on the length.
We can look at the words most predictive of usefulness according to the model:
limo, ya, ass, yelpers, shit, won, dude, ...
Interestingly, some of the words which are most predictive of usefulness are quite negative in tone. This might be related to the fact that extremely negative reviews are rare, and consequently seem more informative. This is supported by the relatively high regression coefficient given to the review rating the business one star out of five, suggesting that negative reviews really do provide what users perceive to be very useful information.
Using our model to predict the number of usefulness votes for the test set, we find that the correlation between the predicted votes and the actual votes is 0.369, which implies (squaring the correlation) that our model can explain 13.6% of the variance in the number of usefulness votes. This is only a modest improvement over the linear model (which explained 11.8% of the variance), but it is nevertheless a real improvement.
Since the word-frequency matrix was created using the vocabulary from both the training and test data, I needed to do more work in order to produce a method of creating predictions for entirely new reviews outside of that data. In essence, it was necessary to count the number of times each feature (corresponding to a unique word) occurs in the new text and then to apply the coefficients of the model to the results in order to produce the new predictions.
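In outline, the scoring function for new text might look like this (the function and its arguments are hypothetical names; `dfm_match` aligns a new document-feature matrix to the training vocabulary):

```r
# Score a new review: count occurrences of each vocabulary word in the
# new text, then apply the fitted model coefficients.
predict_usefulness <- function(new_text, cv_fit, vocab) {
  toks   <- tokens(new_text, remove_punct = TRUE)
  counts <- dfm_match(dfm(toks), features = vocab)
  # (The non-word features -- length, star dummies -- would be appended
  # here in the same column order used during training.)
  predict(cv_fit, newx = as.matrix(counts), s = "lambda.min")
}
```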
Applying this method to a new business from a third set of reviews (another 5% of the Yelp dataset), visual inspection of the results in RStudio verifies that the model does a good job of distinguishing somewhat useful reviews from useless reviews. However, it also indicates (as noted above) that this results largely from the length of the review.
For the business in question, we can look at the review which is predicted by the model to be the most useful:
Some decent 'Que on my side of town! I'll admit, I had to check out this place after it was featured on "Diners, Drive-Ins, and Dives" on the Food Network. But without this tidbit, I might have never knew the place existed! Funniest thing about John Mull's is that it is basically located in the middle of residential homes. Yes! When you GPS the address you might think you're being led to the wrong place. But keep truckin' because the BBQ joint is on property of what looks to be a ranch house. John Mull's eating area shares the common space with what is essentially a butcher block. Get a whole cow, some eggs, and then a meal of BBQ afterwards? I don't see why not. While the market has been around for some time, my review basically goes over the BBQ station. It used to be just a shack when they first started, [...] (942 words)
And for the two least useful reviews:
1. Good Hot Links!
2. Great tasting food!
The difference in length is evident.
We can use Spearman's rank-order correlation coefficient to compare two rankings. This method simply calculates the standard correlation between two ordinal rankings of a set of items (where the rankings go 1, 2, 3, ...). Doing so for our third set of reviews, I calculated a correlation of 0.41 between the usefulness ranking predicted by the model and the actual usefulness ranking. This is somewhat higher than what we get if we instead group the reviews by business, calculate the rank-order correlation within each business, and average these coefficients, which yields an average rank-order correlation of 0.34. Our predicted rankings are somewhat accurate, but they could definitely be better.
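Both calculations are one-liners in base R (the `predicted_votes`, `actual_votes`, and `business_id` vectors are assumed names for illustration):

```r
# Overall Spearman rank-order correlation between predicted and actual votes.
overall_rho <- cor(predicted_votes, actual_votes, method = "spearman")

# Per-business: compute the same correlation within each business's
# reviews, then average the coefficients.
by_biz <- tapply(seq_along(actual_votes), business_id, function(i) {
  cor(predicted_votes[i], actual_votes[i], method = "spearman")
})
mean(by_biz, na.rm = TRUE)
```

`na.rm = TRUE` guards against businesses with a single review, for which the within-group correlation is undefined.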
Overall, the model does improve on the simple linear model based on length, but not by much. However, there are suggestions in our results that yet more improvement is possible. For example, the model tends to put a high weight on words with a negative tone. This suggests that it might be possible to obtain better results by applying more sophisticated NLP methods, such as sentiment analysis.
This is a fairly common situation: most of the potential predictive power can be obtained fairly easily. Improvement is possible, but rapidly diminishing improvements require the application of rapidly increasing resources in terms of computing power, algorithmic complexity, and technical skill. In any particular case, one must determine to what degree it is worthwhile to apply these resources for the sake of the goal in question.