H6751 Assignment: Yelp Reviews Data (Solutions)

Introduction

This assignment uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.

Description of the data:

  • yelp.csv contains the dataset. It is stored in the course repository (in the data directory), so there is no need to download anything from the Kaggle website.
  • Each observation (row) in this dataset is a review of a particular business by a particular user.
  • The stars column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher star rating is better.) In other words, it is the rating of the business by the person who wrote the review.
  • The text column is the text of the review.

Goal: Predict the star rating of a review using only the review text.

Tip: After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

Task 1

Read yelp.csv into a Pandas DataFrame and examine it.

In [2]:
# read yelp.csv using a relative path
import pandas as pd
path = 'yelp.csv'
yelp = pd.read_csv(path)
In [3]:
# examine the shape
yelp.shape
Out[3]:
(10000, 10)
In [4]:
# examine the first row
yelp.head(1)
Out[4]:
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0
In [5]:
# examine the class distribution
yelp.stars.value_counts().sort_index()
Out[5]:
1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

Task 2

Create a new DataFrame that only contains the 5-star and 1-star reviews.

In [6]:
# filter the DataFrame using an OR condition
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# alternatively, use the 'loc' method to accomplish the same thing
yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]
In [7]:
# examine the shape
yelp_best_worst.shape
Out[7]:
(4086, 10)

Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the review text as the only feature and the star rating as the response.

  • Hint: Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
In [8]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars
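Following the hint above, a quick sanity check (a sketch, not part of the original solution) confirms that X is a one-dimensional Series rather than a DataFrame:

# CountVectorizer expects a 1-dimensional sequence of strings
print(type(X))    # <class 'pandas.core.series.Series'>
print(X.shape)    # (4086,)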
In [9]:
# split X and y into training and testing sets

# note: in older scikit-learn versions, train_test_split was imported from sklearn.cross_validation
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
In [10]:
# examine the object shapes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(3064,)
(1022,)
(3064,)
(1022,)

Task 4

Use CountVectorizer to create document-term matrices from X_train and X_test.

In [12]:
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
In [13]:
# fit and transform X_train into X_train_dtm
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
Out[13]:
(3064, 16825)
In [14]:
# transform X_test into X_test_dtm
X_test_dtm = vect.transform(X_test)
X_test_dtm.shape
Out[14]:
(1022, 16825)
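A small illustration (not part of the original solution) of why vect is fit only on X_train and then merely applied to X_test: tokens that were never seen during fitting are ignored by transform, which is why both matrices share the same 16,825 columns.

# toy example: the unseen token 'terrible' is dropped at transform time
toy = CountVectorizer()
toy.fit(['great food great service'])
print(toy.get_feature_names())                     # ['food', 'great', 'service']
print(toy.transform(['terrible food']).toarray())  # [[1 0 0]]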

Task 5

Use Multinomial Naive Bayes to predict the star rating for the reviews in the testing set, and then calculate the accuracy and print the confusion matrix.

In [15]:
# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
In [16]:
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)
Out[16]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [17]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
In [18]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
Out[18]:
0.9187866927592955
In [19]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
Out[19]:
array([[126,  58],
       [ 25, 813]])

Task 6 (Challenge)

Calculate the null accuracy, which is the classification accuracy that could be achieved by always predicting the most frequent class.

  • Hint: Evaluating a classification model explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!
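In addition to the approaches shown below, a compact one-liner (a sketch, not part of the graded solution) computes the proportion of the most frequent class directly:

# proportion of the most frequent class in the testing set
y_test.value_counts(normalize=True).max()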
In [20]:
# examine the class distribution of the testing set
y_test.value_counts()
Out[20]:
5    838
1    184
Name: stars, dtype: int64
In [21]:
# calculate null accuracy
y_test.value_counts().head(1) / y_test.shape
Out[21]:
5    0.819961
Name: stars, dtype: float64
In [22]:
# calculate null accuracy manually
838 / float(838 + 184)
Out[22]:
0.8199608610567515

Task 7 (Challenge)

Browse through the review text of some of the false positives and false negatives. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

  • Hint: Evaluating a classification model explains the definitions of "false positives" and "false negatives".
  • Hint: Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
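One way to check which class scikit-learn treats as the "positive" class (a sketch, not part of the original solution): the rows and columns of the confusion matrix follow the order of nb.classes_, so the second class, 5 stars, plays the role of the positive class here.

# class ordering used by the model and by the confusion matrix
print(nb.classes_)    # [1 5]
metrics.confusion_matrix(y_test, y_pred_class, labels=nb.classes_)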
In [23]:
# first 10 false positives (meaning they were incorrectly classified as 5-star reviews)
X_test[y_test < y_pred_class][0:10]
Out[23]:
2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
2674    I'm sorry to be what seems to be the lone one ...
9984    Went last night to Whore Foods to get basics t...
3392    I found Lisa G's while driving through phoenix...
8283    Don't know where I should start. Grand opening...
2765    Went last week, and ordered a dozen variety. I...
2839    Never Again,\nI brought my Mountain Bike in (w...
321     My wife and I live around the corner, hadn't e...
1919                                         D-scust-ing.
Name: text, dtype: object
In [27]:
# false positive: model is reacting to the words "good", "impressive", "nice"
X_test[1781]
Out[27]:
"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating."
In [28]:
# false positive: model does not have enough data to work with
X_test[1919]
Out[28]:
'D-scust-ing.'
In [29]:
# first 10 false negatives (meaning they were incorrectly classified as 1-star reviews)
X_test[y_test > y_pred_class][0:10]
Out[29]:
7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
3448    I was there last week with my sisters and whil...
6050    I went to sears today to check on a layaway th...
2504    I've passed by prestige nails in walmart 100s ...
2475    This place is so great! I am a nanny and had t...
241     I was sad to come back to lai lai's and they n...
Name: text, dtype: object
In [30]:
# false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum"
X_test[4963]
Out[30]:
'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I\'m in. The shoe SA\'s will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! \n\nI am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney\'s is not only the prompt attention of SA\'s, but the fact that they aren\'t rushing around trying to help 35 people at once. The SA\'s at Barney\'s are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. I have also never experienced a "high pressure" sale situation here.\n\nAll in all, Barneys is pricey, and there is no getting around it. But, um, so is Neiman\'s and that place is a crock. Anywhere that ONLY accepts American Express or their charge card and then treats you like scum if you aren\'t carrying neither is no place that I want to spend my hard earned dollars. Yay Barneys!'

Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews.

  • Hint: Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the feature_count_ and class_count_ attributes of the Naive Bayes model object.
In [31]:
# store the vocabulary of X_train
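# (note: in scikit-learn 1.0 and later, this method is named get_feature_names_out)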
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)
Out[31]:
16825
In [32]:
# first row is one-star reviews, second row is five-star reviews
nb.feature_count_.shape
Out[32]:
(2, 16825)
In [33]:
# store the number of times each token appears across each class
one_star_token_count = nb.feature_count_[0, :]
five_star_token_count = nb.feature_count_[1, :]
In [34]:
# create a DataFrame of tokens with their separate one-star and five-star counts
tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')
In [35]:
# add 1 to one-star and five-star counts to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1
In [36]:
# first number is one-star reviews, second number is five-star reviews
nb.class_count_
Out[36]:
array([ 565., 2499.])
In [37]:
# convert the one-star and five-star counts into frequencies
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]
In [38]:
# calculate the ratio of five-star to one-star for each token
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star
In [39]:
# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('five_star_ratio', ascending=False).head(10)
Out[39]:
one_star five_star five_star_ratio
token
fantastic 0.003540 0.077231 21.817727
perfect 0.005310 0.098039 18.464052
yum 0.001770 0.024810 14.017607
favorite 0.012389 0.138055 11.143029
outstanding 0.001770 0.019608 11.078431
brunch 0.001770 0.016807 9.495798
gem 0.001770 0.016006 9.043617
mozzarella 0.001770 0.015606 8.817527
pasty 0.001770 0.015606 8.817527
amazing 0.021239 0.185274 8.723323
In [40]:
# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows
tokens.sort_values('five_star_ratio', ascending=True).head(10)
Out[40]:
one_star five_star five_star_ratio
token
staffperson 0.030088 0.0004 0.013299
refused 0.024779 0.0004 0.016149
disgusting 0.042478 0.0008 0.018841
filthy 0.019469 0.0004 0.020554
unprofessional 0.015929 0.0004 0.025121
unacceptable 0.015929 0.0004 0.025121
acknowledge 0.015929 0.0004 0.025121
ugh 0.030088 0.0008 0.026599
fuse 0.014159 0.0004 0.028261
boca 0.014159 0.0004 0.028261
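To spot-check an individual token from the tables above (a quick sketch, not part of the original solution), look up its row by index label:

# 'amazing' appears in the five-star table above
tokens.loc['amazing', :]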

Task 9 (4 pts)

A fully-connected neural network was developed for a 3-class document classification problem; in the bag-of-words feature space, these documents are not linearly separable. A Python implementation based on Keras can be found in the cell below. The code runs to completion, but there are some mistakes in the implementation. What are the mistakes, and what is the rationale for your answer?

In [ ]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import to_categorical
from sklearn.feature_extraction.text import CountVectorizer
# define the corpus
documents = ["t0 t1 t4 t4", "t3 t0 t2", "t2 t2 t4", "t4 t3 t2",
             "t3 t2 t0 t1 t1", "t1 t0 t0 t3", "t2 t2 t1 t3"]
y_int = [0, 1, 0, 2, 0, 1, 2]
bow = CountVectorizer()
sparse_fea = bow.fit_transform(documents)
data = sparse_fea.todense()
labels = to_categorical(y_int)
model = Sequential()
model.add(Dense(20, input_shape=(5,)))
model.add(Dense(3))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels)
new_documents = ["t4 t3 t2 t2 t5 t0", "t4 t3 t3"]
test_sparse_fea = bow.fit_transform(new_documents)
test_data = test_sparse_fea.todense()
predict_label = model.predict_classes(test_data)
print(predict_label)
print("The predicted labels of the new documents are {}".format(predict_label))

Write your answer below.

  1. transform (not fit_transform) should be used to generate test_sparse_fea, so that the test documents are encoded with the vocabulary learned from the training corpus. Re-fitting on new_documents builds a new vocabulary (t1 is dropped and the unseen token t5 is added), so the test feature columns no longer correspond to the tokens the model was trained on.

  2. A non-linear activation (e.g. relu) should be used in the hidden Dense layer, and the output layer needs a softmax activation. Without non-linearities, the stacked Dense layers collapse into a single linear model, which cannot fit data that is not linearly separable, and categorical_crossentropy expects the network to output class probabilities.
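For reference, here is a minimal sketch of how the corrected code might look, applying the two fixes above (one possible repair, not the only one; np.argmax over model.predict is used in place of predict_classes, which has been removed from newer Keras versions):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from sklearn.feature_extraction.text import CountVectorizer

documents = ["t0 t1 t4 t4", "t3 t0 t2", "t2 t2 t4", "t4 t3 t2",
             "t3 t2 t0 t1 t1", "t1 t0 t0 t3", "t2 t2 t1 t3"]
y_int = [0, 1, 0, 2, 0, 1, 2]

bow = CountVectorizer()
data = bow.fit_transform(documents).toarray()       # fit the vocabulary on the training corpus only
labels = to_categorical(y_int)

model = Sequential()
model.add(Dense(20, activation='relu', input_shape=(data.shape[1],)))  # non-linear hidden layer
model.add(Dense(3, activation='softmax'))                              # class probabilities for 3 classes
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels)

new_documents = ["t4 t3 t2 t2 t5 t0", "t4 t3 t3"]
test_data = bow.transform(new_documents).toarray()  # transform only: the unseen token t5 is ignored
predict_label = np.argmax(model.predict(test_data), axis=1)
print("The predicted labels of the new documents are {}".format(predict_label))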