This assignment uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.
Description of the data:
yelp.csv contains the dataset. It is stored in the course repository (in the data directory), so there is no need to download anything from the Kaggle website.
Goal: Predict the star rating of a review using only the review text.
Tip: After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
# for Python 2: use print only as a function
from __future__ import print_function
Read yelp.csv into a Pandas DataFrame and examine it.
# read yelp.csv using a relative path
import pandas as pd
path = 'yelp.csv'
yelp = pd.read_csv(path)
# examine the shape
yelp.shape
# examine the first row
yelp.head(1)
# examine the class distribution
yelp.stars.value_counts().sort_index()
Create a new DataFrame that only contains the 5-star and 1-star reviews.
# filter the DataFrame using an OR condition
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]
# alternatively, use the 'loc' method to accomplish the same thing
yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]
# examine the shape
yelp_best_worst.shape
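As a side note, the same filter can be written with isin, which scales better when more than two star values are wanted. The sketch below runs on a made-up toy DataFrame (the name toy and its contents are assumptions for illustration, not the Yelp data) and checks that both filters select the same rows.

```python
import pandas as pd

# toy stand-in for the yelp DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    'stars': [5, 3, 1, 4, 5, 1],
    'text': ['great', 'ok', 'awful', 'good', 'loved it', 'never again'],
})

# OR condition, as in the cells above
by_or = toy[(toy.stars == 5) | (toy.stars == 1)]

# equivalent filter using isin
by_isin = toy[toy.stars.isin([1, 5])]

print(by_isin.shape)
print(by_or.equals(by_isin))
```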
Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the review text as the only feature and the star rating as the response.
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars
# split X and y into training and testing sets
#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# examine the object shapes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Use CountVectorizer to create document-term matrices from X_train and X_test.
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# fit and transform X_train into X_train_dtm
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
# transform X_test into X_test_dtm
X_test_dtm = vect.transform(X_test)
X_test_dtm.shape
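To make the fit_transform / transform distinction concrete, here is a minimal sketch on two made-up training documents and one made-up test document (all names prefixed with toy_ or ending in _docs are assumptions for illustration). The vocabulary is learned from the training documents only, so a token that first appears in the test document is ignored and the test matrix keeps the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['good food good service', 'bad food']
test_docs = ['good pizza']  # 'pizza' never appears in the training documents

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)  # learn vocabulary, then count
test_dtm = toy_vect.transform(test_docs)        # count using the learned vocabulary

# sorting the learned tokens gives them in column order
vocab = sorted(toy_vect.vocabulary_)
print(vocab)
print(train_dtm.toarray())
print(test_dtm.toarray())
```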
Use Multinomial Naive Bayes to predict the star rating for the reviews in the testing set, and then calculate the accuracy and print the confusion matrix.
# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
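Reading the confusion matrix is easier with a small example. In scikit-learn, rows are the true classes and columns are the predicted classes, both in sorted label order. The sketch below uses made-up labels (y_true and y_hat are assumptions for illustration) and treats 5-star as the positive class.

```python
from sklearn import metrics

# made-up true and predicted star ratings
y_true = [1, 1, 1, 5, 5, 5, 5, 5, 5]
y_hat = [1, 1, 5, 5, 5, 5, 5, 1, 1]

# rows: true 1-star, true 5-star; columns: predicted 1-star, predicted 5-star
cm = metrics.confusion_matrix(y_true, y_hat, labels=[1, 5])
print(cm)

# with 5-star as the positive class, ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / float(cm.sum())
print(tn, fp, fn, tp, accuracy)
```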
Calculate the null accuracy, which is the classification accuracy that could be achieved by always predicting the most frequent class.
# examine the class distribution of the testing set
y_test.value_counts()
# calculate null accuracy
y_test.value_counts().head(1) / len(y_test)
# calculate null accuracy manually
838 / float(838 + 184)
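The null accuracy can also be computed without hard-coding the class counts. This is a minimal sketch on a made-up Series (toy_y is an assumption for illustration); value_counts(normalize=True) returns class proportions, and the largest proportion is the null accuracy.

```python
import pandas as pd

# made-up response values: six 5-star reviews and two 1-star reviews
toy_y = pd.Series([5, 5, 5, 1, 5, 1, 5, 5])

# proportion of the most frequent class
null_accuracy = toy_y.value_counts(normalize=True).max()
print(null_accuracy)
```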
Browse through the review text of some of the false positives and false negatives. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?
# first 10 false positives (meaning they were incorrectly classified as 5-star reviews)
X_test[y_test < y_pred_class][0:10]
# false positive: model is reacting to the words "good", "impressive", "nice"
X_test[1781]
# false positive: model does not have enough data to work with
X_test[1919]
# first 10 false negatives (meaning they were incorrectly classified as 1-star reviews)
X_test[y_test > y_pred_class][0:10]
# false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum"
X_test[4963]
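The comparisons above work because the classes are the numbers 1 and 5: a true value smaller than the prediction means a 1-star review was predicted as 5-star, and the reverse for false negatives. The sketch below illustrates this on made-up data (all toy_ names are assumptions for illustration). It also keeps a non-sequential index, as X_test has after the split, and looks up a review by index label with loc.

```python
import numpy as np
import pandas as pd

# made-up reviews with a shuffled index, as after train_test_split
idx = [10, 42, 7, 99]
toy_text = pd.Series(['fine', 'terrible', 'great', 'meh'], index=idx)
toy_true = pd.Series([1, 1, 5, 5], index=idx)
toy_pred = np.array([5, 1, 5, 1])  # predict() returns a NumPy array

# true 1-star predicted as 5-star
false_pos = toy_text[toy_true < toy_pred]
# true 5-star predicted as 1-star
false_neg = toy_text[toy_true > toy_pred]

print(false_pos)
print(false_neg)
# look up a single review by its index label
print(toy_text.loc[10])
```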
Calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews.
Hint: Naive Bayes stores the per-class token counts and the per-class observation counts in the feature_count_ and class_count_ attributes of the Naive Bayes model object.
# store the vocabulary of X_train
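# note: use get_feature_names() instead of get_feature_names_out() for scikit-learn versions before 1.0
X_train_tokens = vect.get_feature_names_out()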
len(X_train_tokens)
# first row is one-star reviews, second row is five-star reviews
nb.feature_count_.shape
# store the number of times each token appears across each class
one_star_token_count = nb.feature_count_[0, :]
five_star_token_count = nb.feature_count_[1, :]
# create a DataFrame of tokens with their separate one-star and five-star counts
tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')
# add 1 to one-star and five-star counts to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1
# first number is one-star reviews, second number is five-star reviews
nb.class_count_
# convert the one-star and five-star counts into frequencies
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]
# calculate the ratio of five-star to one-star for each token
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star
# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('five_star_ratio', ascending=False).head(10)
# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows
tokens.sort_values('five_star_ratio', ascending=True).head(10)
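The ratio calculation is easier to follow on a corpus small enough to check by hand. The sketch below repeats the same steps on four made-up reviews (docs, stars, and the toy_ names are assumptions for illustration, not part of the assignment data).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# made-up reviews: two 1-star and two 5-star
docs = ['terrible rude terrible', 'rude slow', 'amazing friendly', 'amazing amazing slow']
stars = [1, 1, 5, 5]

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(docs)
toy_nb = MultinomialNB().fit(toy_dtm, stars)

# row 0 of feature_count_ is the 1-star class, row 1 is the 5-star class;
# add 1 to every count to avoid dividing by 0
toy_tokens = pd.DataFrame({
    'one_star': toy_nb.feature_count_[0, :] + 1,
    'five_star': toy_nb.feature_count_[1, :] + 1,
}, index=sorted(toy_vect.vocabulary_))

# convert counts into per-class frequencies, then take the ratio
toy_tokens['one_star'] = toy_tokens.one_star / toy_nb.class_count_[0]
toy_tokens['five_star'] = toy_tokens.five_star / toy_nb.class_count_[1]
toy_tokens['five_star_ratio'] = toy_tokens.five_star / toy_tokens.one_star

print(toy_tokens.sort_values('five_star_ratio', ascending=False))
```

In this toy corpus "amazing" ends up with the highest ratio, and "slow", which appears once in each class, ends up with a ratio of 1.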
A fully connected neural network was developed for a 3-class document classification problem. In the bag-of-words feature space, the documents are not linearly separable. The Keras-based Python implementation is in the cell below. The code runs to completion, but the implementation contains some mistakes. What are these mistakes? Explain the rationale for your answer.
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils import to_categorical
from sklearn.feature_extraction.text import CountVectorizer
# define the corpus
documents = ["t0 t1 t4 t4", "t3 t0 t2", "t2 t2 t4", "t4 t3 t2", "t3 t2 t0 t1 t1", "t1 t0 t0 t3",
             "t2 t2 t1 t3"]
y_int = [0, 1, 0, 2, 0, 1, 2]
bow = CountVectorizer()
sparse_fea = bow.fit_transform(documents)
data = sparse_fea.todense()
labels = to_categorical(y_int)
model = Sequential()
model.add(Dense(20, input_shape=(5,)))
model.add(Dense(3))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels)
new_documents = ["t4 t3 t2 t2 t5 t0", "t4 t3 t3"]
test_sparse_fea = bow.fit_transform(new_documents)
test_data = test_sparse_fea.todense()
predict_label = model.predict_classes(test_data)
print(predict_label)
print("The predicted labels of the new documents are {}".format(predict_label))
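Use transform instead of fit_transform to generate test_sparse_fea. Refitting on the new documents learns a different vocabulary, so the test columns no longer correspond to the features the model was trained on.
The dense layers need activations. The hidden layer should use a non-linear activation (without one, the stacked layers reduce to a single linear model, which cannot separate these documents), and the last layer needs a softmax activation so that categorical_crossentropy receives class probabilities.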
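Both points can be checked without training a network. The sketch below is an illustration under stated assumptions, not the corrected Keras model: it reuses the documents from the cell above, while refit, W1, b1, W2, and b2 are hypothetical names, and the weights are random stand-ins for the two Dense layers. The first part shows that refitting on the new documents yields a matrix of the same width but with different column meanings, which is why the original code runs without an error. The second part shows that two dense layers without activations compute exactly one linear map.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

documents = ["t0 t1 t4 t4", "t3 t0 t2", "t2 t2 t4", "t4 t3 t2",
             "t3 t2 t0 t1 t1", "t1 t0 t0 t3", "t2 t2 t1 t3"]
new_documents = ["t4 t3 t2 t2 t5 t0", "t4 t3 t3"]

# correct: reuse the vocabulary learned from the training documents
bow = CountVectorizer()
bow.fit(documents)
train_vocab = sorted(bow.vocabulary_)
good_test = bow.transform(new_documents).toarray()

# incorrect: refitting learns a new vocabulary from the test documents
refit = CountVectorizer()
bad_test = refit.fit_transform(new_documents).toarray()
refit_vocab = sorted(refit.vocabulary_)

print(train_vocab, good_test.shape)
print(refit_vocab, bad_test.shape)  # same width, different columns

# two dense layers with no activation collapse into one linear layer
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 20)), rng.normal(size=20)
W2, b2 = rng.normal(size=(20, 3)), rng.normal(size=3)
two_layer = (good_test @ W1 + b1) @ W2 + b2
one_layer = good_test @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_layer, one_layer))
```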