Load data

In [1]:
import pandas as pd
import numpy as np
In [2]:
df_train = pd.read_csv("train.csv")
In [3]:
df_train.head()
Out[3]:
Comment Outcome Id
0 use variables in the outer function instead of... 1 18966
1 remember your knuth. "premature optimization i... 1 12559
2 if you're looking for something as nice as pyt... 1 36589
3 the readlines function will return a zero-leng... 1 1266
4 combining lindelof's and gregg lind's ideas: l... 1 9029
In [4]:
train_text, train_label = df_train.Comment.tolist(), df_train.Outcome.tolist()

Feature Parts

Inspecting the data reveals one key characteristic: many comments contain not only natural-language text but also code, which should be highly informative of the labels. This leads to the following two-part approach:

  1. Feature engineering: design seven hand-crafted variables that could be discriminative.
In [5]:
df_train['num_word'] = df_train.Comment.str.split().apply(len)                # word count
df_train['num_char'] = df_train.Comment.str.len()                             # character count
df_train['avelen_word'] = df_train.num_char/df_train.num_word.astype(float)   # average word length; complex words may be useful
df_train['num_number'] = df_train.Comment.str.count(r'\d')                    # digit count
df_train['penct_number'] = df_train.num_number/df_train.num_word.astype(float)          # digits per word
df_train['num_punctuation'] = df_train.Comment.str.count(r'[^\w\s]')          # punctuation count
df_train['penct_punctuation'] = df_train.num_punctuation/df_train.num_word.astype(float)  # punctuation per word
In [6]:
df_train.head()
Out[6]:
Comment Outcome Id num_word num_char avelen_word num_number penct_number num_punctuation penct_punctuation
0 use variables in the outer function instead of... 1 18966 52 343 6.596154 0 0.000000 25 0.480769
1 remember your knuth. "premature optimization i... 1 12559 59 329 5.576271 0 0.000000 14 0.237288
2 if you're looking for something as nice as pyt... 1 36589 49 271 5.530612 1 0.020408 27 0.551020
3 the readlines function will return a zero-leng... 1 1266 13 79 6.076923 0 0.000000 2 0.153846
4 combining lindelof's and gregg lind's ideas: l... 1 9029 80 435 5.437500 3 0.037500 51 0.637500
  2. Text Pre-processing and Text Feature Design:

    • Here, the only pre-processing step is to replace every digit with 1.
    • Using a range of ngrams at the character level generates features for both words and sub-word/code fragments, so the model can learn from code and natural text at the same time. Only two TfidfVectorizer hyper-parameters are tuned. The first is ngram_range: single ngram sizes of 1 through 6 can be probed individually to build intuition for the final range (a sketch of such a probe follows this list). The second is min_df: a larger value reduces the feature dimensionality.
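
As a rough illustration of the single-ngram probe mentioned above, each ngram size could be scored on a small held-out split. This is only a sketch and was not run in this notebook; the probe_train/probe_val split and the plain liblinear logistic regression are illustrative choices, not part of the pipeline below.

# Sketch only: score each single char-ngram size with a quick logistic-regression probe.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import re

probe_train, probe_val = train_test_split(df_train, test_size=0.1, random_state=1)
clean = lambda s: re.sub(r"\d", "1", s)   # same digit replacement as below

for n in [1, 2, 3, 4, 5, 6]:
    vec = TfidfVectorizer(ngram_range=(n, n), analyzer='char', min_df=20)
    Xtr = vec.fit_transform(probe_train.Comment.map(clean))
    Xva = vec.transform(probe_val.Comment.map(clean))
    probe = LogisticRegression(solver='liblinear', random_state=1).fit(Xtr, probe_train.Outcome)
    print(n, Xtr.shape[1], probe.score(Xva, probe_val.Outcome))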
In [7]:
from sklearn.model_selection import train_test_split
import re
In [8]:
trainsub, valsub = train_test_split(df_train, test_size=0.1, random_state=1)
In [9]:
train_text = trainsub.Comment.tolist()
In [10]:
train_text = [re.sub(r"\d", "1", ele) for ele in train_text]
In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [31]:
tfvec = TfidfVectorizer(ngram_range=(1,6), analyzer='char', min_df=20)
train_textvec = tfvec.fit_transform(train_text)
In [32]:
train_textvec.shape
Out[32]:
(45610, 317242)
In [33]:
train_fe= trainsub[['num_word', 'num_char', 'avelen_word',
                    'num_number','penct_number',
                   'num_punctuation', 'penct_punctuation']].to_numpy()
In [34]:
from scipy.sparse import hstack
In [35]:
X_train = hstack([train_textvec, train_fe])
y_train  = trainsub.Outcome.tolist()
In [36]:
val_text = valsub.Comment.tolist()
val_text = [re.sub(r"\d", "1", ele) for ele in val_text]
val_textvec=tfvec.transform(val_text)
val_fe= valsub[['num_word', 'num_char', 'avelen_word',
                    'num_number','penct_number',
                   'num_punctuation', 'penct_punctuation']].to_numpy()
In [37]:
X_val = hstack([val_textvec, val_fe])
y_val = valsub.Outcome.tolist()

Model Parts

Here a single linear model, logistic regression, is used. Powerful boosting models such as LightGBM or XGBoost might improve performance, but due to time and compute limits they were not tried here.
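
For reference, a boosted-tree baseline on the same sparse feature matrix might look like the sketch below. It was not run here; it assumes the lightgbm package is installed, reuses X_train, y_train, X_val, y_val from above, and the hyper-parameter values are placeholders rather than tuned choices.

# Sketch only: LightGBM on the combined TF-IDF + hand-crafted features (not run here).
import lightgbm as lgb

gbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=1)
gbm.fit(X_train.tocsr(), y_train)        # LightGBM accepts scipy sparse input
print(gbm.score(X_val.tocsr(), y_val))   # validation accuracy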

In [38]:
from sklearn.linear_model import LogisticRegression

clf_models = []
val_scores = []
# Sweep the inverse regularization strength C and keep every fitted model,
# scoring each one on the held-out validation split.
for c_param in [0.1, 1, 10, 100, 1000]:
    clf = LogisticRegression(random_state=1, C=c_param).fit(X_train, y_train)
    clf_models.append(clf)
    val_score = clf.score(X_val, y_val)
    val_scores.append(val_score)
    print(val_scores)  # running list of validation accuracies
/Users/rui/pyenv/bt5153/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
[0.7002762430939227]
[0.7002762430939227, 0.728887134964483]
[0.7002762430939227, 0.728887134964483, 0.7332280978689818]
/Users/rui/pyenv/bt5153/lib/python3.6/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
[0.7002762430939227, 0.728887134964483, 0.7332280978689818, 0.7117205998421468]
[0.7002762430939227, 0.728887134964483, 0.7332280978689818, 0.7117205998421468, 0.718034727703236]

Testing

In [39]:
def fea_transform(inputdf):
    # Recreate the seven hand-crafted features and the TF-IDF features for a
    # new dataframe, reusing the tfvec fitted on the training split.
    inputdf['num_word'] = inputdf.Comment.str.split().apply(len)
    inputdf['num_char'] = inputdf.Comment.str.len()
    inputdf['avelen_word'] = inputdf.num_char/inputdf.num_word.astype(float)
    inputdf['num_number'] = inputdf.Comment.str.count(r'\d')
    inputdf['penct_number'] = inputdf.num_number/inputdf.num_word.astype(float)
    inputdf['num_punctuation'] = inputdf.Comment.str.count(r'[^\w\s]')
    inputdf['penct_punctuation'] = inputdf.num_punctuation/inputdf.num_word.astype(float)
    fe = inputdf[['num_word', 'num_char', 'avelen_word',
                  'num_number', 'penct_number',
                  'num_punctuation', 'penct_punctuation']].to_numpy()
    text = inputdf.Comment.tolist()
    text = [re.sub(r"\d", "1", ele) for ele in text]
    textvec = tfvec.transform(text)
    return hstack([textvec, fe])
In [40]:
df_test = pd.read_csv("test.csv")
In [41]:
df_test.head()
Out[41]:
Comment Id
0 i use the tail() function: tail(vector, n=1) t... 69098
1 if x is your data.frame (or matrix) then x[ ,a... 72599
2 you need to use strptime() to convert the stri... 99130
3 you are, indeed, passing the object around and... 63567
4 i'm no r expert, but most languages use a refe... 97214
In [ ]:
test_vec = fea_transform(df_test)
In [ ]:
bestmodel_idx = val_scores.index(max(val_scores))
In [43]:
pred_labels = clf_models[bestmodel_idx].predict(test_vec)
In [ ]:
pred_labels
In [45]:
df_true = pd.read_csv("answer_key.csv")
In [46]:
true_label = df_true.Outcome.tolist()
In [47]:
from sklearn.metrics import accuracy_score
In [48]:
accuracy_score(true_label, pred_labels)
Out[48]:
0.7255633938129615

How to improve the accuracy?

  1. More complex models and a careful hyper-parameter search.

  2. Ensemble several models that each perform decently but are not strongly correlated. To find such weakly correlated models, we can vary both the text feature representation and the learning algorithm; for example, a neural network plus LightGBM is likely a more diverse pairing than XGBoost plus LightGBM. Inspecting the correlation of the models' predicted probabilities helps confirm this diversity (see the sketch below).
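
To make the correlation idea in point 2 concrete, the sketch below checks how correlated two models' predicted probabilities are on the validation split and blends them by simple averaging. model_a and model_b are placeholders for any two fitted classifiers with predict_proba (e.g. the logistic regression above plus a boosted model); this was not run in this notebook.

# Sketch only: measure prediction correlation, then average the probabilities.
import numpy as np

p1 = model_a.predict_proba(X_val)[:, 1]      # model_a, model_b are placeholders
p2 = model_b.predict_proba(X_val)[:, 1]
print("correlation of predicted probabilities:", np.corrcoef(p1, p2)[0, 1])

blend = (p1 + p2) / 2                        # simple average ensemble
print("blended accuracy:", np.mean((blend > 0.5).astype(int) == np.array(y_val)))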