Load data

In [1]:
import pandas as pd
import numpy as np
In [2]:
df_train = pd.read_csv("train.csv")
In [3]:
df_train.head()
Out[3]:
Comment Outcome Id
0 use variables in the outer function instead of... 1 18966
1 remember your knuth. "premature optimization i... 1 12559
2 if you're looking for something as nice as pyt... 1 36589
3 the readlines function will return a zero-leng... 1 1266
4 combining lindelof's and gregg lind's ideas: l... 1 9029
In [4]:
train_text, train_label = df_train.Comment.tolist(), df_train.Outcome.tolist()

Feature Parts

Inspecting the data reveals one key characteristic: many comments contain not only natural-language text but also code, which should be highly informative of the labels. This leads to the following two-part approach:

  1. Feature engineering: design seven hand-crafted variables that could be discriminative.
In [5]:
df_train['num_word'] = df_train.Comment.str.split().apply(len)                # word count
df_train['num_char'] = df_train.Comment.str.len()                             # character count
df_train['avelen_word'] = df_train.num_char/df_train.num_word.astype(float)   # average word length; complex words may be useful
df_train['num_number'] = df_train.Comment.str.count(r'\d')                    # digit count
df_train['penct_number'] = df_train.num_number/df_train.num_word.astype(float)          # digits per word
df_train['num_punctuation'] = df_train.Comment.str.count(r'[^\w\s]')          # punctuation count
df_train['penct_punctuation'] = df_train.num_punctuation/df_train.num_word.astype(float)  # punctuation per word
In [6]:
df_train.head()
Out[6]:
Comment Outcome Id num_word num_char avelen_word num_number penct_number num_punctuation penct_punctuation
0 use variables in the outer function instead of... 1 18966 52 343 6.596154 0 0.000000 25 0.480769
1 remember your knuth. "premature optimization i... 1 12559 59 329 5.576271 0 0.000000 14 0.237288
2 if you're looking for something as nice as pyt... 1 36589 49 271 5.530612 1 0.020408 27 0.551020
3 the readlines function will return a zero-leng... 1 1266 13 79 6.076923 0 0.000000 2 0.153846
4 combining lindelof's and gregg lind's ideas: l... 1 9029 80 435 5.437500 3 0.037500 51 0.637500
  2. Text Pre-processing and Text Feature Design:

    • Here, the only pre-processing step is to replace every digit with 1.
    • Using a range of ngrams at the character level generates features for both words and sub-word/code fragments, so the model can learn from code and natural text at the same time. Only two TfidfVectorizer hyper-parameters are tuned. The first is ngram_range: single ngram sizes of 1 through 6 can be probed individually to build intuition for the final range (a sketch of such a probe follows this list). The second is min_df: a larger value reduces the feature dimensionality.
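
As a rough illustration of the single-ngram probe mentioned above, each ngram size could be scored on a small held-out split. This is only a sketch and was not run in this notebook; the probe_train/probe_val split and the plain liblinear logistic regression are illustrative choices, not part of the pipeline below.

# Sketch only: score each single char-ngram size with a quick logistic-regression probe.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import re

probe_train, probe_val = train_test_split(df_train, test_size=0.1, random_state=1)
clean = lambda s: re.sub(r"\d", "1", s)   # same digit replacement as below

for n in [1, 2, 3, 4, 5, 6]:
    vec = TfidfVectorizer(ngram_range=(n, n), analyzer='char', min_df=20)
    Xtr = vec.fit_transform(probe_train.Comment.map(clean))
    Xva = vec.transform(probe_val.Comment.map(clean))
    probe = LogisticRegression(solver='liblinear', random_state=1).fit(Xtr, probe_train.Outcome)
    print(n, Xtr.shape[1], probe.score(Xva, probe_val.Outcome))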
In [7]:
from sklearn.model_selection import train_test_split
import re
In [8]:
trainsub, valsub = train_test_split(df_train, test_size=0.1, random_state=1)
In [9]:
train_text = trainsub.Comment.tolist()
In [10]:
train_text = [re.sub(r"\d", "1", ele) for ele in train_text]
In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [31]:
tfvec = TfidfVectorizer(ngram_range=(1,6), analyzer='char', min_df=20)
train_textvec = tfvec.fit_transform(train_text)
In [32]:
train_textvec.shape
Out[32]:
(45610, 317242)
In [33]:
train_fe= trainsub[['num_word', 'num_char', 'avelen_word',
                    'num_number','penct_number',
                   'num_punctuation', 'penct_punctuation']].to_numpy()
In [34]:
from scipy.sparse import hstack
In [35]:
X_train = hstack([train_textvec, train_fe])
y_train  = trainsub.Outcome.tolist()
In [36]:
val_text = valsub.Comment.tolist()
val_text = [re.sub(r"\d", "1", ele) for ele in val_text]
val_textvec=tfvec.transform(val_text)
val_fe= valsub[['num_word', 'num_char', 'avelen_word',
                    'num_number','penct_number',
                   'num_punctuation', 'penct_punctuation']].to_numpy()
In [37]:
X_val = hstack([val_textvec, val_fe])
y_val = valsub.Outcome.tolist()

Model Parts

Here a single linear model, logistic regression, is used. Powerful boosting models such as LightGBM or XGBoost might improve performance, but due to time and compute limits they were not tried here.
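
For reference, a boosted-tree baseline on the same sparse feature matrix might look like the sketch below. It was not run here; it assumes the lightgbm package is installed, reuses X_train, y_train, X_val, y_val from above, and the hyper-parameter values are placeholders rather than tuned choices.

# Sketch only: LightGBM on the combined TF-IDF + hand-crafted features (not run here).
import lightgbm as lgb

gbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=1)
gbm.fit(X_train.tocsr(), y_train)        # LightGBM accepts scipy sparse input
print(gbm.score(X_val.tocsr(), y_val))   # validation accuracy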

In [38]:
from sklearn.linear_model import LogisticRegression

clf_models = []
val_scores = []
# Sweep the inverse regularization strength C and keep every fitted model,
# scoring each one on the held-out validation split.
for c_param in [0.1, 1, 10, 100, 1000]:
    clf = LogisticRegression(random_state=1, C=c_param).fit(X_train, y_train)
    clf_models.append(clf)
    val_score = clf.score(X_val, y_val)
    val_scores.append(val_score)
    print(val_scores)  # running list of validation accuracies
/Users/rui/pyenv/bt5153/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
[0.7002762430939227]
[0.7002762430939227, 0.728887134964483]
[0.7002762430939227, 0.728887134964483, 0.7332280978689818]
/Users/rui/pyenv/bt5153/lib/python3.6/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
[0.7002762430939227, 0.728887134964483, 0.7332280978689818, 0.7117205998421468]
[0.7002762430939227, 0.728887134964483, 0.7332280978689818, 0.7117205998421468, 0.718034727703236]

Testing

In [39]:
def fea_transform(inputdf):
    # Recreate the seven hand-crafted features and the TF-IDF features for a
    # new dataframe, reusing the tfvec fitted on the training split.
    inputdf['num_word'] = inputdf.Comment.str.split().apply(len)
    inputdf['num_char'] = inputdf.Comment.str.len()
    inputdf['avelen_word'] = inputdf.num_char/inputdf.num_word.astype(float)
    inputdf['num_number'] = inputdf.Comment.str.count(r'\d')
    inputdf['penct_number'] = inputdf.num_number/inputdf.num_word.astype(float)
    inputdf['num_punctuation'] = inputdf.Comment.str.count(r'[^\w\s]')
    inputdf['penct_punctuation'] = inputdf.num_punctuation/inputdf.num_word.astype(float)
    fe = inputdf[['num_word', 'num_char', 'avelen_word',
                  'num_number', 'penct_number',
                  'num_punctuation', 'penct_punctuation']].to_numpy()
    text = inputdf.Comment.tolist()
    text = [re.sub(r"\d", "1", ele) for ele in text]
    textvec = tfvec.transform(text)
    return hstack([textvec, fe])
In [40]:
df_test = pd.read_csv("test.csv")
In [41]:
df_test.head()
Out[41]:
Comment Id
0 i use the tail() function: tail(vector, n=1) t... 69098
1 if x is your data.frame (or matrix) then x[ ,a... 72599
2 you need to use strptime() to convert the stri... 99130
3 you are, indeed, passing the object around and... 63567
4 i'm no r expert, but most languages use a refe... 97214
In [ ]:
test_vec = fea_transform(df_test)
In [ ]:
bestmodel_idx = val_scores.index(max(val_scores))
In [43]:
pred_labels = clf_models[bestmodel_idx].predict(test_vec)
In [ ]:
pred_labels
In [45]:
df_true = pd.read_csv("answer_key.csv")
In [46]:
true_label = df_true.Outcome.tolist()
In [47]:
from sklearn.metrics import accuracy_score
In [48]:
accuracy_score(true_label, pred_labels)
Out[48]:
0.7255633938129615

How to improve the accuracy?

  1. More complex models and a careful hyper-parameter search.

  2. Ensemble several models that each perform decently but are not strongly correlated. To find such weakly correlated models, we can vary both the text feature representation and the learning algorithm; for example, a neural network plus LightGBM is likely a more diverse pairing than XGBoost plus LightGBM. Inspecting the correlation of the models' predicted probabilities helps confirm this diversity (see the sketch below).
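
To make the correlation idea in point 2 concrete, the sketch below checks how correlated two models' predicted probabilities are on the validation split and blends them by simple averaging. model_a and model_b are placeholders for any two fitted classifiers with predict_proba (e.g. the logistic regression above plus a boosted model); this was not run in this notebook.

# Sketch only: measure prediction correlation, then average the probabilities.
import numpy as np

p1 = model_a.predict_proba(X_val)[:, 1]      # model_a, model_b are placeholders
p2 = model_b.predict_proba(X_val)[:, 1]
print("correlation of predicted probabilities:", np.corrcoef(p1, p2)[0, 1])

blend = (p1 + p2) / 2                        # simple average ensemble
print("blended accuracy:", np.mean((blend > 0.5).astype(int) == np.array(y_val)))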