import pandas as pd
import numpy as np
df_train = pd.read_csv("train.csv")
df_train.head()
train_text, train_label = df_train.Comment.tolist(), df_train.Outcome.tolist()
Inspecting the data, the key characteristic is that some comments contain not only natural-language text but also code, which should be highly informative of the labels. From this we can form a few hypotheses and encode them as simple surface features:
# simple surface features: length, average word length, digit and punctuation density
df_train['num_word'] = df_train.Comment.str.split().apply(len)
df_train['num_char'] = df_train.Comment.str.len()
df_train['avelen_word'] = df_train.num_char/df_train.num_word.astype(float) # complex (long) words may be useful
df_train['num_number'] = df_train.Comment.str.count(r'\d')
df_train['penct_number'] = df_train.num_number/df_train.num_word.astype(float)
df_train['num_punctuation'] = df_train.Comment.str.count(r'[^\w\s]')
df_train['penct_punctuation'] = df_train.num_punctuation/df_train.num_word.astype(float)
df_train.head()
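As a quick sanity check on these hypotheses, we can compare the average value of each surface feature across the outcome classes; a minimal sketch (it only uses the Outcome column loaded above):
feature_cols = ['num_word', 'num_char', 'avelen_word',
                'num_number', 'penct_number',
                'num_punctuation', 'penct_punctuation']
# group by label and inspect the mean of each surface feature
df_train.groupby('Outcome')[feature_cols].mean()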
Text Pre-processing and Text Feature Design:
from sklearn.model_selection import train_test_split
import re
trainsub, valsub = train_test_split(df_train, test_size=0.1, random_state=1)
train_text = trainsub.Comment.tolist()
train_text = [re.sub(r"\d", "1", ele) for ele in train_text]  # collapse every digit into "1" so numbers share n-grams
from sklearn.feature_extraction.text import TfidfVectorizer
# character n-grams of length 1-6; min_df=20 drops rare n-grams
# (the raw documents go to fit_transform, not the constructor)
tfvec = TfidfVectorizer(ngram_range=(1, 6), analyzer='char', min_df=20)
train_textvec = tfvec.fit_transform(train_text)
train_textvec.shape
train_fe= trainsub[['num_word', 'num_char', 'avelen_word',
'num_number','penct_number',
'num_punctuation', 'penct_punctuation']].to_numpy()
from scipy.sparse import hstack
X_train = hstack([train_textvec, train_fe])  # sparse TF-IDF features stacked with the dense handcrafted features
y_train = trainsub.Outcome.tolist()
val_text = valsub.Comment.tolist()
val_text = [re.sub(r"\d", "1", ele) for ele in val_text]
val_textvec = tfvec.transform(val_text)  # transform only; the vectorizer stays fitted on the training data
val_fe= valsub[['num_word', 'num_char', 'avelen_word',
'num_number','penct_number',
'num_punctuation', 'penct_punctuation']].to_numpy()
X_val = hstack([val_textvec, val_fe])
y_val = valsub.Outcome.tolist()
Here a single linear model, logistic regression, is used. Powerful gradient-boosting models such as LightGBM or XGBoost might improve performance, but due to time and compute limits they were not tried here; a rough sketch of how one could be plugged in is given after the logistic-regression search below.
from sklearn.linear_model import LogisticRegression
clf_models = []
val_scores = []
for c_param in [0.1, 1, 10, 100, 1000]:
    # larger C = weaker regularization; max_iter raised so the solver converges on the sparse TF-IDF features
    clf = LogisticRegression(random_state=1, C=c_param, max_iter=1000).fit(X_train, y_train)
    clf_models.append(clf)
    val_score = clf.score(X_val, y_val)
    val_scores.append(val_score)
print(val_scores)
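As mentioned above, gradient-boosting models were not tried in this run. For completeness, here is a rough sketch of how LightGBM could be dropped in on the same features, assuming the lightgbm package is installed (the parameters are illustrative, not tuned):
from lightgbm import LGBMClassifier
# the hstack output is COO; convert to CSR, which LightGBM handles directly
gbm = LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=1)
gbm.fit(X_train.tocsr(), y_train)
print(gbm.score(X_val.tocsr(), y_val))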
def fea_transform(inputdf):
    # recompute the same surface features used for the training set
    inputdf['num_word'] = inputdf.Comment.str.split().apply(len)
    inputdf['num_char'] = inputdf.Comment.str.len()
    inputdf['avelen_word'] = inputdf.num_char/inputdf.num_word.astype(float)
    inputdf['num_number'] = inputdf.Comment.str.count(r'\d')
    inputdf['penct_number'] = inputdf.num_number/inputdf.num_word.astype(float)
    inputdf['num_punctuation'] = inputdf.Comment.str.count(r'[^\w\s]')
    inputdf['penct_punctuation'] = inputdf.num_punctuation/inputdf.num_word.astype(float)
    fe = inputdf[['num_word', 'num_char', 'avelen_word',
                  'num_number', 'penct_number',
                  'num_punctuation', 'penct_punctuation']].to_numpy()
    text = inputdf.Comment.tolist()
    text = [re.sub(r"\d", "1", ele) for ele in text]
    textvec = tfvec.transform(text)  # reuse the fitted vectorizer; do not refit
    return hstack([textvec, fe])
df_test = pd.read_csv("test.csv")
df_test.head()
test_vec = fea_transform(df_test)
bestmodel_idx = val_scores.index(max(val_scores))
pred_labels = clf_models[bestmodel_idx].predict(test_vec)  # predict with the model that scored best on validation
pred_labels
df_true = pd.read_csv("answer_key.csv")
true_label = df_true.Outcome.tolist()
from sklearn.metrics import accuracy_score
accuracy_score(true_label, pred_labels)
How to improve the accuracy?
Use more complex models and a careful hyper-parameter search; a grid-search sketch is given below.
Ensemble several models that each perform decently and whose predictions are not strongly correlated. How do we find such weakly correlated models? By exploring different text-feature methods and learning algorithms: for example, combining XGBoost with LightGBM (two very similar tree ensembles) is likely a weaker choice than combining a neural network with LightGBM. Visualizing the correlation between the predicted probabilities of candidate models helps judge this; a sketch is given below.
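For the hyper-parameter search, scikit-learn's GridSearchCV can replace the manual loop over C used above; a minimal sketch with an illustrative grid:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(LogisticRegression(random_state=1, max_iter=1000),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)  # cross-validated search instead of a single held-out split
print(search.best_params_, search.best_score_)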
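To judge whether two candidate models are diverse enough to ensemble, one way is to look at how correlated their predicted probabilities are on the validation set. The sketch below assumes a binary Outcome and a second fitted model, here the hypothetical gbm from the LightGBM sketch above:
# predicted probability of the positive class from each model
p_lr = clf_models[bestmodel_idx].predict_proba(X_val)[:, 1]
p_gbm = gbm.predict_proba(X_val.tocsr())[:, 1]
print(np.corrcoef(p_lr, p_gbm)[0, 1])  # low correlation suggests an ensemble may help
p_ens = (p_lr + p_gbm) / 2  # a simple ensemble: average the probability estimates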