Using the Amazon dataset provided in class, I tested several binary classifiers beyond the course material using the sklearn library, including XGBoost and an RBF-kernel SVM, and ultimately settled on Logistic Regression, Naive Bayes, and a Perceptron. Of these, Naive Bayes (whose Multinomial and Gaussian variants I could not use because of the synthetic features explained below) was usually the weakest of the three. Logistic Regression usually performed best when its inverse regularization strength hyperparameter was tuned to between 0.1 and 10.
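The comparison above can be sketched as follows. This is a minimal illustration, not the project's actual code: the feature matrix is synthetic stand-in data, and BernoulliNB is an assumption for which Naive Bayes variant was usable (the text only says Multinomial and Gaussian were not).

```python
# Sketch of the three-model comparison with C tuned by grid search.
# make_classification stands in for the real Amazon review features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB  # assumption: Bernoulli variant

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune the inverse regularization strength C over the range mentioned above
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_tr, y_tr)

models = {"logreg": grid.best_estimator_,
          "nb": BernoulliNB().fit(X_tr, y_tr),
          "perceptron": Perceptron().fit(X_tr, y_tr)}
scores = {name: m.score(X_val, y_val) for name, m in models.items()}
print(scores)
```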
For the dataset, since reviewText, summary and category were the only features available, we were constrained in the number of numeric parameters. I therefore combined reviewText and summary into a single text field and engineered several numeric features from it, such as average word length, punctuation count, and word uniqueness. These features help characterize the sentiment and quality of a review: for example, longer words and less punctuation might indicate a positive review with a higher overall score (overall being the target column withheld from train.csv that we had to train our models to predict).
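The engineered features described above might look like the following helpers. The function names are illustrative, not the exact ones used in the project.

```python
# Hypothetical implementations of the engineered review features:
# average word length, punctuation count, and unique-word ratio.
import string

def avg_word_length(text):
    """Mean length of whitespace-separated tokens."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def punctuation_count(text):
    """Number of punctuation characters in the text."""
    return sum(ch in string.punctuation for ch in text)

def unique_word_ratio(text):
    """Fraction of tokens that are distinct (case-insensitive)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

review = "Great product, works exactly as described! Great value."
print(avg_word_length(review), punctuation_count(review), unique_word_ratio(review))
```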
For the binary classification task, among the three models above, logistic regression with its inverse regularization strength (C) tuned between 1 and 10 outperformed Naive Bayes and the Perceptron. Naive Bayes's confusion matrix was consistently skewed toward false positives, which is why it performed the worst of the three. For preprocessing, I applied TF-IDF separately to the combined reviewText-summary field and to category, since trial and error showed that category, too, contributed to the overall score. I also used a HashingVectorizer on category to capture its recurring occurrences across reviews. I performed an 80-20 split into training and validation sets and measured each model's accuracy, ROC AUC, and F1 macro scores before evaluating on the test set meant for the Kaggle competition.
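A minimal sketch of that preprocessing and evaluation pipeline is below. The tiny DataFrame is made-up stand-in data (the real rows come from the course train.csv), and the column names follow the ones given in the text.

```python
# TF-IDF on the combined reviewText+summary field, plus TF-IDF and a
# HashingVectorizer on category, feeding a logistic regression classifier.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "reviewText": ["works great", "broke in a day", "love it", "terrible buy"] * 25,
    "summary": ["great", "bad", "great", "bad"] * 25,
    "category": ["Electronics", "Toys", "Electronics", "Toys"] * 25,
    "label": [1, 0, 1, 0] * 25,
})
df["text"] = df["reviewText"] + " " + df["summary"]  # reviewText-summary combo

pre = ColumnTransformer([
    ("text_tfidf", TfidfVectorizer(), "text"),
    ("cat_tfidf", TfidfVectorizer(), "category"),
    ("cat_hash", HashingVectorizer(n_features=2**8), "category"),
])
clf = Pipeline([("pre", pre), ("logreg", LogisticRegression(C=1.0, max_iter=1000))])

# 80-20 train/validation split, scored on accuracy, macro F1, and ROC AUC
X_tr, X_val, y_tr, y_val = train_test_split(
    df[["text", "category"]], df["label"], test_size=0.2, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_val)
print(accuracy_score(y_val, pred),
      f1_score(y_val, pred, average="macro"),
      roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
```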
Logistic regression frequently outperformed the other models, especially beyond a cutoff of 1, and had the highest accuracy, F1 macro, and ROC AUC scores. Multiclass classification followed a similar pattern, with logistic regression again outperforming the other two models. However, since the feature space became rather large, I reduced the dimensionality of the columns by pruning the rather expensive HashingVectorizer call in the pipeline. Beyond this, I reused the same synthetic numeric features that captured sentiment and linguistic patterns from the reviews.
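The multiclass variant with a pruned hashing step could be sketched as follows. The five-class labels stand in for the review scores, the texts are made up, and the small `n_features` value illustrates the pruning; none of these are the project's actual values.

```python
# Multiclass logistic regression over a deliberately small hashed
# feature space (pruned relative to the binary-task pipeline).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["awful", "bad", "okay product", "good item", "excellent quality"] * 20
labels = [1, 2, 3, 4, 5] * 20  # stand-in for the 1-5 overall score

hasher = HashingVectorizer(n_features=2**8, alternate_sign=False)
clf = make_pipeline(hasher, LogisticRegression(C=1.0, max_iter=1000))
clf.fit(texts, labels)
preds = clf.predict(texts)
print(clf.score(texts, labels))
```

Pruning `n_features` trades a risk of hash collisions for a much smaller, cheaper feature matrix, which is the dimensionality reduction described above.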