7.4 Exercises
1. Modify the caretSVM example to use TF-IDF weighting for the DFM rather than raw term frequency. Does this affect the performance of the SVM model when evaluated using cross-validation?
2. Using the movie review sentiment data, train a logistic regression model with the glmnet method using the caret::train() function (ensure you have the glmnet package installed). Use cross-validation for training. Compare the performance of this model with that of the SVM model using a confusion matrix, an ROC-AUC plot, and a calibration plot.
3. Investigate the impact of various pre-processing steps (e.g. removing numbers, stemming versus non-stemming, and using bigrams) on the performance of the Naive Bayes model with the Manifesto Project data. Evaluate performance using cross-validation and compare confusion matrices.
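As a starting point for the TF-IDF exercise, the following sketch shows how the weighting step could be swapped in before training. It assumes a quanteda document-feature matrix named dfmat and a label factor y, as in the chapter's caretSVM example; those object names are placeholders, not part of the exercise.

```r
# Sketch: re-weight the DFM with tf-idf before passing it to caret.
# dfmat and y are assumed to exist from the chapter's caretSVM example.
library(quanteda)
library(caret)

dfmat_tfidf <- dfm_tfidf(dfmat)            # replace raw counts with tf-idf weights
x <- convert(dfmat_tfidf, to = "matrix")   # caret expects a matrix or data frame

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)
svm_tfidf <- train(x = x, y = y,
                   method = "svmLinear",   # linear-kernel SVM via kernlab
                   trControl = ctrl)
svm_tfidf$results                          # compare accuracy with the raw-count model
```

Comparing svm_tfidf$results against the equivalent results from the raw-frequency model answers the cross-validation question in the exercise.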
4. Investigate how to deal with imbalanced classes in text classification. For a dataset with imbalanced classes, research a relevant technique (e.g. the ROSE package or the sampling options in the trainControl() function of the caret package) and apply it during cross-validated training. Does this improve performance on the minority class compared to training without addressing the imbalance?
5. For either the SVM or the Naive Bayes model, research how to tune hyperparameters using the tuneGrid argument in caret::train(). Implement a simple hyperparameter tuning process using cross-validation, then report the performance of the best-tuned model.
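For the class-imbalance exercise, one option to explore is the sampling argument of trainControl(), which resamples within each cross-validation fold ("up", "down", "rose", or "smote"; "rose" requires the ROSE package). The sketch below assumes predictors x and an imbalanced two-level factor y; the object names are placeholders.

```r
# Sketch: down-sample the majority class inside each CV fold via caret.
# x (predictors) and y (imbalanced two-class factor) are assumed to exist.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "down",          # resample within each fold
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
nb_balanced <- train(x = x, y = y,
                     method = "naive_bayes",     # Naive Bayes via the naivebayes package
                     metric = "ROC",
                     trControl = ctrl)
confusionMatrix(nb_balanced)                     # inspect minority-class sensitivity
```

Fitting the same model without the sampling argument and comparing the two confusion matrices shows whether minority-class performance improves.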
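For the hyperparameter-tuning exercise, a minimal tuneGrid sketch for the linear SVM is shown below; its single tuning parameter in caret is the cost C. The grid values and object names (x, y) are illustrative assumptions.

```r
# Sketch: tune the SVM cost parameter C over a small grid with cross-validation.
# x (predictors) and y (labels) are assumed to exist from earlier exercises.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(C = c(0.01, 0.1, 1, 10))    # hypothetical search grid
svm_tuned <- train(x = x, y = y,
                   method = "svmLinear",
                   trControl = ctrl,
                   tuneGrid = grid)
svm_tuned$bestTune                               # best C found by cross-validation
svm_tuned$results                                # performance across the grid
```

Reporting svm_tuned$bestTune together with the corresponding row of svm_tuned$results covers the "best-tuned model" part of the exercise.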