7.4 Exercises
1. Modify the caretSVM example to use TF-IDF weighting for the DFM rather than raw term frequency. Does this affect the performance of the SVM model when evaluated using cross-validation?
2. Using the movie review sentiment data, train a logistic regression model with the glmnet method using the caret::train() function (ensure you have the glmnet package installed). Use cross-validation for training. Compare the performance of this model with that of the SVM model using a confusion matrix, an ROC-AUC plot, and a calibration plot.
3. Investigate the impact of various pre-processing steps (e.g. removing numbers, stemming versus non-stemming, and using bigrams) on the performance of the Naive Bayes model with the Manifesto Project data. Evaluate performance using cross-validation and compare confusion matrices.
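As a starting point for the TF-IDF exercise, the following sketch shows how the weighting step could be swapped in before training. It assumes a quanteda document-feature matrix named dfmat and a label factor y, as in the chapter's caretSVM example; those object names are placeholders, not part of the exercise.

```r
# Sketch: re-weight the DFM with tf-idf before passing it to caret.
# dfmat and y are assumed to exist from the chapter's caretSVM example.
library(quanteda)
library(caret)

dfmat_tfidf <- dfm_tfidf(dfmat)            # replace raw counts with tf-idf weights
x <- convert(dfmat_tfidf, to = "matrix")   # caret expects a matrix or data frame

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)
svm_tfidf <- train(x = x, y = y,
                   method = "svmLinear",   # linear-kernel SVM via kernlab
                   trControl = ctrl)
svm_tfidf$results                          # compare accuracy with the raw-count model
```

Comparing svm_tfidf$results against the equivalent results from the raw-frequency model answers the cross-validation question in the exercise.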
4. Investigate how to deal with imbalanced classes in text classification. For a dataset with imbalanced classes, research a relevant technique (e.g. the ROSE package or the sampling options in the trainControl() function of the caret package) and apply it during cross-validated training. Does this improve performance on the minority class compared to training without addressing the imbalance?
5. For either the SVM or the Naive Bayes model, research how to tune hyperparameters using the tuneGrid argument in caret::train(). Implement a simple hyperparameter tuning process using cross-validation, then report the performance of the best-tuned model.
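For the class-imbalance exercise, one option to explore is the sampling argument of trainControl(), which resamples within each cross-validation fold ("up", "down", "rose", or "smote"; "rose" requires the ROSE package). The sketch below assumes predictors x and an imbalanced two-level factor y; the object names are placeholders.

```r
# Sketch: down-sample the majority class inside each CV fold via caret.
# x (predictors) and y (imbalanced two-class factor) are assumed to exist.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "down",          # resample within each fold
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
nb_balanced <- train(x = x, y = y,
                     method = "naive_bayes",     # Naive Bayes via the naivebayes package
                     metric = "ROC",
                     trControl = ctrl)
confusionMatrix(nb_balanced)                     # inspect minority-class sensitivity
```

Fitting the same model without the sampling argument and comparing the two confusion matrices shows whether minority-class performance improves.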
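For the hyperparameter-tuning exercise, a minimal tuneGrid sketch for the linear SVM is shown below; its single tuning parameter in caret is the cost C. The grid values and object names (x, y) are illustrative assumptions.

```r
# Sketch: tune the SVM cost parameter C over a small grid with cross-validation.
# x (predictors) and y (labels) are assumed to exist from earlier exercises.
library(caret)

set.seed(42)
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(C = c(0.01, 0.1, 1, 10))    # hypothetical search grid
svm_tuned <- train(x = x, y = y,
                   method = "svmLinear",
                   trControl = ctrl,
                   tuneGrid = grid)
svm_tuned$bestTune                               # best C found by cross-validation
svm_tuned$results                                # performance across the grid
```

Reporting svm_tuned$bestTune together with the corresponding row of svm_tuned$results covers the "best-tuned model" part of the exercise.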