Text classification has many commercial applications. For example, in the article "Multi-Class Text Classification with Scikit-Learn", Susan Li describes a machine learning method for assigning a new "Consumer Complaint Narrative" to one of 12 categories.
For this project, I used mostly the same methodology to assign a joke to one of 34 categories (the categories used on the Central Comedy page).
We have 34 imbalanced classes.
- Some Tf-idf term weighting results:
- Model results:
Applying the same approach to 5 balanced classes.
- Exploring some Logistic Regression results:
- The model apparently does not differentiate between "money jokes" and "looking good jokes" very well.
- "dirty jokes" and "insults jokes" are often classified as "miscellaneous jokes", probably because "miscellaneous jokes" has a broad definition. This needs more investigation.
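Observations like the ones above usually come from reading a confusion matrix. A minimal sketch of how this could be done, assuming scikit-learn: fit a Tf-idf plus Logistic Regression pipeline, collect cross-validated predictions, and list the largest off-diagonal cells. The toy jokes, the three class names, and the helper `most_confused` are hypothetical and are not taken from the project.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline


def most_confused(cm, names, top=3):
    """Return (true, predicted, count) for the largest off-diagonal cells."""
    pairs = [
        (names[i], names[j], int(cm[i, j]))
        for i in range(len(names))
        for j in range(len(names))
        if i != j and cm[i, j] > 0
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:top]


# Hypothetical toy data: three jokes per class.
names = ["money", "looking good", "miscellaneous"]
jokes = [
    "my bank account is so empty the money left an echo",
    "i asked my wallet for money and it laughed",
    "money talks but mine only says goodbye",
    "my mirror said i look good and then it cracked",
    "i look so good the mirror took a selfie",
    "my haircut looks good only in the dark mirror",
    "why did the chicken cross the road",
    "a horse walks into a bar",
    "knock knock who is there",
]
labels = ["money"] * 3 + ["looking good"] * 3 + ["miscellaneous"] * 3

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Each joke is predicted by a model that did not see it during training.
predicted = cross_val_predict(model, jokes, labels, cv=3)

# Rows are true classes, columns are predicted classes, in the order of names.
cm = confusion_matrix(labels, predicted, labels=names)
print(cm)
print(most_confused(cm, names))
```

Each tuple reads as "jokes of the first class that were predicted as the second class", which makes pairs such as "dirty jokes" predicted as "miscellaneous jokes" easy to spot.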
Tuning the models' parameters properly: as can be seen in my GitHub repository, the models are flat.
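One common way to approach this tuning is a cross-validated grid search over a pipeline. The sketch below assumes scikit-learn and a Logistic Regression classifier; the parameter grid, the toy jokes, and the choice of macro F1 as the scoring function are illustrative assumptions, not the settings used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: two jokes per class.
jokes = [
    "i asked my wallet for money and it laughed",
    "money talks but mine only says goodbye",
    "my mirror said i look good and then it cracked",
    "i look so good the mirror took a selfie",
    "why did the chicken cross the road",
    "a horse walks into a bar",
]
labels = ["money", "money", "looking good",
          "looking good", "miscellaneous", "miscellaneous"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "clf__C": [0.1, 1.0, 10.0],               # inverse regularization strength
}

# Every combination is scored with 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(jokes, labels)

print(search.best_params_)
print(search.best_score_)
```

On real data a larger `cv` value and a wider grid would be more appropriate; the tiny values here only keep the example fast.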
Finding appropriate metrics to compare the models’ performance.
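To illustrate why the choice of metric matters with imbalanced classes, here is a small sketch comparing accuracy with macro-averaged F1. It is written in plain Python so the arithmetic is visible; scikit-learn's `f1_score` with `average="macro"` computes the same quantity. The labels and predictions are invented for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced case: 8 "miscellaneous" jokes, 2 "money" jokes,
# and a classifier that always predicts the majority class.
y_true = ["miscellaneous"] * 8 + ["money"] * 2
y_pred = ["miscellaneous"] * 10

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Here the accuracy is 0.8 even though the "money" class is never predicted, while macro F1 drops to about 0.44 because every class counts equally.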