JalajVora/Text-Analytics-with-Multi-Class-and-Imbalanced-Learning

Text-Analytics-with-Multi-Class-and-Imbalanced-Learning

This project is part of the Advanced Topics in Machine Learning course. A more detailed description is available in the project documentation.

Problem: Genre Identification on (a sub-set of) Gutenberg Corpus

Consider this set of books belonging to 19th-century English fiction.

The data set is created from Project Gutenberg and consists of about 1000 books across roughly 10 genres. The task is genre detection, i.e. multi-class classification: each data point is a fiction book labeled with its genre. The project tackles three main challenges:

  1. Extracting features that are relevant to fiction books, such as sentiment and setting, using appropriate libraries.
  2. Outlining all the models used, and explaining why and how model selection was performed.
  3. Explaining how the model is evaluated and how the data set is partitioned, taking into account potential challenges such as class imbalance.
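
The third challenge can be illustrated with a minimal sketch. It is not taken from the project's code: the genre names, the 80/15/5 label counts, and the helper names `stratified_split` and `macro_f1` are all assumptions made for illustration. The sketch shows a stratified train/test split, which keeps each genre's share roughly equal in both parts, and macro-averaged F1, which weights every genre equally regardless of size.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each label's share kept roughly equal in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Keep at least one test example per label when the label has 2+ items.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, made-up label distribution (NOT the real corpus statistics).
labels = ["genre_a"] * 80 + ["genre_b"] * 15 + ["genre_c"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

# Majority-class baseline: always predict the most frequent training label.
majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
y_true = [labels[i] for i in test_idx]
y_pred = [majority] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}  macro_f1={macro_f1(y_true, y_pred):.2f}")
```

On skewed labels like these, the majority-class baseline scores well on accuracy but poorly on macro F1, which is why a per-class metric is usually the more informative choice for imbalanced genre data.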