KhaLee2307/text-classification_machine-translation

Text Classification and Machine Translation

  • This is a practice exercise for CS221 - Natural Language Processing (University of Information Technology - VNUHCM).
  • We follow a standard pipeline for building models that solve two NLP problems, Text Classification and Machine Translation:
    1. Preprocessing:
      • Vietnamese: Unicode normalization, Vietnamese punctuation standardization, word segmentation, lowercasing, sentence normalization
      • English: punctuation standardization, lowercasing
    2. Data preparation: split the dataset into train, validation, and test sets.
    3. Word embedding: a technique in natural language processing (NLP) that represents words numerically, typically as vectors, so that machine learning algorithms can process them.
      • Common approaches include One-Hot Encoding, Count-based Embedding (LSA), Prediction-based Embedding (Word2Vec; GloVe is often listed here too, although it is fit to global co-occurrence counts), and Contextual Embedding (BERT).
      • In this project, we use two popular vectorizers: CountVectorizer and TfidfVectorizer.
    4. Model selection and training:
      • Text Classification: Support Vector Machine (SVM), Naive Bayes (NB), Logistic Regression (LR)
      • Machine Translation: Encoder-Decoder LSTM (because of resource and time limitations, I train for only 100 epochs on a small portion of the data)
    5. Evaluation:
      • Text Classification: compare the models on 4 metrics: accuracy, precision, recall, and F1 score (detailed results are in the notebook)
  • Both problems use the same dataset, Domain_specific_EVCorpus_Done.
  • We also use libraries (underthesea, gensim, unicodedata) for preprocessing. For Vietnamese-specific processing, VnCoreNLP and pyvi are also options.
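
The preprocessing step listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the function name `normalize_text` and the punctuation set are assumptions made for the example, and Vietnamese word segmentation is left out because it needs a dedicated segmenter such as underthesea.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: Unicode normalization, punctuation spacing, lowercasing."""
    # NFC composes a base letter and its combining marks into one code point,
    # so visually identical Vietnamese strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Put spaces around common punctuation so each mark becomes its own token.
    # The punctuation set here is an assumption, not the project's exact list.
    text = re.sub(r"([.,!?;:()\"])", r" \1 ", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


# The same phrase written with combining marks and with precomposed letters.
decomposed = "Tie\u0302\u0301ng Vie\u0323\u0302t"
composed = "Ti\u1ebfng Vi\u1ec7t"

print(normalize_text(decomposed) == normalize_text(composed))
print(normalize_text("Hello, World!"))
```

Without the NFC step, the two spellings of the same phrase would be treated as different tokens by any later vectorizer.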
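
The text classification steps (data split, vectorization, training, evaluation) might be wired together roughly as follows with scikit-learn. This is a sketch under stated assumptions: the toy sentences and the `sport`/`tech` labels are made up for illustration and are not from Domain_specific_EVCorpus_Done, the 60/20/20 split ratio is an assumption, and the hyperparameters are defaults rather than the notebook's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up toy data standing in for the real corpus (hypothetical labels).
sport = [
    "the team won the football match", "the striker scored a late goal",
    "the coach praised the players", "the match ended in a draw",
    "fans cheered for the home team", "the goalkeeper saved a penalty",
    "the league season starts next week", "the player signed with the club",
    "the tournament final was played yesterday", "the referee stopped the game",
]
tech = [
    "the new phone has a faster processor", "the software update fixes bugs",
    "the laptop battery lasts ten hours", "developers released a new library",
    "the company launched a cloud service", "the app crashed after the update",
    "engineers trained a neural network", "the chip maker announced a processor",
    "the website was hit by a cyber attack", "the system supports new hardware",
]
texts = sport + tech
labels = ["sport"] * len(sport) + ["tech"] * len(tech)

# Assumed 60/20/20 split: carve out 40%, then halve it into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.4, stratify=labels, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)

vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "SVM": lambda: SVC(kernel="linear"),
    "NB": MultinomialNB,
    "LR": lambda: LogisticRegression(max_iter=1000),
}

results = {}
for vec_name, make_vec in vectorizers.items():
    for clf_name, make_clf in classifiers.items():
        # A fresh vectorizer and classifier per combination; the vectorizer
        # is fit on the training split only.
        model = make_pipeline(make_vec(), make_clf())
        model.fit(X_train, y_train)
        pred = model.predict(X_val)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_val, pred, average="macro", zero_division=0
        )
        results[(vec_name, clf_name)] = {
            "accuracy": accuracy_score(y_val, pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }

for key, scores in results.items():
    print(key, scores)
```

Putting the vectorizer inside the pipeline keeps validation and test vocabulary out of the fitted features. The test split stays untouched until a final model has been chosen on the validation scores.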
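
For reference, the four evaluation metrics can be computed by hand from the confusion counts of a binary classifier. The sketch below uses made-up labels and a hypothetical helper name, `binary_metrics`; the multi-class results in the notebook would additionally need an averaging scheme (for example macro averaging).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Hypothetical helper: accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 2 false positives, 1 false negative, 2 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Here accuracy is 5/8, precision is 3/5, recall is 3/4, and F1 is 2/3. The four numbers can diverge noticeably even on a small sample, which is why several metrics are compared.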

About

Some tasks in NLP: Text Classification and Machine Translation
