- This is a practice exercise for the course CS221 - Natural Language Processing (University of Information Technology - VNUHCM).
- We follow a standard pipeline for building models that solve two NLP problems, Text Classification and Machine Translation:
- Preprocessing:
  - Vietnamese: Unicode normalization, Vietnamese punctuation standardization, word segmentation, lowercasing, sentence normalization
  - English: punctuation standardization, lowercasing
- Data preparation: split the dataset into training, validation, and test sets.
- Word embedding: an NLP technique that represents words numerically, typically as vectors, so that machine learning algorithms can process them.
  - Common methods include One-Hot Encoding, count-based embeddings (LSA, GloVe), prediction-based embeddings (Word2Vec), and contextual embeddings (BERT).
  - In this project, we use two popular count-based vectorizers from scikit-learn: CountVectorizer and TfidfVectorizer.
- Model selection and training:
  - Text Classification: Support Vector Machine (SVM), Naive Bayes (NB), Logistic Regression (LR)
  - Machine Translation: Encoder-Decoder LSTM (because of resource and time constraints, we train for only 100 epochs on a small portion of the data)
- Evaluation:
  - Text Classification: compare the models on four metrics: accuracy, precision, recall, and F1 score (detailed results are in the notebook)
- Both problems use the Domain_specific_EVCorpus_Done dataset.
- Preprocessing relies on the underthesea, gensim, and unicodedata libraries. VnCoreNLP and pyvi are alternative tools for Vietnamese text processing.
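
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code. The function names are hypothetical, punctuation standardization is assumed here to mean isolating punctuation marks with single spaces, and word segmentation is left as an optional callable (for example, a wrapper around a tokenizer from underthesea or pyvi) so the sketch runs without extra dependencies. Sentence normalization is not covered.

```python
import re
import unicodedata


def normalize_unicode(text):
    # Compose characters so that visually identical Vietnamese strings
    # (precomposed vs. base letter + combining marks) compare equal.
    return unicodedata.normalize("NFC", text)


def standardize_punctuation(text):
    # Assumption: "standardizing" means surrounding each punctuation mark
    # with spaces, then collapsing runs of whitespace.
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess_english(text):
    return standardize_punctuation(text.lower())


def preprocess_vietnamese(text, segment=None):
    text = normalize_unicode(text)
    text = standardize_punctuation(text)
    if segment is not None:
        # Hypothetical hook: a callable that maps a sentence to its
        # word-segmented form (e.g. multi-syllable words joined by "_").
        text = segment(text)
    return text.lower()


print(preprocess_english("Hello, World!"))  # hello , world !
print(preprocess_vietnamese("Xin chào, Việt Nam!"))
```

Lowercasing is applied after segmentation because some segmenters rely on capitalization to detect proper nouns; that ordering is a design choice of this sketch, not something the project states.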
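
The data preparation step can be illustrated with scikit-learn's train_test_split applied twice. The helper name split_dataset, the 10% validation and test fractions, and the seed are assumptions made for this sketch; the project does not state the proportions it uses.

```python
from sklearn.model_selection import train_test_split


def split_dataset(samples, labels, val_size=0.1, test_size=0.1, seed=42):
    # First hold out the test set from the full dataset.
    x_rest, x_test, y_rest, y_test = train_test_split(
        samples, labels, test_size=test_size, random_state=seed
    )
    # val_size is a fraction of the full dataset, so rescale it
    # relative to what remains after removing the test set.
    relative_val = val_size / (1.0 - test_size)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=relative_val, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)


train, val, test = split_dataset(list(range(100)), [i % 2 for i in range(100)])
print(len(train[0]), len(val[0]), len(test[0]))
```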
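
To make the difference between the two vectorizers concrete, here is a small sketch on a made-up English corpus (the sentences are illustrative and not taken from the dataset). CountVectorizer produces raw term counts, while TfidfVectorizer reweights those counts by inverse document frequency and, with its default settings, L2-normalizes each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)  # sparse matrix of raw term counts

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)   # sparse matrix of TF-IDF weights

the_column = count_vec.vocabulary_["the"]
print(counts.shape)
print(counts[0, the_column])              # "the" occurs twice in the first sentence
```

Both vectorizers learn their vocabulary in fit_transform, so in a real pipeline they should be fitted on the training split only and then applied to the validation and test splits with transform.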
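
Training the three classifiers and comparing them on the four metrics might look like the sketch below. Several details are assumptions, since the project description does not specify them: LinearSVC as the SVM variant, MultinomialNB as the Naive Bayes variant, TF-IDF features, macro averaging, and the toy texts and labels, which are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data standing in for the real corpus.
train_texts = [
    "the team won the match", "a great goal in the final game",
    "the new phone has a fast chip", "this laptop runs the latest software",
]
train_labels = ["sports", "sports", "tech", "tech"]
test_texts = ["the game ended with a late goal", "a software update for the phone"]
test_labels = ["sports", "tech"]

models = {
    "SVM": LinearSVC(),
    "NB": MultinomialNB(),
    "LR": LogisticRegression(max_iter=1000),
}

results = {}
for name, classifier in models.items():
    # The pipeline fits the vectorizer on training text only.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(train_texts, train_labels)
    predicted = pipeline.predict(test_texts)
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, predicted, average="macro", zero_division=0
    )
    results[name] = {
        "accuracy": accuracy_score(test_labels, predicted),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }

for name, scores in results.items():
    print(name, scores)
```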
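
For the machine translation model, the sketch below shows the general shape of an encoder-decoder LSTM trained with teacher forcing. It is written in PyTorch purely as an assumption, since the project description does not name the framework; the class names, layer sizes, and random token ids are all hypothetical. For brevity it also skips padding-aware handling of variable-length sentences.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):
        # Only the final (hidden, cell) state is passed on to the decoder.
        _, state = self.lstm(self.embedding(src_ids))
        return state


class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_ids, state):
        outputs, state = self.lstm(self.embedding(tgt_ids), state)
        return self.out(outputs), state


class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.encoder = Encoder(src_vocab, embed_dim, hidden_dim)
        self.decoder = Decoder(tgt_vocab, embed_dim, hidden_dim)

    def forward(self, src_ids, tgt_in_ids):
        logits, _ = self.decoder(tgt_in_ids, self.encoder(src_ids))
        return logits


# Random token ids stand in for tokenized sentence pairs (id 0 = padding).
model = Seq2Seq(src_vocab=50, tgt_vocab=60)
src = torch.randint(1, 50, (4, 7))   # batch of 4 source sentences, length 7
tgt = torch.randint(1, 60, (4, 9))   # batch of 4 target sentences, length 9

# Teacher forcing: feed target tokens 0..n-2, predict tokens 1..n-1.
logits = model(src, tgt[:, :-1])
loss_fn = nn.CrossEntropyLoss(ignore_index=0)
loss = loss_fn(logits.reshape(-1, 60), tgt[:, 1:].reshape(-1))
loss.backward()
print(logits.shape, loss.item())
```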
Some tasks in NLP: Text Classification and Machine Translation
KhaLee2307/text-classification_machine-translation