This NLP project predicts the dialect of Arabic text using two machine learning models: a random forest and a recurrent neural network (RNN). Arabic NLP problems are particularly challenging because of the language's complex grammar and varied letter forms. In addition, Arabic is spoken in many countries, each with its own dialect. The objective of this project is to develop a robust model that accurately predicts the dialect of a given input text.
The dataset used in this project is a collection of Arabic sentences labeled with their dialects from five countries: Egypt ('EG'), Lebanon ('LB'), Libya ('LY'), Sudan ('SD'), and Morocco ('MA'). The dataset is imbalanced, with the majority of the data coming from the 'EG' dialect. You can find the original paper of the dataset here.
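One common way to compensate for an imbalanced label distribution like this one is to weight each class inversely to its frequency during training. The sketch below is only an illustration of that idea: the helper name `balanced_class_weights` and the toy label counts are hypothetical, and the README does not state whether the project applies class weighting at all.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical toy distribution, NOT the real dataset counts.
toy_labels = ["EG"] * 6 + ["LB"] * 2 + ["LY"] + ["SD"] + ["MA"] * 2
weights = balanced_class_weights(toy_labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

Weights of this form could then be passed to a training routine that accepts per-class weights, so that errors on the rarer dialects count for more than errors on 'EG'.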
- Data Fetching
- Data Preprocessing
- Model Training
- Deployment
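As an illustration of what the preprocessing stage might involve, here is a minimal sketch of Arabic text normalization. The specific steps (removing diacritics and tatweel, unifying alef variants, dropping non-Arabic characters) are assumptions based on common practice for Arabic text, and `normalize_arabic` is a hypothetical helper, not the project's actual code.

```python
import re

# Assumed normalization rules; the project's real preprocessing may differ.
# Diacritics (tashkeel, U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, and hamza below.
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# Anything that is not an Arabic letter (U+0621..U+064A) or whitespace.
_NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")


def normalize_arabic(text):
    """Normalize a raw Arabic string before feature extraction."""
    text = _DIACRITICS.sub("", text)           # drop diacritics and tatweel
    text = _ALEF_VARIANTS.sub("\u0627", text)  # map alef variants to bare alef
    text = _NON_ARABIC.sub(" ", text)          # replace other characters with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


print(normalize_arabic("hello  \u0645\u0631\u062d\u0628\u0627 123!"))
```

Whether to unify letter variants is a design choice: it reduces vocabulary size, but some spelling differences can themselves be signals of dialect, so aggressive normalization may not always help.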
The Random Forest model achieved a macro-F1 score of 70%, while the RNN model achieved a macro-F1 score of 82%.
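Macro-F1 averages the per-class F1 scores with equal weight, so it is not dominated by the largest class the way accuracy can be on imbalanced data. The following is a small pure-Python sketch of the metric; the function name `macro_f1` and the toy labels are illustrative only and do not reproduce the project's evaluation code or results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy predictions: the single 'LB' sentence is misread as 'EG'.
y_true = ["EG", "EG", "EG", "EG", "LB", "MA"]
y_pred = ["EG", "EG", "EG", "EG", "EG", "MA"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case the accuracy is high because most samples are 'EG', while the macro-F1 is noticeably lower because the missed 'LB' class contributes an F1 of zero to the average.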
This project was developed by:
- Muhammad Raafat
- Mahmoud Mohsen
- Sherif Ahmed
- Fatma Gamal