Detect the Language using NLP.
This is a project to detect the language of a given text using machine learning algorithms such as K-Nearest Neighbors, Random Forest, and Multinomial Naive Bayes.
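This project requires the pandas, numpy, matplotlib, seaborn, and scikit-learn (imported as sklearn) libraries, plus the re module, which ships with the Python standard library and needs no installation. To install the third-party packages, run: pip install pandas numpy matplotlib seaborn scikit-learn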
The dataset used in this project is "Language Detection.csv" and is available on Kaggle.
The first step of the project is to preprocess the data: symbols and numbers are removed, and the text is converted to lowercase. This step is implemented in the clean_function function.
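The cleaning step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: only the name clean_function comes from the project, while the choice to strip ASCII punctuation and ASCII digits (and to leave every other character alone, so that non-Latin scripts survive intact) is an assumption.

```python
import re
import string

# Assumption: "symbols" means ASCII punctuation and "numbers" means ASCII digits.
_SYMBOLS = re.compile("[" + re.escape(string.punctuation) + "]")
_DIGITS = re.compile(r"[0-9]+")


def clean_function(text):
    """Lowercase the text and replace symbols and digits with spaces."""
    text = text.lower()
    text = _SYMBOLS.sub(" ", text)
    text = _DIGITS.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


print(clean_function("Hello, World! 123"))  # hello world
```

With pandas, a function like this would typically be applied to the text column, for example df["Text"].apply(clean_function), assuming the column is named Text.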
The project then visualizes the distribution of the languages present in the data using a bar plot and a pie chart.
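One way to draw both charts is sketched below. It assumes the label column is named Language (the Text and Language column names, the helper name plot_language_distribution, and the tiny stand-in DataFrame are all assumptions for illustration), and it uses plain matplotlib rather than seaborn to keep the example short.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_language_distribution(df, column="Language"):
    """Draw a bar plot and a pie chart of how often each language occurs."""
    counts = df[column].value_counts()
    fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(12, 5))

    # Bar plot: one bar per language, height = number of samples.
    ax_bar.bar(counts.index, counts.values)
    ax_bar.set_xlabel("Language")
    ax_bar.set_ylabel("Number of samples")
    ax_bar.tick_params(axis="x", rotation=90)

    # Pie chart: share of each language in the dataset.
    ax_pie.pie(counts.values, labels=counts.index, autopct="%1.1f%%")

    fig.tight_layout()
    return counts, fig


# Stand-in data; the real project would load the CSV instead, e.g.
# df = pd.read_csv("Language Detection.csv")
df = pd.DataFrame(
    {
        "Text": ["hello", "good morning", "bonjour", "hola"],
        "Language": ["English", "English", "French", "Spanish"],
    }
)
counts, fig = plot_language_distribution(df)
```

Calling plt.show() afterwards displays the figure in an interactive session.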
The next step is to select the machine learning algorithms to build the model. In this project, we are using K-Nearest Neighbors, Random Forest, and Multinomial Naive Bayes algorithms. The dataset is split into training and testing datasets, and the models are trained on the training data.
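The split-and-train step might look like the sketch below. The README does not say how text is turned into features, so the CountVectorizer over character n-grams is an assumption, as are the 75/25 split, the hyperparameters, and the toy corpus that stands in for the Kaggle data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption); the real project uses "Language Detection.csv".
texts = [
    "the weather is nice today", "where is the train station",
    "i would like a cup of tea", "this book is very interesting",
    "il fait beau aujourd'hui", "où est la gare s'il vous plaît",
    "je voudrais une tasse de thé", "ce livre est très intéressant",
    "hoy hace buen tiempo", "dónde está la estación de tren",
    "quisiera una taza de té", "este libro es muy interesante",
]
labels = ["English"] * 4 + ["French"] * 4 + ["Spanish"] * 4

# Hold out 25% of the samples for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "MNB": MultinomialNB(),
}

fitted = {}
for name, clf in classifiers.items():
    # Each model gets its own vectorizer, fitted on the training split only.
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 2)), clf
    )
    model.fit(X_train, y_train)
    fitted[name] = model
    print(name, "test accuracy:", model.score(X_test, y_test))
```

Wrapping the vectorizer and classifier in one pipeline keeps the test data out of the vocabulary-building step, which avoids leaking information from the test split into training.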
Finally, the performance of each model is evaluated using its accuracy score and confusion matrix. The best-performing model is selected based on these metrics.
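As a small illustration of the two metrics, the sketch below computes them with scikit-learn on hand-written labels. The y_true and y_pred values are invented for the example and are not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels (assumption): one French sample is misread as Spanish.
y_true = ["English", "French", "Spanish", "English", "French", "Spanish"]
y_pred = ["English", "French", "Spanish", "English", "Spanish", "Spanish"]
label_order = ["English", "French", "Spanish"]

# Fraction of samples whose predicted language matches the true one.
acc = accuracy_score(y_true, y_pred)

# Rows are true languages, columns are predicted languages, in label_order.
cm = confusion_matrix(y_true, y_pred, labels=label_order)

print(f"accuracy: {acc:.3f}")  # accuracy: 0.833
print(cm)
```

Off-diagonal entries in the matrix show which pairs of languages a model confuses, which accuracy alone does not reveal.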
The Multinomial Naive Bayes (MNB) model reaches an accuracy of 0.98, meaning it assigns the correct language to about 98% of the samples it is evaluated on.