epicalekspwner/BigScaleAnalytics2021

🕵️ Project Description

To improve one's skills in a new foreign language, it is important to read texts in that language. But for learning to be really effective, these texts have to be at the reader's language level. However, it is difficult to find texts that are close to someone's knowledge level (A1 to C2).

This project aims to build a model for English speakers that predicts the difficulty of a written French text. This can then be used, e.g., in a recommendation system to suggest texts appropriate for someone's language level. If someone is at the A1 level in French, presenting a B2-level text is inappropriate, as they won't be able to understand it. Ideally, a text should contain many known words and a few unknown ones, so that the reader can improve.

This project is iterative and was conducted in multiple milestones due throughout the semester.

🎯 AIcrowd Final Result


🧠🧠 Application link

https://epicalekspwner-bsa2021.uc.r.appspot.com/


📚 Review of the Existing Literature

🥨 Learning German

🥖 Learning French

🐟 Learning Portuguese

💭 How Do We Intend to Solve the Problem?

After much thought, we plan to approach this project as a classification problem. After building and training our model, its output is a discrete class label: in our case, the predicted difficulty (from A1 to C2) of an unlabelled sentence. Modeling the problem as classification also lets us evaluate the model in terms of accuracy, whose interpretation is intuitive in our case.

The feature engineering we wish to explore can be represented by the following points:

📃 Words

  • Categorize the different types of words (part-of-speech tagging)
  • Count the word frequency for each category (i.e., the more frequent a word is, the easier it should be to assimilate)
  • Create a dictionary for each level that contains the most frequent words
  • Analyze the grouping of letters
  • Deal with deceptive cognates, i.e., words that look alike in French and English but do not share a meaning (using a list of the 139 most common deceptive cognates)
  • Deal with cognates, i.e., words that look alike in French and English and whose meaning can be deduced; two possible options:
      ◦ Option 1: a list of 58,000 English words (to be translated into French and then lemmatized, both to obtain similar roots)
      ◦ Option 2: look at the suffixes (e.g., words ending in "tion" have a high probability of having a straightforward English translation)
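As an illustration, the per-level frequency dictionaries could be built along these lines. This is a minimal sketch on toy data; the function name, the toy sentences, and the naive tokenization are ours, not the project's actual code:

```python
from collections import Counter

def level_frequency_dicts(labelled_sentences, top_n=3):
    """For each CEFR level, list its most frequent words.

    `labelled_sentences` is a list of (sentence, level) pairs.
    Tokenization here is deliberately naive (split + strip punctuation).
    """
    counts = {}
    for sentence, level in labelled_sentences:
        words = [w.lower().strip(".,;:!?") for w in sentence.split()]
        counts.setdefault(level, Counter()).update(w for w in words if w)
    return {level: [w for w, _ in c.most_common(top_n)]
            for level, c in counts.items()}

# Toy data, purely for illustration:
data = [
    ("Le chat dort.", "A1"),
    ("Le chien dort aussi.", "A1"),
    ("La conjoncture économique demeure incertaine.", "C1"),
]
print(level_frequency_dicts(data))
```

A real version would of course be built from the full labelled dataset and combined with lemmatization.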

📃 Sentences

  • Measure sentence length (i.e., the shorter the sentence, the easier it is to understand, and vice versa)
  • Count punctuation marks (i.e., a more complex sentence tends to contain more punctuation to handle grammatical difficulty)
  • Count the different types of words (i.e., a complex sentence tends to combine several word types: noun, verb, adverb, pronoun, preposition, conjunction, etc.)
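The first two sentence-level features above can be sketched in a few lines (the function name and feature keys are illustrative, not from the project code):

```python
import string

def sentence_features(sentence):
    """Length and punctuation counts for one sentence (minimal sketch)."""
    tokens = sentence.split()
    # Count every character that is an ASCII punctuation mark.
    n_punct = sum(ch in string.punctuation for ch in sentence)
    return {"n_words": len(tokens),
            "n_chars": len(sentence),
            "n_punct": n_punct}

print(sentence_features("Bonjour, comment allez-vous ?"))
```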

🤖 Potential Algorithms

  • Logistic Regression
  • Naive Bayes Classifier
  • K-nearest Neighbors
  • Decision Tree
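These candidates could be compared with scikit-learn along the following lines. This is a sketch on a made-up toy dataset with TF-IDF features, not the project's actual training code:

```python
# Illustrative comparison of the candidate classifiers on toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset: simple vs. complex French sentences.
X = ["Le chat dort.",
     "Il mange une pomme.",
     "La conjoncture économique demeure incertaine.",
     "Nonobstant les aléas, la stratégie fut entérinée."]
y = ["A1", "A1", "C1", "C1"]

for clf in (LogisticRegression(max_iter=1000), MultinomialNB(),
            KNeighborsClassifier(n_neighbors=1), DecisionTreeClassifier()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X, y)
    print(type(clf).__name__, model.predict(["Le chien dort."])[0])
```

In practice, each pipeline would be evaluated with cross-validation rather than eyeballed on a single sentence.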

📦 Libraries We Intend to Use

  • NLTK Snowball (Stemmer)
  • Spacy French LEFFF
  • French specific libraries
  • And more 😉

---------------------------------------------------------------------------------------------------------------------------------

🤘 First iteration: Creating/Evaluating the Model

Dataset used: Team Amazon dataset

First Model Using Text Classification in Google Cloud Natural Language AutoML

  • Model type: Single-label classification
  • Test items: 102
  • Precision: 80%
  • Recall: 35.29%

🗜️ Confusion Matrix

Note 1: We can see that our model has difficulty predicting the B2 and C1 levels. One reason might be our team's assessment of these levels; one solution could be to reevaluate these two levels more carefully.

Note 2: Our model is highly sensitive to sentence length. If we repeat a fairly simple sentence multiple times, it can end up rated as high as C2. We will need to investigate this concern further!

Custom Model: Feature Engineering

| Done | Feature Name | Method |
|------|--------------|--------|
| ✔️ | Sentence length | Return the length of the sentence |
| ✔️ | Type of words | Return a dict {"Word": "Type of word"} |
| ✔️ | Number of punctuation marks | Return the number of punctuation marks in the sentence |
| ✔️ | Deceptive cognates | Return the number of deceptive cognates in the sentence (see Graph (a)) |
| ✔️ | Cognates | Return a list of 14,000 possible cognates and the similarity between the two roots (French and English) |
| ✔️ | Common words per category | Create a list of the most common words for each category |

Graph (a): Deceptive Cognates

We can see in the graph above that the more complex a sentence is, the more deceptive cognates (also known as false friends) it contains.

---------------------------------------------------------------------------------------------------------------------------------

🤘🤘 Second iteration: Iterate & Improve

Dataset used: TAs' dataset

Screenshots of the Prototype

The interface was fully coded in HTML and consisted of a simple input box that let the user submit a sentence for analysis.

The application handled only the case where the user actually entered a sentence; clicking "Predict" with no input systematically returned an error. Given an input, the application returned a very simple response on a second HTML page.
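The flow described above (one input box, an error on empty input, otherwise a simple result page) could be sketched in Flask roughly as follows. Route names, messages, and the test-client demo are illustrative, not the actual application code:

```python
from flask import Flask, request

app = Flask(__name__)

FORM = ('<form method="post" action="/predict">'
        '<input name="sentence"><button>Predict</button></form>')

@app.route("/")
def home():
    # Home page: just the input form in this sketch.
    return FORM

@app.route("/predict", methods=["POST"])
def predict():
    sentence = request.form.get("sentence", "").strip()
    if not sentence:
        # Empty input: return an error, as the prototype did.
        return "Error: no sentence was entered.", 400
    # The real app would send the sentence to the trained model here.
    return f"Predicted difficulty for: {sentence}"

# Quick demo with Flask's built-in test client:
client = app.test_client()
print(client.post("/predict", data={"sentence": ""}).status_code)  # 400
```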

Some Changes From Milestone 2

For this version, we used the dataset provided by the TAs to train the models, whereas in Milestone 2 we used our own dataset, which was biased.

We tried several models, from basic ones (linear regression) to more complex ones (Google Cloud, GC). Of these, we kept the Google Cloud Natural Language classification model, which performed best in terms of accuracy on AIcrowd, at 53.3%. We tried a few combinations, but unfortunately they did not improve our ranking on AIcrowd.

What Did We Use as Libraries or Services?

  • Cloud services: Google AutoML (Regression and NLP), Google Colab, Google App Engine
  • NLP libraries: spaCy (multilingual package), NLTK (multilingual package), Camembert (French package)
  • Machine learning libraries: Scikit-Learn
  • App: Flask + Python

📐 General Architecture

🧠 Cognates Problem

The idea: when we were young, our teachers gave us "tips" on how to detect cognates between French and English. We had to look at the suffixes ...
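That suffix tip can be sketched as a simple heuristic. The suffix list below is a small illustrative sample, not the list actually used by the project:

```python
# Suffixes that often signal a near-identical English counterpart
# (illustrative sample only).
COGNATE_SUFFIXES = ("tion", "sion", "able", "ible", "al")

def likely_cognate(french_word):
    """Heuristic: flag a French word as a probable English cognate by suffix."""
    return french_word.lower().endswith(COGNATE_SUFFIXES)

print(likely_cognate("nation"))   # True
print(likely_cognate("fromage"))  # False
```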

🧠 Deceptive Cognates

We also took false friends into account and created a function that counts the number of deceptive cognates in a sentence, in order to integrate it into our models.
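A minimal sketch of such a counter, assuming a tiny illustrative false-friend set (the project's real list had 139 entries):

```python
# Tiny illustrative sample of French/English false friends.
FALSE_FRIENDS = {"librairie", "actuellement", "blesser", "attendre"}

def count_deceptive_cognates(sentence):
    """Count words of the sentence that appear in the false-friend set."""
    words = [w.lower().strip(".,;:!?") for w in sentence.split()]
    return sum(w in FALSE_FRIENDS for w in words)

print(count_deceptive_cognates("Il faut attendre devant la librairie."))  # 2
```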

🌡️ Models Evaluation

| Model | Parameters | Internal Accuracy/R²/F1 (or Google Cloud Precision/Recall) | AIcrowd Accuracy | Confusion matrix | Note |
|-------|------------|-------------------------------------------------------------|------------------|------------------|------|
| **Regression Algorithms (RA)** 📈 📉 | | | | | |
| Linear regression (1) | None | R²: 0.31 | — | — | — |
| Logistic regression (2) | standardization, penalty='l2', solver='lbfgs', cv=8, max_iter=3000, random_state=72 | R²: 0.37 | — | — | — |
| Support vector regression (3) | StandardScaler(), SVR(C=5, epsilon=0.8), round() | R²: 0.46 | 0.49 | — | Good model |
| **Regression Algorithms with Camembert (RACAM)** 📈 📉 🧀 | | | | | |
| Linear regression (1) | None | R²: 0.36 | — | — | — |
| Logistic regression (2) | standardization, penalty='l2', solver='lbfgs', cv=8, max_iter=3000, random_state=72 | R²: 0.41 | — | — | — |
| Support vector regression (3) | StandardScaler(), SVR(C=1, epsilon=1), round() | R²: 0.54 | — | — | The French package Camembert increases the R². As the POS tags are more precise, they give more detail, so the results are better |
| **Classification Algorithms (CA)** 📁📂 | | | | | |
| Support vector classifier (1) | C=6 | F1: 41% | — | (image) | — |
| Support vector classifier, Camembert (1.2) | C=3 | F1: 45% | — | (image) | — |
| Logistic regression (2) | 'LR__C': 6, 'LR__max_iter': 1000 | F1: 39% | — | (image) | — |
| K-nearest neighbors (3) | 'knn__leaf_size': 10, 'knn__n_neighbors': 17, 'knn__p': 1, 'knn__weights': 'uniform' | F1: 38% | — | (image) | — |
| Decision tree (4) | 'DT__max_depth': 3, 'DT__min_samples_split': 5 | F1: 33% | — | (image) | — |
| Random forest (5) | 'RF__bootstrap': True, 'RF__criterion': 'entropy', 'RF__n_estimators': 18 | F1: 45% | — | (image) | — |
| **Google Cloud (GC)** ⛅️ | | | | | |
| Classification (1) | None | Precision: 58.51%, Recall: 35.41%, F1: 44% | 53.3% | (image) | Good at predicting A2 and B1; a good base for combined models |
| Classification, lemmatized (2) | None | Precision: 60.94%, Recall: 29.96%, F1: 40% | — | (image) | Good at predicting the A1, B2 and C2 levels. Training on highly preprocessed, lemmatized sentences helped us better predict these levels |
| Regression (3) | None | R²: 0.497 | — | — | — |
| Classification, reduced length (3) | dataset sentence length reduced | Precision: 61.6%, Recall: 43.02%, F1: 51% | 51.1% | (image) | Sentence length and punctuation were reduced to make the model less biased by length. The results are not as good as expected |
| **Algorithm Combinations** | | | | | |
| GC (1) + GC (2) | (1): A2, B1, C1 & (2): A1, B2, C2 | — | 51.5% (base (2)), 52.8% (base (1)) | — | Not as good as expected. Perhaps we should have combined the two models using the predicted probabilities; here the combination was based only on the confusion-matrix results |
| GC (1) + RA (3) | None | — | 44% | — | — |
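The probability-based combination suggested for GC (1) + GC (2) could be sketched as follows. The probability vectors below are made-up examples, not real model outputs:

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def combine(probs_model1, probs_model2):
    """Average two models' class-probability vectors and take the argmax.

    Each argument is a dict mapping level -> probability (illustrative).
    """
    avg = {lvl: (probs_model1[lvl] + probs_model2[lvl]) / 2 for lvl in LEVELS}
    return max(avg, key=avg.get)

# Made-up outputs for one sentence from two hypothetical models:
p1 = {"A1": 0.05, "A2": 0.40, "B1": 0.30, "B2": 0.10, "C1": 0.10, "C2": 0.05}
p2 = {"A1": 0.10, "A2": 0.20, "B1": 0.45, "B2": 0.10, "C1": 0.10, "C2": 0.05}
print(combine(p1, p2))  # B1
```

Weighted averaging (e.g., trusting each model more on the levels it predicts well) would be a natural next step.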

---------------------------------------------------------------------------------------------------------------------------------

✔️ Final Application

The final model used in our application is GC (1).

Home Page

  • Eye-catcher
  • Input field for the sentence to be analysed (processed when "Predict" is clicked or deleted when "Reset" is clicked)
  • Brief presentation of the project
  • Direct link to our team's Github repo

Result Page

  • Input sentence and its translation
  • Predicted level of difficulty with its corresponding probability
  • Prediction matrix containing the remaining levels and their respective probabilities
  • Explanation of the different levels
  • Dependency parse visualizer

Missing Input Error Page

  • Warning to inform the user that the analysis could not be performed because no sentence was entered in the field provided
  • Input field to continue without having to return to the home page to be able to enter a sentence

Backend Error Page

  • Warning to inform the user that the analysis could not be performed because there is a malfunction in the backend
  • Invitation to try the analysis again later, as nothing more can be done for the moment

🧛🧛🧛‍♀️ Team Work Distribution

Aleksandar:

  • Application (Flask, UI, etc.)
  • Google App engine

Gauthier:

  • Literature review
  • Readme
  • Punctuation analysis function

Maxime-Lucie:

  • Google AutoML: NLP and regression
  • Amazon Notebook: EDA, preprocessing, models (single and combined) and analysis of the models

🗄️ Sources

🧠 Cognates

🗃️ Dataset

📗 Books

  • Barnes, Djuna. 1986. Le Bois de la nuit. Points roman.
  • Césaire, Aimé. 1939. Cahier d'un retour au pays natal. Paris: Pierre Bordas.
  • De Beauvoir, Simone. 1949. Le Deuxième Sexe. Paris: NRF.
  • De La Fontaine, Jean. 1778. Fables de La Fontaine. Fides.
  • De Maupassant, Guy. 1885. Bel-Ami. Paris: Victor Havard.
  • De Maupassant, Guy. 1887. Le Horla. Paris: Paul Ollendorff.
  • De Saint-Exupéry, Antoine. 1943. Le petit prince. Paris: Gallimard.
  • Diome, Fatou. 2003. Le ventre de l'Atlantique. Paris: Anne Carrière.
  • Flaubert, Gustave. 1857. Madame Bovary. Paris: Michel Lévy frères.
  • Echenoz, Jean. 2001. Jérôme Lindon. Paris: Éditions de Minuit.
  • Pennac, Daniel. 2007. Chagrin d'école. Paris: Éditions Gallimard.
  • Proust, Marcel. 1913. Du côté de chez Swann. Paris: Bernard Grasset.
  • Proust, Marcel. 1918. À l'ombre des jeunes filles en fleurs. Paris: Éditions Gallimard.
  • Queneau, Raymond. 1947. Exercices de style. Paris: Gallimard.
  • Rostand, Edmond. 1898. Cyrano de Bergerac. Paris: Charpentier et Fasquelle.
  • Schmitt, Éric-Emmanuel. 2001. Monsieur Ibrahim et les fleurs du Coran. Paris: Albin Michel.
  • Verne, Jules. 1896. Vingt mille lieues sous les mers. Paris: Hetzel.
  • Voltaire. 1759. Candide. Genève: Gabriel Cramer.
  • Zola, Émile. 1883. Au bonheur des dames. Paris: Georges Charpentier.
  • Zola, Émile. 1877. L’Assommoir. Paris: Georges Charpentier.

🔬 Studies

📰 Online Articles

🌐 Other Websites

About

Big-Scale Analytics: Group Project
