epicalekspwner/BigScaleAnalytics2021

🕵️ Project Description

To improve one's skills in a new foreign language, it is important to read texts in that language. But for learning to be really effective, these texts have to be at the reader's language level. However, it is difficult to find texts that are close to someone's knowledge level (A1 to C2).

This project aims to build a model for English speakers that predicts the difficulty of a written French text. This can then be used, e.g., in a recommendation system to suggest texts appropriate for someone's language level. If someone is at the A1 level in French, presenting a B2-level text is inappropriate, as they won't be able to understand it. Ideally, a text should contain many known words and a few unknown ones, so that the reader can improve.

This project is iterative and was conducted in multiple milestones due throughout the semester.

🎯 AIcrowd Final Result


🧠🧠 Application link

https://epicalekspwner-bsa2021.uc.r.appspot.com/


📚 Review of the Existing Literature

🥨 Learning German

🥖 Learning French

🐟 Learning Portuguese

💭 How Do We Intend to Solve the Problem?

After much thought, we plan to approach this project as a classification problem. After building and training our model, its output is a discrete class label: in our case, the predicted difficulty (from A1 to C2) of an unlabelled sentence. Modeling the problem as classification also lets us evaluate the model in terms of accuracy, whose interpretation is intuitive in our case.

The feature engineering we wish to explore can be represented by the following points:

📃 Words

  • Categorize the different types of words (part-of-speech tagging)
  • Count the word frequency for each category (i.e., the more frequent a word is, the easier it should be to assimilate)
  • Create a dictionary for each level that contains the most frequent words
  • Analyze the grouping of letters
  • Deal with deceptive cognates, i.e., words that look alike in French and English but do not share a meaning (using a list of the 139 most common deceptive cognates)
  • Deal with cognates, i.e., words that look alike in French and English and whose meaning can be deduced; two possible options:
      ◦ Option 1: a list of 58,000 English words (to be translated into French and then lemmatized, both to obtain similar roots)
      ◦ Option 2: look at the suffixes (e.g., words ending in "tion" have a high probability of having a straightforward English translation)
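As an illustration, the per-level frequency dictionaries could be built along these lines. This is a minimal sketch on toy data; the function name, the toy sentences, and the naive tokenization are ours, not the project's actual code:

```python
from collections import Counter

def level_frequency_dicts(labelled_sentences, top_n=3):
    """For each CEFR level, list its most frequent words.

    `labelled_sentences` is a list of (sentence, level) pairs.
    Tokenization here is deliberately naive (split + strip punctuation).
    """
    counts = {}
    for sentence, level in labelled_sentences:
        words = [w.lower().strip(".,;:!?") for w in sentence.split()]
        counts.setdefault(level, Counter()).update(w for w in words if w)
    return {level: [w for w, _ in c.most_common(top_n)]
            for level, c in counts.items()}

# Toy data, purely for illustration:
data = [
    ("Le chat dort.", "A1"),
    ("Le chien dort aussi.", "A1"),
    ("La conjoncture économique demeure incertaine.", "C1"),
]
print(level_frequency_dicts(data))
```

A real version would of course be built from the full labelled dataset and combined with lemmatization.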

📃 Sentences

  • Measure sentence length (i.e., the shorter the sentence, the easier it is to understand, and vice versa)
  • Count punctuation marks (i.e., a more complex sentence tends to contain more punctuation to handle grammatical difficulty)
  • Count the different types of words (i.e., a complex sentence tends to combine several word types: noun, verb, adverb, pronoun, preposition, conjunction, etc.)
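The first two sentence-level features above can be sketched in a few lines (the function name and feature keys are illustrative, not from the project code):

```python
import string

def sentence_features(sentence):
    """Length and punctuation counts for one sentence (minimal sketch)."""
    tokens = sentence.split()
    # Count every character that is an ASCII punctuation mark.
    n_punct = sum(ch in string.punctuation for ch in sentence)
    return {"n_words": len(tokens),
            "n_chars": len(sentence),
            "n_punct": n_punct}

print(sentence_features("Bonjour, comment allez-vous ?"))
```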

🤖 Potential Algorithms

  • Logistic Regression
  • Naive Bayes Classifier
  • K-nearest Neighbors
  • Decision Tree
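These candidates could be compared with scikit-learn along the following lines. This is a sketch on a made-up toy dataset with TF-IDF features, not the project's actual training code:

```python
# Illustrative comparison of the candidate classifiers on toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset: simple vs. complex French sentences.
X = ["Le chat dort.",
     "Il mange une pomme.",
     "La conjoncture économique demeure incertaine.",
     "Nonobstant les aléas, la stratégie fut entérinée."]
y = ["A1", "A1", "C1", "C1"]

for clf in (LogisticRegression(max_iter=1000), MultinomialNB(),
            KNeighborsClassifier(n_neighbors=1), DecisionTreeClassifier()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X, y)
    print(type(clf).__name__, model.predict(["Le chien dort."])[0])
```

In practice, each pipeline would be evaluated with cross-validation rather than eyeballed on a single sentence.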

📦 Libraries We Intend to Use

  • NLTK Snowball (Stemmer)
  • Spacy French LEFFF
  • French specific libraries
  • And more 😉

---------------------------------------------------------------------------------------------------------------------------------

🤘 First iteration: Creating/Evaluating the Model

Dataset used: Team Amazon dataset

First Model Using Text Classification in Google Cloud Natural Language AutoML

  • Model type: Single-label classification
  • Test items: 102
  • Precision: 80%
  • Recall: 35.29%

🗜️ Confusion Matrix

Note 1: We can see that our model has difficulty predicting the B2 and C1 levels. One reason might be our team's assessment of these levels; one solution could be to reevaluate these two levels more carefully.

Note 2: Our model is highly sensitive to sentence length. If we repeat a fairly simple sentence multiple times, it can end up rated as high as C2. We will need to investigate this concern further!

Custom Model: Feature Engineering

| Done | Feature Name | Method |
|------|--------------|--------|
| ✔️ | Sentence length | Return the length of the sentence |
| ✔️ | Type of words | Return a dict {"Word": "Type of word"} |
| ✔️ | Number of punctuation marks | Return the number of punctuation marks in the sentence |
| ✔️ | Deceptive cognates | Return the number of deceptive cognates in the sentence (see Graph (a)) |
| ✔️ | Cognates | Return a list of 14,000 possible cognates and the similarity between the two roots (French and English) |
| ✔️ | Common words per category | Create a list of the most common words for each category |

Graph (a): Deceptive Cognates

We can see in the graph above that the more complex a sentence is, the more deceptive cognates (also known as false friends) it contains.

---------------------------------------------------------------------------------------------------------------------------------

🤘🤘 Second iteration: Iterate & Improve

Dataset used: TAs' dataset

Screenshots of the Prototype

The interface was fully coded in HTML and consisted of a simple input box that let the user submit a sentence for analysis.

The application handled only the case where the user actually entered a sentence; clicking "Predict" with no input systematically returned an error. Given an input, the application returned a very simple response on a second HTML page.
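The flow described above (one input box, an error on empty input, otherwise a simple result page) could be sketched in Flask roughly as follows. Route names, messages, and the test-client demo are illustrative, not the actual application code:

```python
from flask import Flask, request

app = Flask(__name__)

FORM = ('<form method="post" action="/predict">'
        '<input name="sentence"><button>Predict</button></form>')

@app.route("/")
def home():
    # Home page: just the input form in this sketch.
    return FORM

@app.route("/predict", methods=["POST"])
def predict():
    sentence = request.form.get("sentence", "").strip()
    if not sentence:
        # Empty input: return an error, as the prototype did.
        return "Error: no sentence was entered.", 400
    # The real app would send the sentence to the trained model here.
    return f"Predicted difficulty for: {sentence}"

# Quick demo with Flask's built-in test client:
client = app.test_client()
print(client.post("/predict", data={"sentence": ""}).status_code)  # 400
```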

Some Changes From Milestone 2

For this version, we used the dataset provided by the TAs to train the models, whereas in Milestone 2 we used our own dataset, which was biased.

We tried several models, from basic ones (linear regression) to more complex ones (Google Cloud, GC). Of these, we kept the Google Cloud Natural Language classification model, which performed best in terms of accuracy on AIcrowd, at 53.3%. We tried a few combinations, but unfortunately they did not improve our ranking on AIcrowd.

What Did We Use as Libraries or Services?

  • Cloud services: Google AutoML (Regression and NLP), Google Colab, Google App Engine
  • NLP libraries: spaCy (multilingual package), NLTK (multilingual package), Camembert (French package)
  • Machine learning libraries: Scikit-Learn
  • App: Flask + Python

📐 General Architecture

🧠 Cognates Problem

The idea: when we were young, our teachers gave us "tips" on how to detect cognates between French and English. We had to look at the suffixes ...
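That suffix tip can be sketched as a simple heuristic. The suffix list below is a small illustrative sample, not the list actually used by the project:

```python
# Suffixes that often signal a near-identical English counterpart
# (illustrative sample only).
COGNATE_SUFFIXES = ("tion", "sion", "able", "ible", "al")

def likely_cognate(french_word):
    """Heuristic: flag a French word as a probable English cognate by suffix."""
    return french_word.lower().endswith(COGNATE_SUFFIXES)

print(likely_cognate("nation"))   # True
print(likely_cognate("fromage"))  # False
```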

🧠 Deceptive Cognates

We also took false friends into account and created a function that counts the number of deceptive cognates in a sentence, in order to integrate it into our models.
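A minimal sketch of such a counter, assuming a tiny illustrative false-friend set (the project's real list had 139 entries):

```python
# Tiny illustrative sample of French/English false friends.
FALSE_FRIENDS = {"librairie", "actuellement", "blesser", "attendre"}

def count_deceptive_cognates(sentence):
    """Count words of the sentence that appear in the false-friend set."""
    words = [w.lower().strip(".,;:!?") for w in sentence.split()]
    return sum(w in FALSE_FRIENDS for w in words)

print(count_deceptive_cognates("Il faut attendre devant la librairie."))  # 2
```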

🌡️ Models Evaluation

| Model | Parameters | Internal Accuracy/R²/F1 (or Google Cloud Precision/Recall) | AIcrowd Accuracy | Confusion matrix | Note |
|-------|------------|-------------------------------------------------------------|------------------|------------------|------|
| **Regression Algorithms (RA)** 📈 📉 | | | | | |
| Linear regression (1) | None | R²: 0.31 | — | — | — |
| Logistic regression (2) | standardization, penalty='l2', solver='lbfgs', cv=8, max_iter=3000, random_state=72 | R²: 0.37 | — | — | — |
| Support vector regression (3) | StandardScaler(), SVR(C=5, epsilon=0.8), round() | R²: 0.46 | 0.49 | — | Good model |
| **Regression Algorithms with Camembert (RACAM)** 📈 📉 🧀 | | | | | |
| Linear regression (1) | None | R²: 0.36 | — | — | — |
| Logistic regression (2) | standardization, penalty='l2', solver='lbfgs', cv=8, max_iter=3000, random_state=72 | R²: 0.41 | — | — | — |
| Support vector regression (3) | StandardScaler(), SVR(C=1, epsilon=1), round() | R²: 0.54 | — | — | The French package Camembert increases the R². As the POS tags are more precise, they give more detail, so the results are better |
| **Classification Algorithms (CA)** 📁📂 | | | | | |
| Support vector classifier (1) | C=6 | F1: 41% | — | (image) | — |
| Support vector classifier, Camembert (1.2) | C=3 | F1: 45% | — | (image) | — |
| Logistic regression (2) | 'LR__C': 6, 'LR__max_iter': 1000 | F1: 39% | — | (image) | — |
| K-nearest neighbors (3) | 'knn__leaf_size': 10, 'knn__n_neighbors': 17, 'knn__p': 1, 'knn__weights': 'uniform' | F1: 38% | — | (image) | — |
| Decision tree (4) | 'DT__max_depth': 3, 'DT__min_samples_split': 5 | F1: 33% | — | (image) | — |
| Random forest (5) | 'RF__bootstrap': True, 'RF__criterion': 'entropy', 'RF__n_estimators': 18 | F1: 45% | — | (image) | — |
| **Google Cloud (GC)** ⛅️ | | | | | |
| Classification (1) | None | Precision: 58.51%, Recall: 35.41%, F1: 44% | 53.3% | (image) | Good at predicting A2 and B1; a good base for combined models |
| Classification, lemmatized (2) | None | Precision: 60.94%, Recall: 29.96%, F1: 40% | — | (image) | Good at predicting the A1, B2 and C2 levels. Training on highly preprocessed, lemmatized sentences helped us better predict these levels |
| Regression (3) | None | R²: 0.497 | — | — | — |
| Classification, reduced length (3) | dataset sentence length reduced | Precision: 61.6%, Recall: 43.02%, F1: 51% | 51.1% | (image) | Sentence length and punctuation were reduced to make the model less biased by length. The results are not as good as expected |
| **Algorithm Combinations** | | | | | |
| GC (1) + GC (2) | (1): A2, B1, C1 & (2): A1, B2, C2 | — | 51.5% (base (2)), 52.8% (base (1)) | — | Not as good as expected. Perhaps we should have combined the two models using the predicted probabilities; here the combination was based only on the confusion-matrix results |
| GC (1) + RA (3) | None | — | 44% | — | — |
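The probability-based combination suggested for GC (1) + GC (2) could be sketched as follows. The probability vectors below are made-up examples, not real model outputs:

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def combine(probs_model1, probs_model2):
    """Average two models' class-probability vectors and take the argmax.

    Each argument is a dict mapping level -> probability (illustrative).
    """
    avg = {lvl: (probs_model1[lvl] + probs_model2[lvl]) / 2 for lvl in LEVELS}
    return max(avg, key=avg.get)

# Made-up outputs for one sentence from two hypothetical models:
p1 = {"A1": 0.05, "A2": 0.40, "B1": 0.30, "B2": 0.10, "C1": 0.10, "C2": 0.05}
p2 = {"A1": 0.10, "A2": 0.20, "B1": 0.45, "B2": 0.10, "C1": 0.10, "C2": 0.05}
print(combine(p1, p2))  # B1
```

Weighted averaging (e.g., trusting each model more on the levels it predicts well) would be a natural next step.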

---------------------------------------------------------------------------------------------------------------------------------

✔️ Final Application

The final model used in our application is GC (1).

Home Page

  • Eye-catcher
  • Input field for the sentence to be analysed (processed when "Predict" is clicked or deleted when "Reset" is clicked)
  • Brief presentation of the project
  • Direct link to our team's Github repo

Result Page

  • Input sentence and its translation
  • Predicted level of difficulty with its corresponding probability
  • Prediction matrix containing the remaining levels and their respective probabilities
  • Explanation of the different levels
  • Dependency parse visualizer

Missing Input Error Page

  • Warning to inform the user that the analysis could not be performed because no sentence was entered in the field provided
  • Input field to continue without having to return to the home page to be able to enter a sentence

Backend Error Page

  • Warning to inform the user that the analysis could not be performed because there is a malfunction in the backend
  • Invitation to try the analysis again later, as nothing more can be done for the moment

🧛🧛🧛‍♀️ Team Work Distribution

Aleksandar:

  • Application (Flask, UI, etc.)
  • Google App engine

Gauthier:

  • Literature review
  • Readme
  • Punctuation analysis function

Maxime-Lucie:

  • Google AutoML: NLP and regression
  • Amazon Notebook: EDA, preprocessing, models (single and combined) and analysis of the models

🗄️ Sources

🧠 Cognates

🗃️ Dataset

📗 Books

  • Barnes, Djuna. 1986. Le Bois de la nuit. Points roman.
  • Césaire, Aimé. 1939. Cahier d'un retour au pays natal. Paris: Pierre Bordas.
  • De Beauvoir, Simone. 1949. Le Deuxième Sexe. Paris: NRF.
  • De La Fontaine, Jean. 1778. Fables de La Fontaine. Fides.
  • De Maupassant, Guy. 1885. Bel-Ami. Paris: Victor Havard.
  • De Maupassant, Guy. 1887. Le Horla. Paris: Paul Ollendorff.
  • De Saint-Exupéry, Antoine. 1943. Le petit prince. Paris: Gallimard.
  • Diome, Fatou. 2003. Le ventre de l'Atlantique. Paris: Anne Carrière.
  • Flaubert, Gustave. 1857. Madame Bovary. Paris: Michel Lévy frères.
  • Echenoz, Jean. 2001. Jérôme Lindon. Paris: Éditions de Minuit.
  • Pennac, Daniel. 2007. Chagrin d'école. Paris: Éditions Gallimard.
  • Proust, Marcel. 1913. Du côté de chez Swann. Paris: Bernard Grasset.
  • Proust, Marcel. 1918. À l'ombre des jeunes filles en fleurs. Paris: Éditions Gallimard.
  • Queneau, Raymond. 1947. Exercices de style. Paris: Gallimard.
  • Rostand, Edmond. 1898. Cyrano de Bergerac. Paris: Charpentier et Fasquelle.
  • Schmitt, Éric-Emmanuel. 2001. Monsieur Ibrahim et les fleurs du Coran. Paris: Albin Michel.
  • Verne, Jules. 1896. Vingt mille lieues sous les mers. Paris: Hetzel.
  • Voltaire. 1759. Candide. Genève: Gabriel Cramer.
  • Zola, Émile. 1883. Au bonheur des dames. Paris: Georges Charpentier.
  • Zola, Émile. 1877. L’Assommoir. Paris: Georges Charpentier.

🔬 Studies

📰 Online Articles

🌐 Other Websites

About

Big-Scale Analytics: Group Project
