To improve one’s skills in a new foreign language, it is important to read texts in that language. But for learning to be really effective, these texts have to match the reader’s language level. However, it is difficult to find texts that are close to someone’s knowledge level (A1 to C2).
This project aims to build a model for English speakers that predicts the difficulty of a written French text. It can then be used, e.g., in a recommendation system, to suggest texts that are appropriate for someone’s language level. If someone is at A1 level in French, it is inappropriate to present a text at B2 level, as she won’t be able to understand it. Ideally, a text should contain mostly known words plus a few unknown ones so that the reader can improve.
This project is iterative and will be conducted in multiple milestones that are due throughout the semester:
https://epicalekspwner-bsa2021.uc.r.appspot.com/
🥨 Learning German
🥖 Learning French
🐟 Learning Portuguese
After much thought, we plan to approach this project as a classification problem. Once built and trained, our model outputs a discrete class label, i.e., in our case, the predicted difficulty of an unlabelled sentence (from A1 to C2). Modeling this as a classification problem also allows us to evaluate the model in terms of accuracy, whose interpretation is intuitive in our case.
The feature engineering we wish to explore can be represented by the following points:
📃 Words
- Categorize the different types of words (Part-Of-Speech Tagger)
- Count the word frequency for each category (i.e., the more frequent a word is, the easier it should be to assimilate)
- Create a dictionary for each level that contains the most frequent words
- Analyze the grouping of letters
- Deal with deceptive cognates (i.e. words that resemble French ones, but do not have the same meaning) (list of 139 most common deceptive cognates)
- Deal with cognates (i.e. words that resemble French ones and from which one can deduce the meaning): 2 possible options
  - 1st solution: list of 58,000 English words (they need to be translated into French, then both versions lemmatized to obtain similar roots)
  - 2nd solution: look at the suffixes (e.g. words ending in "tion" have a high probability of having a straightforward translation in English)
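The suffix-based option can be sketched as follows. This is a minimal illustration, not the project's implementation: the suffix tuple and the function names `looks_like_cognate` and `cognate_ratio` are assumptions chosen for the example.

```python
from typing import List

# Hypothetical suffix list for illustration; the project's actual list may differ.
COGNATE_SUFFIXES = ("tion", "sion", "able", "ible")


def looks_like_cognate(word: str) -> bool:
    """Return True if the French word ends with one of the assumed suffixes."""
    return word.lower().endswith(COGNATE_SUFFIXES)


def cognate_ratio(words: List[str]) -> float:
    """Share of the given words flagged as likely cognates (0.0 for an empty list)."""
    if not words:
        return 0.0
    return sum(looks_like_cognate(w) for w in words) / len(words)
```

For example, `looks_like_cognate("information")` is true while `looks_like_cognate("chien")` is false. A suffix test like this is cheap but will also flag deceptive cognates, so it would likely need to be combined with the false-friends list.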
📃 Sentences
- Measure the length of sentences (i.e., the shorter the sentence, the easier it is to understand and vice versa)
- Count punctuation (i.e., a more complex sentence will tend to contain more punctuation to handle grammatical difficulty)
- Count the different types of words (i.e., a complex sentence will tend to combine several different types of words: noun, verb, adverb, pronoun, preposition, conjunction, etc.)
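The first two sentence-level features could look like the sketch below. It assumes whitespace tokenization and a punctuation set extended with a few characters common in French text; both choices, and the function names, are assumptions made for illustration. Counting word types would additionally require a part-of-speech tagger and is not shown.

```python
import string

# ASCII punctuation plus guillemets and the ellipsis character (assumed set).
PUNCTUATION = set(string.punctuation) | {"«", "»", "…"}


def sentence_length(sentence: str) -> int:
    """Number of whitespace-separated tokens in the sentence."""
    return len(sentence.split())


def punctuation_count(sentence: str) -> int:
    """Number of characters in the sentence that belong to PUNCTUATION."""
    return sum(ch in PUNCTUATION for ch in sentence)
```

Note that this set treats hyphens and apostrophes as punctuation, which inflates the count for elided forms such as "l'ombre"; whether that is desirable depends on how the feature is used.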
🤖 Potential Algorithms
- Logistic Regression
- Naive Bayes Classifier
- K-nearest Neighbors
- Decision Tree
📦 Libraries We Intend to Use
- NLTK Snowball (Stemmer)
- spaCy French LEFFF
- French specific libraries
- And more 😉
---------------------------------------------------------------------------------------------------------------------------------
Dataset used: Team Amazon dataset
- Model type: Single-label classification
- Test items: 102
- Precision: 80%
- Recall: 35.29%
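For a single-label classifier, precision and recall can differ this much when the model withholds a label for some items (for example, below a confidence threshold). Whether that is what happened here is an assumption; the sketch below only shows how the two numbers are computed in that setting, with `None` standing for "no label emitted".

```python
from typing import List, Optional, Tuple


def precision_recall(
    y_true: List[str], y_pred: List[Optional[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall; a None prediction means no label was emitted."""
    emitted = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    correct = sum(t == p for t, p in emitted)
    # Precision only looks at emitted predictions; recall looks at every test item.
    precision = correct / len(emitted) if emitted else 0.0
    recall = correct / len(y_true) if y_true else 0.0
    return precision, recall
```

As a purely hypothetical illustration, 36 correct labels out of 45 emitted predictions on 102 test items would give a precision of 80% and a recall of 35.29%. The actual counts are not stated here.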
🗜️ Confusion Matrix
Note 1: We can see that our model has difficulty predicting the B2 and C1 levels. One of the reasons might be how our team assessed these levels when labelling. One solution could be to reevaluate these two levels more carefully.
Note 2: Our model is very sensitive to the length of the sentence. If we repeat a fairly simple sentence multiple times, it can end up with a level as high as C2. We will need to investigate this concern further!
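The length sensitivity in Note 2 can be illustrated with a small sketch: a raw count grows when a sentence is repeated, while a per-token average does not. The function and feature names below are hypothetical and are not the features used by our model.

```python
from typing import Dict


def length_features(text: str) -> Dict[str, float]:
    """Compute one length-dependent and one length-independent feature."""
    tokens = text.split()
    n_tokens = len(tokens)
    n_chars = sum(len(t) for t in tokens)
    return {
        # Raw count: multiplied by k when the same sentence is repeated k times.
        "n_tokens": float(n_tokens),
        # Per-token average: unchanged by repetition.
        "mean_token_length": n_chars / n_tokens if n_tokens else 0.0,
    }
```

Repeating "Le chat dort." five times raises `n_tokens` from 3 to 15 but leaves `mean_token_length` unchanged, which suggests that ratio-style features may be one way to reduce this effect.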
Done | Feature Name | Method |
---|---|---|
✔️ | Sentences lengths | Return the length of the sentence |
✔️ | Type of words | Return a dict {"Word": "Type of word"} |
✔️ | Number of punctuation marks | Return the number of punctuation marks in the sentence |
✔️ | Deceptive cognates | Return the number of deceptive cognates in the sentence (see Graph (a)) |
✔️ | Cognates | Return list of 14,000 possible cognates and the similarity between the two roots (French and English) |
✔️ | Common words for each category | Creation of list with the most common words for each category |
Graph (a): Deceptive Cognates
The graph above shows that the more complex a sentence is, the more deceptive cognates (aka false friends) it contains.
---------------------------------------------------------------------------------------------------------------------------------
Dataset used: TAs' dataset
Screenshots of the Prototype
The interface was fully coded in HTML and consisted of a simple input box used to analyse the sentence provided by the user.
The application was coded for only one case, in which the user actually enters a sentence; clicking on "predict" without any input systematically returned an error. When there was an input, the application returned an extremely simple response on a second HTML page.
Some Changes From Milestone 2
For this version, we built the models on the dataset provided by the TAs, whereas in Milestone 2 we used our own dataset, which was biased.
We tried several models, from basic ones (linear regression) to more complex ones (GC). Among them, we selected the Google Cloud Natural Language Classification model, which gave our best accuracy on AIcrowd (53.3%). We also tried a few combinations of models, but unfortunately they did not improve our ranking on AIcrowd.
What Did We Use as Libraries or Services?
- Cloud services: Google AutoML (Regression and NLP), Google Colab, Google App Engine
- NLP libraries: spaCy (multilingual package), NLTK (multilingual package), CamemBERT (French package)
- Machine learning libraries: Scikit-Learn
- App: Flask + Python
📐 General Architecture
🧠 Cognates Problem
The idea: when we were young, our teachers gave us some "tips" on how to detect cognates between French and English. We had to look for the suffixes ...
🧠 Deceptive Cognates
We also took false friends into account and created a function that counts the number of deceptive cognates in a sentence, in order to integrate it into our models.
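A counting function of this kind might look like the sketch below. The four-word lexicon is a tiny illustrative sample rather than the full list of deceptive cognates, and the function name and signature are assumptions.

```python
import re
from typing import Set

# Tiny illustrative sample of French false friends for English speakers;
# the full list used in the project is not reproduced here.
DECEPTIVE_COGNATES = {"actuellement", "librairie", "coin", "location"}


def count_deceptive_cognates(
    sentence: str, lexicon: Set[str] = DECEPTIVE_COGNATES
) -> int:
    """Count the tokens of the sentence that appear in the false-friends lexicon."""
    # Lowercase, then keep runs of letters only (accented letters included).
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    return sum(tok in lexicon for tok in tokens)
```

For instance, "Actuellement, je travaille dans une librairie." contains two entries from this sample. The sketch matches surface forms only, so inflected forms such as plurals would be missed without lemmatization.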
🌡️ Models Evaluation
---------------------------------------------------------------------------------------------------------------------------------
The final model we used in our application is the GC (1)
Home Page
- Eye-catcher
- Input field for the sentence to be analysed (processed when "Predict" is clicked or deleted when "Reset" is clicked)
- Brief presentation of the project
- Direct link to our team's Github repo
Result Page
- Input sentence and its translation
- Predicted level of difficulty with its corresponding probability
- Prediction matrix containing the remaining levels and their respective probabilities
- Explanation of the different levels
- Dependency parse visualizer
Missing Input Error Page
- Warning to inform the user that the analysis could not be performed because no sentence was entered in the field provided
- Input field to continue without having to return to the home page to be able to enter a sentence
Backend Error Page
- Warning to inform the user that the analysis could not be performed because there is a malfunction in the backend
- Invitation for the user to try the analysis again later, as nothing more can be done at the moment
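The way a request maps to one of these pages can be sketched in plain Python, independently of the web framework. Everything here is hypothetical: the function name `choose_page`, the page identifiers, and the `predict` callable (assumed to return a mapping from level to probability) are illustrations, not the application's actual code.

```python
from typing import Callable, Dict, Tuple


def choose_page(
    sentence: str, predict: Callable[[str], Dict[str, float]]
) -> Tuple[str, dict]:
    """Return the page to render and the data it needs."""
    # Empty or whitespace-only input: show the missing-input error page.
    if not sentence or not sentence.strip():
        return "missing_input", {}
    try:
        probabilities = predict(sentence)
    except Exception:
        # Any failure of the prediction service: show the backend error page.
        return "backend_error", {}
    if not probabilities:
        return "backend_error", {}
    # Highest probability first; the rest feed the prediction matrix.
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    best_level, best_prob = ranked[0]
    return "result", {
        "level": best_level,
        "probability": best_prob,
        "others": ranked[1:],
    }
```

Keeping this decision in a pure function makes the three outcomes easy to test without starting the web server.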
Aleksandar:
- Application (flask, UI, etc)
- Google App engine
Gauthier:
- Literature review
- Readme
- Punctuation analysis function
Maxime-Lucie:
- Google AutoML: NLP and regression
- Amazon Notebook: EDA, preprocessing, models (single and combined) and analysis of the models
- English Word List
- English/French Suffixes: Code de Traduction'
- Deceptive Cognates
- French Stop Words (with some modifications)
📗 Books
- Barnes, Djuna. 1986. Le Bois de la nuit. Points roman.
- Césaire, Aimé. 1939. Cahier d'un retour au pays natal. Paris: Pierre Bordas.
- De Beauvoir, Simone. 1949. Le Deuxième Sexe. Paris: NRF.
- De La Fontaine, Jean. 1778. Fables de La Fontaine. Fides.
- De Maupassant, Guy. 1885. Bel-Ami. Paris: Victor Havard.
- De Maupassant, Guy. 1887. Le Horla. Paris: Paul Ollendorff.
- De Saint-Exupéry, Antoine. 1943. Le petit prince. Paris: Gallimard.
- Diome, Fatou. 2003. Le ventre de l'Atlantique. Paris: Anne Carrière.
- Flaubert, Gustave. 1857. Madame Bovary. Paris: Michel Lévy frères.
- Echenoz, Jean. 2001. Jérôme Lindon. Paris: Éditions de Minuit.
- Pennac, Daniel. 2007. Chagrin d'école. Paris: Éditions Gallimard.
- Proust, Marcel. 1913. Du côté de chez Swann. Paris: Bernard Grasset.
- Proust, Marcel. 1918. À l'ombre des jeunes filles en fleurs. Paris: Éditions Gallimard.
- Queneau, Raymond. 1947. Exercices de style. Paris: Gallimard.
- Rostand, Edmond. 1898. Cyrano de Bergerac. Paris: Charpentier et Fasquelle.
- Schmitt, Éric-Emmanuel. 2001. Monsieur Ibrahim et les fleurs du Coran. Paris: Albin Michel.
- Verne, Jules. 1896. Vingt mille lieues sous les mers. Paris: Hetzel.
- Voltaire. 1759. Candide. Genève: Gabriel Cramer.
- Zola, Émile. 1883. Au bonheur des dames. Paris: Georges Charpentier.
- Zola, Émile. 1877. L’Assommoir. Paris: Georges Charpentier.
🔬 Studies
- Klaus, Jacopo. 2019. "Les défis de l'aménagement du territoire dans un système fédéral. L'évolution du rôle des cantons et des communes suisses entre limitations quantitatives et enjeux qualitatifs de l'urbanisation" Thesis, University of Lausanne.
- OECD. 2019. Études économiques de l’OCDE : Synthèse sur la Suisse.
- Office fédéral de la statistique. 2020. Endettement : Arriérés de paiement en 2019.
- Office fédéral de la statistique. 2019. Énergie : Aspects économiques.
- Office fédéral de la statistique. 2020. Enquête suisse sur la santé (ESS) 2017 : Santé et genre.
- Office fédéral de la statistique. 2020. Le système d'indicateurs «Mesure du bien-être».
- Office fédéral de la statistique. 2020. Panorama de la société suisse 2020.
- Office fédéral de la statistique. 2020. Transport routier, ferroviaire et aérien : Coûts et financement des transports 2017.
- Pauchard, Nicolas. 2019. "Gouverner les ressources génétiques. Les stratégies des acteurs face aux droits de propriété et aux règles sur l'accès et le partage des avantages" Thesis, University of Lausanne.
📰 Online Articles
- AFP. 2021. "Spotify se lance dans plus de 80 nouveaux pays." Le Temps, February 23, 2021.
- ATS. 2021. "Plus d'un tiers des appartements suisses sont occupés par des personnes seules." Le Temps, February 25, 2021.
- Bazin, Xavier. 2021. "Vaccin en Israël : des chiffres troublants." FranceSoir, February 25, 2021.
- Colleu, Yannick. 2021. "COVID 19 : les données de Santé Publique France sont-elles fiables ?" FranceSoir, February 24, 2021.
- Etienne, Richard. 2021. "Genève accorde un prêt historique de 200 millions à l’aéroport de Cointrin." Le Temps, February 25, 2021.
- Favre, Laurent. 2021. "Novak Djokovic est imbattable, la preuve par neuf." Le Temps, February 21, 2021.
- FranceSoir, AFP. 2021. "Pays-Bas : le couvre-feu, objet d'un bras de fer judiciaire." FranceSoir, February 17, 2021.
- FranceSoir. 2021. "Les néonicotinoïdes : une menace pour les mammifères." FranceSoir, February 21, 2021.
- Genecand, Marie-Pierre. 2021. "La médiation de voisinage, comme si vous y étiez." Le Temps, February 8, 2021.
🌐 Other Websites
- Bernard, Emeline. 2021. "10 erreurs à ne pas faire avec son chien selon les vétérinaires." Ohmymag, March 2, 2021.
- Bourrelly, Pierre & Lefeuvre, Jean Claude. "CYANOBACTÉRIES ou CYANOPHYCÉES, anc. ALGUES BLEUES" Encyclopædia Universalis.
- "French Texts for Beginners". Lingua.
- La Dépêche. 2021. "Tout savoir avant d’adopter un chat." La Dépêche, February 24, 2021.
- Lavorel, Jean & Mazliak, Paul & Moyse, Alexis. "PHOTOSYNTHÈSE" Encyclopædia Universalis.