This repository applies machine translation (a Seq2Seq attention network) to product categorization on data from an e-commerce website (Flipkart). Product descriptions are classified into the primary category of their category tree, and the repository documents the path to an optimal model pipeline.

Table of Contents

  1. About
  2. Installation
  3. Data Acquisition
  4. Data Exploration & Cleaning
  5. Data Visualization
  6. Approaches
  7. Build on Google Colab
  8. Models Summary
  9. Conclusions
  10. Future Scope
  11. Publication References


Installation

Clone the repo:

git clone
cd Flipkart_Product_Categorization/

Note: The code is implemented in Google Colaboratory, which lets you build the project without installing anything locally. Installing some libraries may take time, depending on your internet connection and system. You can also download the Colab notebook as a Jupyter notebook and run it locally.

Data Acquisition

You can download the E-Commerce Dataset sample from here

Data Exploration and Cleaning

  • The dataset has 2 NA values in the Description column; these rows were dropped.
  • The Product Category Tree has 337 rows with no primary category specified; these were categorised as "Others" to avoid losing data.
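The "Others" fallback described above can be sketched as follows. This is a minimal sketch, not the notebook's code: it assumes the category tree is stored as a string such as `["Clothing >> Women's Clothing >> Shorts"]`, with `>>` separating levels, and that a tree without any separator has no primary category. The function name `primary_category` is hypothetical.

```python
SEPARATOR = ">>"  # assumed level separator inside the category tree string


def primary_category(tree, fallback="Others"):
    """Return the first level of a category tree string, or the fallback label."""
    if not isinstance(tree, str):
        return fallback
    # Drop the surrounding list brackets and quotes: '["A >> B"]' -> 'A >> B'
    path = tree.strip().strip("[]").strip().strip("\"'").strip()
    if SEPARATOR not in path:
        # Assumption: a tree with no separator carries no primary category
        return fallback
    first = path.split(SEPARATOR)[0].strip()
    return first or fallback


print(primary_category('["Clothing >> Women\'s Clothing >> Shorts"]'))  # Clothing
print(primary_category('["Some Product Name"]'))                        # Others
```

Applying such a function to every row keeps the 337 affected rows in the dataset instead of dropping them.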

The following steps are performed:

  1. Tokenization: Split the text into sentences and the sentences into words.
  2. Lowercase the words and remove punctuation.
  3. Words that have fewer than 3 characters are removed.
  4. All stopwords are removed.
  5. Words are lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed into present.
  6. Words are stemmed — words are reduced to their root form.
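The steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's pipeline: `STOPWORDS` is a tiny hypothetical list, `crude_stem` is a naive suffix stripper standing in for a real stemmer, and lemmatization (step 5) is omitted because it needs a dictionary-backed library.

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a full list.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "are", "was", "from", "has", "have"}

# Naive suffix stripper standing in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, min_len=3):
    """Tokenize, lowercase, drop punctuation, short words and stopwords, then stem."""
    tokens = []
    for sentence in re.split(r"[.!?]+", text):            # step 1: split into sentences
        # steps 1-2: lowercase and keep alphabetic runs only (drops punctuation and digits)
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if len(word) < min_len or word in STOPWORDS:   # steps 3-4
                continue
            tokens.append(crude_stem(word))                # step 6 (step 5 omitted)
    return tokens


print(preprocess("The shorts are made of cotton. Buy it for running!"))
# ['short', 'made', 'cotton', 'buy', 'runn']
```

The output "runn" shows why a proper stemmer or lemmatizer is preferable to suffix stripping in practice.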

Note: An alternative approach to cleaning is to use BeautifulSoup and Selenium to scrape the product category from the website via the product URL.

Data Visualization

Value counts of the primary categories after cleaning


Box plot of description length, used to inspect how far the lengths vary from the mean
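The numbers behind such a box plot can be computed directly. The sketch below assumes that "length" means the word count of each description (the source does not say whether words or characters were counted); the function name `length_summary` is hypothetical.

```python
import statistics


def length_summary(descriptions):
    """Summarise description lengths (in words) with the values a box plot displays."""
    lengths = sorted(len(d.split()) for d in descriptions)
    q1, median, q3 = statistics.quantiles(lengths, n=4)  # requires at least 2 descriptions
    return {
        "min": lengths[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": lengths[-1],
        "mean": statistics.mean(lengths),
        "stdev": statistics.stdev(lengths),
    }


summary = length_summary(["red cotton shirt", "watch", "steel strap watch for men"])
print(summary["mean"], summary["median"])
```

A mean well above the median would point to a few very long descriptions pulling the distribution to the right.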


Approaches

Features Used

Feature Name          | Type | Description
Description           | STR  | The description of the product (primary feature)
Product_Category_Tree | STR  | Used to extract the primary category

Techniques:

  1. Topic Modelling (Gensim's Latent Dirichlet Allocation Multicore algorithm)

Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such topic model and is used to assign the text in a document to a particular topic. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
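To make the two distributions concrete, here is a toy LDA fitted with collapsed Gibbs sampling in plain Python. It is a swapped-in technique for illustration only: the repository uses Gensim's multicore LDA, which is a different and far faster implementation. The function name `lda_gibbs` and the hyperparameter values are assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [[0] * V for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                      # total words per topic
    assignments = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the current assignment, then resample it.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed estimates: topics per document (theta) and words per topic (phi).
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        {w: (topic_word[k][word_id[w]] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]
    return theta, phi, vocab


docs = [["cotton", "shirt", "cotton"], ["steel", "watch", "strap"], ["shirt", "cotton", "sleeve"]]
theta, phi, vocab = lda_gibbs(docs, num_topics=2)
print(theta[0])  # topic mixture of the first document
```

Each row of `theta` is one document's distribution over topics, and each entry of `phi` is one topic's distribution over the vocabulary.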

  2. Seq2Seq + Attention + Teacher Forcing Neural Machine Translation (1 attention layer)

Conventional methods for product categorization are based on machine learning classification algorithms. The paper "Don't Classify, Translate: Multi-Level E-Commerce Product Categorization Via Machine Translation" (Maggie Yundi Li, Liling Tan, Stanley Kok, 2018) proposes a different paradigm based on machine translation and reports better prediction accuracy than classification systems. This repository implements the proposed model.
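Two ideas in this technique can be sketched without a deep learning framework: how teacher forcing shifts the target category sequence, and how an attention layer weights the encoder states. This is a minimal sketch under assumptions, not the notebook's model: the `<start>` and `<end>` markers, the function names, and the choice of dot-product scoring are all illustrative, and the source does not state which attention scoring function it uses.

```python
import math


def teacher_forcing_pair(category_tokens, start="<start>", end="<end>"):
    """Build decoder input and target: the input is the target shifted right by one."""
    decoder_input = [start] + list(category_tokens)
    decoder_target = list(category_tokens) + [end]
    return decoder_input, decoder_target


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(decoder_state, encoder_states):
    """Dot-product attention: weights over encoder states and their weighted sum."""
    scores = [sum(q * h for q, h in zip(decoder_state, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[j] for w, state in zip(weights, encoder_states)) for j in range(dim)]
    return weights, context


print(teacher_forcing_pair(["home", "furnishing"]))
# (['<start>', 'home', 'furnishing'], ['home', 'furnishing', '<end>'])

weights, context = dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights)  # the first encoder state gets the larger weight
```

During training the decoder is fed the true previous token from `decoder_input` at each step rather than its own prediction, which is what teacher forcing means.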

Classification Report
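A classification report lists precision, recall and F1 per category alongside overall accuracy. The sketch below shows how those figures are derived from true and predicted labels; the function name `per_class_report` and the example labels are hypothetical, and the numbers are not results from this repository.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Compute precision, recall, F1 and support per label, plus overall accuracy."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label wrongly
            fn[truth] += 1  # missed the true label
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1, "support": actual}
    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return report, accuracy


# Made-up labels for illustration only
y_true = ["Clothing", "Clothing", "Jewellery", "Others"]
y_pred = ["Clothing", "Jewellery", "Jewellery", "Others"]
report, accuracy = per_class_report(y_true, y_pred)
print(accuracy)  # 0.75
```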


Build on Google Colab

To get started, upload the notebooks, open them in playground mode, and run the cells. You must be logged in with your Google account and provide additional authorization. To run locally, a requirements.txt file is provided:

git clone
cd Flipkart_Product_Categorization/
pip install -r requirements.txt

Models Summary

Model Name                            | Accuracy
Seq2Seq + Attention + Teacher Forcing | 81%


Conclusions

  • 337 rows in the dataset have no primary category in their category tree.
  • The dataset is not large, so BERT was not used for classification.
  • RNNs are a good conceptual fit for NLP, but according to research, attention-based methods have been achieving state-of-the-art results on NLP tasks.
  • Previous research on this task proposed machine translation as a way to outperform machine learning classification algorithms. This repository implements that approach, and the Seq2Seq attention model reaches 81% accuracy.
  • The accuracy may be increased by experimenting further with more attention layers and different parameters.

Future Scope

  1. Running LDA using TFIDF
  2. 3-5 Attention Layers in Seq2Seq Translation
  3. 3-5 Attention Layers with Cross Entropy Validation in Seq2Seq Machine Translation
  4. Training the Transformer Model for Machine Translation
  5. Ensembling the best Seq2Seq Attention model explored with the Transformer model
  6. During translation, entities in a sentence play an important role, and their correct translation heavily affects the translation quality of the whole sentence. Therefore, targeting knowledge graphs in encoders, using different combining techniques, can be explored.
  7. More research is needed on the respective strengths of RNNs, CNNs, and transformers/attention, and on ensembling these approaches to combine the best of each.
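For the first item in the list above, running LDA on TF-IDF weights instead of raw counts means down-weighting words that appear in most descriptions. A minimal sketch of that weighting follows, assuming the plain formula tf × ln(N / df); library implementations often use a different logarithm base and add normalization, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight each token count by ln(N / document frequency)."""
    n_docs = len(docs)
    # Number of documents each word appears in
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append(
            {word: count * math.log(n_docs / doc_freq[word]) for word, count in term_freq.items()}
        )
    return weighted


docs = [["cotton", "shirt"], ["cotton", "watch"]]
print(tfidf(docs)[0])  # "cotton" appears in every document, so its weight is 0.0
```

The weighted bag-of-words vectors would then replace the raw counts as the input corpus for the topic model.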

