Implementing the SOTA for Chinese Word Segmentation using Keras

NLP Homework 1

Brief Introduction

  • Replicate the SOTA paper for Chinese segmentation (Stacked BiLSTM or Unstacked BiLSTM model)

Getting started

  • get requirements: pip install -r requirements.txt
  • download dataset wget http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip
  • download pretrained embeddings wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.zh.vec
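
As a reference for wiring the pretrained vectors in, here is a minimal sketch of reading the downloaded wiki.zh.vec file into an embedding matrix. The char2idx vocabulary dict, the reserved padding index 0, and the 300-dimension size are assumptions for illustration, not code from this repo:

```python
import numpy as np

def load_pretrained_embeddings(vec_path, char2idx, embedding_dim=300):
    """Build an embedding matrix from a fastText .vec file.

    char2idx is a hypothetical {character: index} vocabulary; index 0
    is reserved for padding. Characters missing from the pretrained
    file keep their random initialization.
    """
    matrix = np.random.normal(scale=0.1, size=(len(char2idx) + 1, embedding_dim))
    matrix[0] = 0.0  # all-zero padding row, so masking stays consistent
    with open(vec_path, encoding='utf-8') as f:
        next(f)  # the first line of a .vec file is "<vocab_size> <dim>"
        for line in f:
            parts = line.rstrip().split(' ')
            token, vector = parts[0], parts[1:]
            if token in char2idx and len(vector) == embedding_dim:
                matrix[char2idx[token]] = np.asarray(vector, dtype='float32')
    return matrix
```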

How it works

  1. Read configuration files
  2. Read the dataset and preprocess it in preprocessing.py
    1. Create the input file, which is the same dataset but with no spaces between words
    2. Create the labels file, with one BIES tag sequence per line (see the BIES sketch after this list)
      • BIES format >> B: Beginning, I: Inside, E: End, S: Single
      • "The answer is 42!" -> "BIE BIIIIE BE BE S"
  3. Process the dataset in dataset_handling.py
    1. Get unique unigrams & bigrams to create our vocabulary
      1. Save the vocabulary in JSON format for later use
    2. Get the labels of our classes (B, I, E, S)
    3. We need to pad the data before passing it to the model (see the padding sketch after this list)
      1. 1st option: pad everything to a global maximum length (user defined, tuned by experimenting), as done in main.py
      2. 2nd option: pad to the maximum length of each batch, as done in main_batch.py
    4. After padding, we need to turn the labels into a one-hot encoded matrix
  4. Create our Bi-LSTM model with 256 cells and Nesterov-momentum SGD, with masking to skip the padded sequences (see the model sketch after this list)
    • Since we feed the data through two language-model channels, 1-grams and 2-grams, we need 2 input layers and 2 embeddings, and we concatenate these embeddings, as shown in the figure below; the model implementation is in model.py
    • [Figure: Bi-LSTM model architecture]
  5. Train the model (see the training sketch after this list)
    1. If we took padding option #1: we train on the whole dataset with model.fit()
    2. If we took padding option #2: we train batch by batch with train_on_batch()
    • The model was trained using pretrained embeddings
    • Either way, I set EarlyStopping to avoid overfitting, ModelCheckpoint to save the best model every 5 epochs, as well as ReduceLROnPlateau so the model doesn't get stuck in a saddle point/plateau
  6. Evaluate the model
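
To make step 2 concrete, here is a minimal sketch of BIES tagging on the English example above, assuming punctuation has already been split into its own token (the real implementation lives in preprocessing.py):

```python
def bies_tags(sentence):
    """Tag each whitespace-separated word character-by-character with B/I/E/S."""
    tags = []
    for word in sentence.split():
        if len(word) == 1:
            tags.append('S')                                # single-character word
        else:
            tags.append('B' + 'I' * (len(word) - 2) + 'E')  # beginning..inside..end
    return ' '.join(tags)

print(bies_tags('The answer is 42 !'))  # -> BIE BIIIIE BE BE S
```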
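Step 3's padding and one-hot encoding map directly onto standard Keras utilities; the toy sequences, the MAX_LENGTH value, and the label indexing (0 reserved for padding, 1-4 for B/I/E/S) are assumptions for illustration:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

MAX_LENGTH = 50  # option 1: a user-defined global cap, tuned by experimenting

# toy integer-encoded sequences standing in for the real dataset
x_unigrams = [[4, 8, 15], [16, 23, 42, 7]]
y_labels = [[1, 2, 3], [1, 3, 4, 4]]  # 1=B, 2=I, 3=E, 4=S; 0 is the pad value

x_padded = pad_sequences(x_unigrams, maxlen=MAX_LENGTH, padding='post', value=0)
y_padded = pad_sequences(y_labels, maxlen=MAX_LENGTH, padding='post', value=0)
y_one_hot = to_categorical(y_padded, num_classes=5)  # pad class + B, I, E, S
```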
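A sketch of the two-channel model from step 4: two inputs, two embeddings concatenated, masking over the padded timesteps, a 256-cell Bi-LSTM, and Nesterov-momentum SGD. The vocabulary sizes, embedding dimension, momentum, and learning rate are placeholders; the actual implementation is in model.py:

```python
from tensorflow.keras.layers import (Input, Embedding, Concatenate, Bidirectional,
                                     LSTM, TimeDistributed, Dense)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

def build_model(uni_vocab_size, bi_vocab_size, embedding_dim=64, n_classes=5):
    uni_in = Input(shape=(None,), name='unigrams')
    bi_in = Input(shape=(None,), name='bigrams')
    # mask_zero=True makes downstream layers skip the padded timesteps
    uni_emb = Embedding(uni_vocab_size, embedding_dim, mask_zero=True)(uni_in)
    bi_emb = Embedding(bi_vocab_size, embedding_dim, mask_zero=True)(bi_in)
    merged = Concatenate()([uni_emb, bi_emb])
    hidden = Bidirectional(LSTM(256, return_sequences=True))(merged)
    out = TimeDistributed(Dense(n_classes, activation='softmax'))(hidden)
    model = Model([uni_in, bi_in], out)
    model.compile(optimizer=SGD(learning_rate=0.004, momentum=0.9, nesterov=True),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```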
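Finally, a sketch of step 5's training setup with the three callbacks. The checkpoint filename, patience values, vocabulary sizes, the bigram array x_padded_bigrams, and the batch_generator in the per-batch variant are all assumptions:

```python
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        ReduceLROnPlateau)

callbacks = [
    EarlyStopping(monitor='val_loss', patience=3),                  # avoid overfitting
    ModelCheckpoint('cws_best.h5', save_best_only=True),            # keep the best model
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2),  # escape plateaus
]

# Option 1: the whole dataset was padded up front -> a single fit() call
model = build_model(uni_vocab_size=5000, bi_vocab_size=50000)
model.fit([x_padded, x_padded_bigrams], y_one_hot,
          validation_split=0.2, epochs=30, batch_size=32, callbacks=callbacks)

# Option 2: pad per batch and drive training manually (callbacks don't apply here)
# for epoch in range(n_epochs):
#     for (bx_uni, bx_bi), by in batch_generator():  # hypothetical generator
#         model.train_on_batch([bx_uni, bx_bi], by)
```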

TODOs [the plan I followed]

Phase I [Preprocessing]

  • Download Chinese Datasets
  • Simplify Chinese Datasets, using Hanzi Conv
    • Install hanziconv >> pip install hanziconv
    • execute hanzi-convert -o 'OUTFILE_NAME' -s 'INFILE_NAME'
    • automate the conversion process (see the sketch after this list)
  • BIES format implementation on an English dataset
    • Produce input file (same as original file but with no whitespace)
    • Produce labels file, only labels
  • Replicate BIES for Chinese dataset
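
A minimal sketch of automating that conversion with the hanziconv Python API instead of the CLI; the file paths and function name are placeholders:

```python
from hanziconv import HanziConv

def simplify_file(infile, outfile):
    """Rewrite a traditional-Chinese dataset file as simplified Chinese."""
    with open(infile, encoding='utf-8') as src, \
         open(outfile, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(HanziConv.toSimplified(line))

simplify_file('icwb2-data/training/as_training.utf8', 'as_training_simplified.utf8')
```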

Phase II [Dataset handling]

  • Implement Read dataset method
  • Implement Generate Labels method
  • Implement Generate Batches method
  • Reformulate the code, to keep a nice tidy structure

Phase III [Models]

  • Implement Char embeddings
    • Implement keras unigrams and bigrams embeddings
    • Use pretrained embeddings ['GloVe']
  • Implement paper 1st model (Bi-LSTM: unstacked) using keras
  • Implement paper 2nd model (Bi-LSTM: stacked) using keras
  • Split data, train-dev: 80-20
  • Tensorboard variables
  • Save & load model weights functionality
  • Train the model(s)
    • Loss & accuracy functions that mask the zeros added by pad_sequences (see the masked-metric sketch after this list)
  • plot the models
  • break sentences of length more than max_length into list of sentences
  • Use dev set to evaluate the model
    • Read the dataset
    • Test the implementation
  • Concatenate all the datasets
    • for training
    • for dev
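
For the masked loss & accuracy item above, here is a sketch of an accuracy metric that ignores the zero timesteps introduced by pad_sequences, assuming one-hot labels with index 0 reserved for padding:

```python
import tensorflow.keras.backend as K

def masked_accuracy(y_true, y_pred):
    """Accuracy over real timesteps only; padded positions (class 0) don't count."""
    true_ids = K.argmax(y_true, axis=-1)
    pred_ids = K.argmax(y_pred, axis=-1)
    mask = K.cast(K.not_equal(true_ids, 0), K.floatx())
    matches = K.cast(K.equal(true_ids, pred_ids), K.floatx()) * mask
    return K.sum(matches) / K.maximum(K.sum(mask), 1.0)
```

It plugs in through the metrics argument, e.g. model.compile(..., metrics=[masked_accuracy]).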

Phase IV [Predict]

  • Construct a pipeline for the whole code
  • Save vocab dict to json file
  • Implement predict function (see the decoding sketch after this list)
    • Test implementation of predict function
  • Try using score.py
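
A sketch of the decoding half of the predict function: collapsing the softmax output back into BIES strings. The index-to-tag mapping and the lengths argument (true, unpadded sentence lengths) are assumptions:

```python
import numpy as np

IDX2LABEL = {1: 'B', 2: 'I', 3: 'E', 4: 'S'}  # assumed mapping; 0 = padding

def decode_predictions(probs, lengths):
    """Turn model output of shape (batch, time, classes) into BIES strings."""
    tag_ids = np.argmax(probs, axis=-1)
    return [''.join(IDX2LABEL.get(int(t), 'S') for t in row[:n])
            for row, n in zip(tag_ids, lengths)]
```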

Phase V [Hyperparameters tuning]

  • Tune the hyperparameters
    • Apply a manual GridSearchCV-style search to find the best params (see the sketch after this list)
      • Learning rate [0.003, 0.004]
      • L2 Regularizer
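
The manual grid search can be a plain nested loop over the candidates. Here, build_model is a hypothetical variant of the earlier model sketch extended to accept a learning rate and an L2 strength, the L2 values are placeholders, and the train/dev arrays stand for the 80-20 split from Phase III:

```python
param_grid = {'learning_rate': [0.003, 0.004], 'l2_reg': [1e-4, 1e-5]}

best_params, best_acc = None, 0.0
for lr in param_grid['learning_rate']:
    for l2 in param_grid['l2_reg']:
        model = build_model(uni_vocab_size, bi_vocab_size,
                            learning_rate=lr, l2_reg=l2)
        model.fit([x_train_uni, x_train_bi], y_train,
                  validation_data=([x_dev_uni, x_dev_bi], y_dev),
                  epochs=10, verbose=0)
        _, acc = model.evaluate([x_dev_uni, x_dev_bi], y_dev, verbose=0)
        if acc > best_acc:
            best_params, best_acc = (lr, l2), acc

print('best params:', best_params, 'dev accuracy:', best_acc)
```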

Enhancements

  • preprocessing.bies_format is overcomplicated
  • preprocessing.text_bies_format is overcomplicated
