- Replicate the SOTA paper for Chinese word segmentation (stacked or unstacked Bi-LSTM model)
- Get requirements: `pip install -r requirements.txt`
- Download the dataset: `wget http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip`
- Download the pretrained embeddings: `wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.zh.vec`
- Read configuration files
- Read the dataset and preprocess it (`preprocessing.py`)
- Create the input file: the same dataset but with no spaces between words
- Create the labels file: the BIES tags, one line of labels per input line (see the sketch below)
- BIES format >> B: Beginning, I: Inside, E: End, S: Single
- "The answer is 42!" -> "BIE BIIIIE BE BE S"
- Process the dataset (`dataset_handling.py`)
- Get the unique unigrams & bigrams to create our vocabulary
- Save the vocabulary in JSON format for later use
- Get the labels of our classes (B, I, E, S)
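A sketch of the vocabulary step; the reserved ids and the `min_count` cutoff are assumptions, not necessarily what `dataset_handling.py` does:

```python
import json
from collections import Counter

def build_vocab(lines, min_count=1):
    """Collect the unique unigrams & bigrams; 0 is padding, 1 is unknown."""
    counts = Counter()
    for line in lines:
        counts.update(line)                                         # unigrams
        counts.update(line[i:i + 2] for i in range(len(line) - 1))  # bigrams
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for gram, count in counts.items():
        if count >= min_count:
            vocab[gram] = len(vocab)
    return vocab

label_ids = {"B": 1, "I": 2, "E": 3, "S": 4}   # 0 stays reserved for padding

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(build_vocab(["今天天气很好"]), f, ensure_ascii=False)
```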
- We need to pad the data before passing it to the model
- 1st option: pad to a user-defined maximum length (tuned by experimenting), as in `main.py`
- 2nd option: pad to the maximum length of each batch, as in `main_batch.py`
- After padding, convert the labels to a one-hot encoded matrix (see the sketch below)
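A sketch of padding option #1 plus the one-hot step, assuming the classic Keras 2 API and that label id 0 is reserved for padding so it can be masked later:

```python
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_LEN = 50                                  # user defined, tuned by experimenting
encoded_sentences = [[5, 9, 9, 12], [7, 3]]   # toy id sequences
encoded_labels    = [[1, 2, 3, 4], [1, 3]]    # B=1, I=2, E=3, S=4; 0 = padding

x = pad_sequences(encoded_sentences, maxlen=MAX_LEN, padding="post", value=0)
y = pad_sequences(encoded_labels, maxlen=MAX_LEN, padding="post", value=0)
y = to_categorical(y, num_classes=5)          # one-hot matrix, shape (2, 50, 5)
```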
- Create our Bi-LSTM model with 256 cells, SGD with Nesterov momentum, and masking for the padded sequences (sketched below)
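A sketch of the unstacked variant; apart from the 256 units, the Nesterov-momentum SGD, and the masking, every value below (embedding size, learning rate, vocabulary size) is an assumption:

```python
from keras.layers import (Bidirectional, Dense, Embedding, Input, LSTM,
                          TimeDistributed)
from keras.models import Model
from keras.optimizers import SGD

VOCAB_SIZE, EMB_DIM, MAX_LEN, N_CLASSES = 20000, 64, 50, 5

inputs = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)  # masks the padding ids
x = Bidirectional(LSTM(256, return_sequences=True))(x)      # 256 cells per direction
outputs = TimeDistributed(Dense(N_CLASSES, activation="softmax"))(x)

model = Model(inputs, outputs)
model.compile(optimizer=SGD(lr=0.003, momentum=0.9, nesterov=True),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

The stacked variant would simply add a second `Bidirectional(LSTM(256, return_sequences=True))` layer before the output.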
- Train the model
- If we took padding option #1: we train on the whole data with `model.fit()`
- If we took padding option #2: we train batch by batch with `train_on_batch()`
- The model was trained using the pretrained embeddings
- Either way, I set `EarlyStopping` to avoid overfitting, `ModelCheckpoint` to save the best model every 5 epochs, and `ReduceLROnPlateau` so the model doesn't get stuck in a saddle point/shoulder (sketched below)
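A sketch of the callbacks setup; the patience values, factor, and file name are assumptions (Keras 2 API, where `period` controls the checkpoint interval):

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    ModelCheckpoint("best_model.h5", monitor="val_loss",
                    save_best_only=True, period=5),          # check every 5 epochs
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
]
# Option #1: model.fit(x, y, validation_split=0.2, epochs=50, callbacks=callbacks)
```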
- Evaluate the model
- Download Chinese Datasets
- Simplify the Chinese datasets using `hanziconv`
- Install hanziconv >> `pip install hanziconv`
- Execute `hanzi-convert -o 'OUTFILE_NAME' -s 'INFILE_NAME'`
- Automate the process (see the sketch below)
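A sketch of the automation using `hanziconv`'s Python API instead of the CLI; the file names are just examples (the AS and CityU corpora in icwb2-data are the traditional-Chinese ones):

```python
from hanziconv import HanziConv

def simplify_file(infile, outfile):
    """Convert a traditional-Chinese file to simplified, line by line."""
    with open(infile, encoding="utf-8") as src, \
         open(outfile, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(HanziConv.toSimplified(line))

for name in ["training/as_training.utf8", "training/cityu_training.utf8"]:
    simplify_file(name, name.replace(".utf8", "_simplified.utf8"))
```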
- BIES format implementation on an English dataset
- Produce the input file (same as the original file but with no whitespace)
- Produce the labels file (labels only)
- Replicate BIES for Chinese dataset
- Implement the read-dataset method
- Implement the generate-labels method
- Implement the generate-batches method (see the generator sketch below)
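An illustrative batch generator for padding option #2; each batch is padded to its own longest sentence (the names are assumptions, not the exact ones in the repo):

```python
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def generate_batches(sentences, labels, batch_size=32):
    """Yield batches padded to each batch's own longest sentence."""
    while True:
        for i in range(0, len(sentences), batch_size):
            x = pad_sequences(sentences[i:i + batch_size], padding="post")
            y = pad_sequences(labels[i:i + batch_size], padding="post")
            yield x, to_categorical(y, num_classes=5)
```

The generator can feed `fit_generator()`, or batches can be pulled from it manually for `train_on_batch()`.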
- Refactor the code to keep a tidy structure
- Implement character embeddings
- Implement `keras` unigram and bigram embeddings
- Use pretrained embeddings (GloVe)
- Implement the paper's 1st model (unstacked Bi-LSTM) using `keras`
- Implement the paper's 2nd model (stacked Bi-LSTM) using `keras`
- Split data, train-dev: 80-20
- `TensorBoard` variables
- Save & load model weights functionality
- Train the model(s)
- Loss & accuracy functions that mask the zeros added by `pad_sequences` (see the sketch below)
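A sketch of such a masked metric, assuming class 0 is the padding id (the loss can be masked the same way):

```python
from keras import backend as K

def masked_accuracy(y_true, y_pred):
    """Accuracy that ignores timesteps whose true class is the padding id 0."""
    true_ids = K.argmax(y_true, axis=-1)
    pred_ids = K.argmax(y_pred, axis=-1)
    mask = K.cast(K.not_equal(true_ids, 0), K.floatx())
    matches = K.cast(K.equal(true_ids, pred_ids), K.floatx()) * mask
    return K.sum(matches) / K.maximum(K.sum(mask), 1.0)

# model.compile(..., metrics=[masked_accuracy])
```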
- Plot the models
- Break sentences longer than max_length into a list of shorter sentences (sketched below)
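A minimal sketch of the splitting (the helper name is illustrative):

```python
def split_long(sentence, max_length):
    """Break an over-long sentence into max_length-sized chunks."""
    return [sentence[i:i + max_length]
            for i in range(0, len(sentence), max_length)]

print(split_long("abcdefgh", 3))  # ['abc', 'def', 'gh']
```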
- Use dev set to evaluate the model
- Read the dataset
- Test the implementation
- Concatenate all the datasets
- for training
- for dev
- Construct a pipeline for the whole code
- Save the vocab dict to a JSON file
- Implement the predict function (sketched below)
- Test the predict function implementation
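An illustrative predict function under the same label-id assumptions as above (the name and signature are not necessarily the repo's):

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

ID2LABEL = {1: "B", 2: "I", 3: "E", 4: "S"}   # 0 is the padding class

def predict_bies(model, vocab, sentence, max_length=50):
    """Encode a raw sentence, pad it, and decode one BIES label per character."""
    ids = [vocab.get(ch, vocab["<UNK>"]) for ch in sentence]
    x = pad_sequences([ids], maxlen=max_length, padding="post")
    probs = model.predict(x)[0][:len(sentence)]
    # argmax over the four real classes only, skipping the padding class 0
    return "".join(ID2LABEL[int(np.argmax(p[1:])) + 1] for p in probs)
```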
- Try using `score.py`
- Tune the hyperparameters
- Apply manual `GridSearchCV` to find the best params
- Learning rate [0.003, 0.004]
- L2 regularizer
- `preprocessing.bies_format` is overcomplicated
- `preprocessing.text_bies_format` is overcomplicated