NLP Notes

NLP learning notes covering classic papers, algorithm implementations, and modeling tricks.

Attention

Additive/concat Attention

Multiplicative Attention

Multi-head Self Attention / Transformer
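
The two classical scoring functions differ only in how the query is compared with each key. Below is a minimal numpy sketch, assuming a single query vector attended over a sequence of keys/values; the function and variable names are illustrative, not taken from any linked implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiplicative_attention(query, keys, values):
    """Luong-style (dot-product) attention: score_t = query . key_t."""
    scores = keys @ query                                  # (T,)
    weights = softmax(scores)                              # (T,)
    return weights @ values, weights                       # context vector, attention weights

def additive_attention(query, keys, values, W_q, W_k, v):
    """Bahdanau-style (additive/concat) attention: score_t = v . tanh(W_q q + W_k k_t)."""
    scores = np.tanh(keys @ W_k.T + query @ W_q.T) @ v     # (T,)
    weights = softmax(scores)
    return weights @ values, weights
```

Multi-head self-attention in the Transformer applies the multiplicative (scaled dot-product) form in parallel over several learned projections of the same sequence.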

Subword Tokenization

  • Summary: HuggingFace Tokenizer Summary
  • Implementation: HuggingFace Tokenizer, Google SentencePiece
  • Unigram Language Model (ULM)
    • assumes all subword occurrences are independent, so the probability of a subword sequence is the product of the subword occurrence probabilities
    • optimizes the likelihood of the whole sentence; the best segmentation is found with the Viterbi algorithm (see the Viterbi sketch after this list)
    • both WP and ULM leverage a language model to build the subword vocabulary
  • Byte Pair Encoding (BPE)
    • starts at the character level and repeatedly merges the highest-frequency adjacent pair into a new subword, until the desired vocabulary size is reached or the highest pair frequency drops to 1 (see the merge-loop sketch after this list)
    • used in GPT-2 and RoBERTa; see the linked GitHub issue for an implementation
    • tokenizers.CharBPETokenizer: OpenAIGPTTokenizerFast
    • tokenizers.ByteLevelBPETokenizer: GPT2TokenizerFast, RobertaTokenizerFast, LongformerTokenizerFast
  • WordPiece (WP)
    • similar to BPE, but "choose the new word unit out of all possible ones that increase the likelihood on the training data the most when added to the model"
      • define log P(sentence) = Σ log P(token_i);
        when adjacent tokens x and y are merged into z,
        the change in likelihood is log P(token_z) − (log P(token_x) + log P(token_y)),
        i.e. WordPiece picks the merge that maximizes P(token_z) / (P(token_x) · P(token_y))
    • tokenizers.BertWordPieceTokenizer: BertTokenizerFast, DistilBertTokenizerFast, ElectraTokenizerFast, RetriBertTokenizerFast, MobileBertTokenizerFast
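
A minimal sketch of the BPE merge loop described above, operating on a toy word-frequency dictionary. It starts from characters and ignores the end-of-word and byte-level details that real implementations such as the GPT-2 tokenizer handle; function names are illustrative. WordPiece would keep the same loop but replace the raw pair frequency with the likelihood gain defined above.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs over a corpus of {tuple_of_symbols: word_frequency}."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(word_freqs, target_vocab_size):
    """word_freqs: {"low": 5, "lower": 2, ...}; returns the ordered list of merges."""
    corpus = {tuple(w): f for w, f in word_freqs.items()}    # start from characters
    vocab = {c for w in corpus for c in w}
    merges = []
    while len(vocab) < target_vocab_size:
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best, best_freq = pairs.most_common(1)[0]
        if best_freq <= 1:                                   # stop when the best pair occurs only once
            break
        corpus = merge_pair(corpus, best)
        merges.append(best)
        vocab.add(best[0] + best[1])
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, target_vocab_size=20)
```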
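
And a sketch of the ULM decoding step: given fixed subword log-probabilities, Viterbi-style dynamic programming finds the segmentation with the highest total log-probability. Fitting the probabilities themselves (e.g. with EM, as SentencePiece does) is not shown, and the names and `max_len` cap are assumptions.

```python
import math

def viterbi_segment(text, subword_logprob, max_len=10):
    """Best segmentation of `text` under a unigram language model.

    subword_logprob: dict mapping subword -> log probability.
    Returns (best_tokens, total_log_prob).
    """
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log prob of a segmentation of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last token in that segmentation
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in subword_logprob and best[j] + subword_logprob[piece] > best[i]:
                best[i] = best[j] + subword_logprob[piece]
                back[i] = j
    tokens, i = [], n
    while i > 0:                   # follow back-pointers to recover the tokens
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]
```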

Industrial Application

Google Neural Machine Translation System

Concepts applied: Additive/concat attention, Residual connections, Vanilla dropout
Resources: [Paper][Illustrative Intro][TF2 Implementation][Torch Implementation]
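
A minimal Keras sketch of how those three concepts fit together in a GNMT-style encoder stack: stacked LSTMs, residual connections between the upper layers, and vanilla dropout on layer outputs. It omits the bidirectional bottom layer and the attention-coupled decoder, and the sizes are illustrative, not the paper's.

```python
import tensorflow as tf

def stacked_lstm_encoder(vocab_size, units=512, num_layers=4, dropout_rate=0.2):
    """Simplified GNMT-flavoured encoder: stacked LSTMs + residuals + dropout."""
    tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)
    x = tf.keras.layers.Embedding(vocab_size, units)(tokens)
    for i in range(num_layers):
        h = tf.keras.layers.LSTM(units, return_sequences=True)(x)
        h = tf.keras.layers.Dropout(dropout_rate)(h)            # vanilla dropout on layer outputs
        x = h if i == 0 else tf.keras.layers.Add()([x, h])      # residual connection on upper layers
    return tf.keras.Model(tokens, x)
```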

BERT: Bidirectional Encoder Representations from Transformers

Resources: [Paper]

Probabilistic Graph

Conditional Random Field

Resources: [Introduction to CRF][CRF vs MRF][CRF for Multi-label Classification][Tensorflow CRF]

Bi-LSTM CRF

Resources: [Paper][TF1.0 Implementation by Scofield]
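
For the decoding side of a Bi-LSTM CRF, here is a minimal numpy sketch of Viterbi decoding over per-token emission scores (the Bi-LSTM outputs) and a tag-transition matrix. Training, which maximizes the CRF log-likelihood (e.g. via tfa.text.crf_log_likelihood in TensorFlow Addons), is not shown; function and argument names are illustrative.

```python
import numpy as np

def crf_viterbi_decode(emissions, transitions):
    """Most likely tag sequence for a linear-chain CRF.

    emissions: (T, num_tags) per-token scores, e.g. Bi-LSTM outputs.
    transitions: (num_tags, num_tags) score of moving from tag i to tag j.
    """
    T, num_tags = emissions.shape
    score = emissions[0].copy()                      # best score ending in each tag at step 0
    backpointers = np.zeros((T, num_tags), dtype=int)
    for t in range(1, T):
        # candidate[i, j] = best path ending in tag i at t-1, then transitioning to tag j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    tags = [int(score.argmax())]                     # best final tag
    for t in range(T - 1, 0, -1):                    # follow back-pointers
        tags.append(int(backpointers[t, tags[-1]]))
    return tags[::-1], float(score.max())
```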

Label Attention Network

Resources: [Paper][Torch Implementation by Author]

Modeling Tricks

Transformer Training

Pre-Layer Normalization Transformer: [Paper]
Training Tips for Transformer: [Paper]
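
A sketch of the pre-layer-normalization ordering the first paper advocates, written as a Keras layer for concreteness: LayerNorm is moved inside the residual branch, before attention and the feed-forward network, rather than after the residual addition. Sizes and the dropout rate are illustrative assumptions.

```python
import tensorflow as tf

class PreLNTransformerBlock(tf.keras.layers.Layer):
    """Pre-LN Transformer block: normalize, then attend / feed forward, then add residual."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.drop = tf.keras.layers.Dropout(dropout)

    def call(self, x, training=False):
        h = self.norm1(x)                                            # pre-LN before self-attention
        x = x + self.drop(self.attn(h, h, h), training=training)
        h = self.norm2(x)                                            # pre-LN before feed-forward
        x = x + self.drop(self.ffn(h), training=training)
        return x
```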

Recurrent Neural Network Normalization

Resources: [Methodology Overview][Layer Normalization]
Experience: use BatchNormalization or LayerNormalization after each RNN layer
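
A minimal Keras sketch of that tip, with LayerNormalization inserted after each recurrent layer; the vocabulary size, layer widths, and the binary-classification head are illustrative assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=128),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.LayerNormalization(),        # normalize after the first RNN layer
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.LayerNormalization(),        # and after the second
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```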

Recurrent Neural Network Dropout

Resources: [Methodology Overview][Vanilla Dropout][Variational Dropout][Recurrent Dropout]
Experience: set the dropout rate between 0.1 and 0.3 and begin with vanilla dropout
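
A sketch of the two common placements, under the same illustrative sizes as above: vanilla dropout drops layer outputs between layers, while Keras's `dropout` / `recurrent_dropout` arguments apply dropout to the LSTM inputs and recurrent state respectively.

```python
import tensorflow as tf

# Vanilla dropout: Dropout layers on the outputs between recurrent layers.
vanilla = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dropout(0.2),                # rate chosen from the 0.1-0.3 range above
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Alternative: dropout on the LSTM inputs and on the recurrent state.
recurrent = tf.keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2)
```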
