Additive/concat Attention
- Resources: [Paper], [Illustrative Intro], [TF2 Implementation]
Multiplicative Attention
- Resources: [Paper], [Illustrative Intro], [TF2 Implementation]
Multi-head Self Attention / Transformer
- Summary: HuggingFace Tokenizer Summary
- Implementation: HuggingFace Tokenizer, Google SentencePiece
- Unigram Language Model (ULM)
  - assumes all subword occurrences are independent, so the probability of a subword sequence is the product of the subword occurrence probabilities
  - optimizes for whole-sentence likelihood; the most probable segmentation is found with the Viterbi algorithm (see the ULM sketch after this list)
  - both WP and ULM leverage a language model to build the subword vocabulary
- Byte Pair Encoding (BPE)
  - starts from the character level and repeatedly merges the most frequent adjacent pair into a new subword, until the desired vocabulary size is reached or the highest pair frequency drops to 1 (see the BPE sketch after this list)
  - used in GPT-2 and RoBERTa; see the Git Issue for implementation details
  - `tokenizers.CharBPETokenizer`: `OpenAIGPTTokenizerFast`
  - `tokenizers.ByteLevelBPETokenizer`: `GPT2TokenizerFast`, `RobertaTokenizerFast`, `LongformerTokenizerFast`
- WordPiece (WP)
  - similar to BPE but "choose the new word unit out of all possible ones that increase the likelihood on the training data the most when added to the model"
  - define log P(sentence) = Σ log P(token_i); when merging adjacent tokens x and y into z, the change in likelihood is log P(token_z) - (log P(token_x) + log P(token_y)) (see the WordPiece scoring sketch after this list)
  - `tokenizers.BertWordPieceTokenizer`: `BertTokenizerFast`, `DistilBertTokenizerFast`, `ElectraTokenizerFast`, `RetriBertTokenizerFast`, `MobileBertTokenizerFast`
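A minimal sketch of the ULM segmentation step, assuming a fixed subword vocabulary with made-up unigram probabilities; the Viterbi dynamic program picks the segmentation with the highest total log-probability:

```python
import math

# Toy subword vocabulary with made-up unigram probabilities (illustration only).
log_p = {w: math.log(p) for w, p in {
    "un": 0.08, "related": 0.02, "unrelated": 0.005,
    "u": 0.03, "n": 0.04, "r": 0.03, "e": 0.05, "l": 0.02,
    "a": 0.06, "t": 0.05, "d": 0.03,
}.items()}

def viterbi_segment(text, log_p):
    """Return the subword segmentation that maximizes the sum of unigram log-probabilities."""
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last subword ending at i
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_p and best[start] + log_p[piece] > best[end]:
                best[end] = best[start] + log_p[piece]
                back[end] = start
    pieces, end = [], n              # follow back-pointers to recover the best path
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return list(reversed(pieces)), best[n]

print(viterbi_segment("unrelated", log_p))
```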
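A minimal sketch of the BPE merge loop described above, on a toy word-frequency corpus pre-split into characters; the corpus, target vocabulary size, and stopping rule mirror the description rather than any particular library implementation:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    merged = {}
    for word, freq in corpus.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

# Toy word-frequency corpus, pre-split into characters (made-up numbers).
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
target_vocab_size = 15
vocab = {ch for word in corpus for ch in word.split()}

while len(vocab) < target_vocab_size:
    pairs = get_pair_counts(corpus)
    best_pair, best_count = pairs.most_common(1)[0]
    if best_count <= 1:   # stop when the most frequent pair occurs only once
        break
    corpus = merge_pair(best_pair, corpus)
    vocab.add("".join(best_pair))

print(sorted(vocab))
```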
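A minimal sketch of the WordPiece merge criterion, scoring each candidate pair by the likelihood change defined above with count-based probability estimates; the token and pair counts are made up:

```python
import math
from collections import Counter

# Toy token and adjacent-pair counts (made-up numbers, not from real data).
token_counts = Counter({"un": 12, "break": 7, "##break": 4, "##able": 9})
pair_counts = Counter({("un", "##break"): 4, ("##break", "##able"): 3, ("un", "break"): 1})
n_tokens = sum(token_counts.values())

def merge_gain(pair):
    """Per-occurrence change in log-likelihood from merging x and y into z = x + y,
    with probabilities estimated from counts: log P(z) - (log P(x) + log P(y))."""
    x, y = pair
    return (math.log(pair_counts[pair] / n_tokens)
            - math.log(token_counts[x] / n_tokens)
            - math.log(token_counts[y] / n_tokens))

# WordPiece adds the candidate pair with the largest likelihood gain; since n_tokens is
# the same for every candidate, this ranking matches count(xy) / (count(x) * count(y)).
best = max(pair_counts, key=merge_gain)
print(best, merge_gain(best))
```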
Google Neural Machine Translation System
Concepts applied: Additive/concat attention, Residual connection, Vanilla dropout (a minimal sketch follows the resources below)
Resources: [Paper], [Illustrative Intro], [TF2 Implementation], [Torch Implementation]
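A hedged Keras sketch of how these pieces can fit together in a GNMT-like deep encoder; the layer count, dimensions, and dropout rate are illustrative assumptions rather than the paper's settings, and the attention/decoder side is only pointed at in a comment:

```python
import tensorflow as tf

def gnmt_style_encoder(vocab_size=8000, dim=256, num_layers=4, dropout=0.2):
    """Deep LSTM encoder with residual connections and vanilla dropout; all sizes
    here are made-up defaults, not the settings used in the GNMT paper."""
    tokens = tf.keras.Input(shape=(None,), dtype="int32")
    x = tf.keras.layers.Embedding(vocab_size, dim)(tokens)
    # First layer is bidirectional; project back to `dim` so the residual
    # additions below are shape-compatible.
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(dim, return_sequences=True))(x)
    x = tf.keras.layers.Dense(dim)(x)
    for _ in range(num_layers - 1):
        h = tf.keras.layers.LSTM(dim, return_sequences=True)(x)
        h = tf.keras.layers.Dropout(dropout)(h)  # vanilla dropout on layer outputs
        x = tf.keras.layers.Add()([x, h])        # residual connection around the LSTM
    return tf.keras.Model(tokens, x)

# Additive (Bahdanau-style) attention between decoder queries and these encoder
# states could be added with tf.keras.layers.AdditiveAttention on the decoder side.
encoder = gnmt_style_encoder()
encoder.summary()
```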
BERT: Bidirectional Encoder Representations from Transformers
Resources: [Paper]
Conditional Random Field
Resources: [Introduction to CRF], [CRF vs MRF], [CRF for Multi-label Classification], [Tensorflow CRF]
Bi-LSTM CRF
Resources: [Paper], [TF1.0 Implementation by Scofield]
Label Attention Network
Resources: [Paper], [Torch Implementation by Author]
Transformer Training
Pre-Layer Normalization Transformer: [Paper]
Training Tips for Transformer: [Paper]
Recurrent Neural Network Normalization
Resources: [Methodology Overview], [Layer Normalization]
Experience: use `BatchNormalization` or `LayerNormalization` after each RNN layer
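A minimal Keras sketch of that placement; the feature size, unit counts, and output layer are arbitrary choices for illustration:

```python
import tensorflow as tf

# LayerNormalization applied after each recurrent layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 64)),   # (time steps, features)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```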
Recurrent Neural Network Dropout
Resources: [Methodology Overview], [Vanilla Dropout], [Variational Dropout], [Recurrent Dropout]
Experience: set the dropout ratio between 0.1 and 0.3; begin with vanilla dropout
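A minimal Keras sketch of that setup; the 0.2 ratio sits inside the suggested range, and all layer sizes are arbitrary illustration values:

```python
import tensorflow as tf

# Vanilla dropout applied after each recurrent layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 64)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
# Variational / recurrent dropout would instead use the LSTM layer's
# `dropout` and `recurrent_dropout` arguments.
model.summary()
```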