This project embraces Python 3 and TensorFlow 2.x, along with Attention, Transformer, and PTMs (pre-trained models).
The directory structure of this framework mirrors TensorFlow 2.x and contains the following subpackages:
- layers
- math
- models
- text
- activations
- callbacks
- initializers
- losses
- metrics
- optimizers
- preprocessing
- utils
This project depends only on tensorflow 2.x and tensorflow-addons. It currently supports models such as BERT, RoBERTa, ALBERT, NEZHA, and GPT, and ships a CRF layer (backed by tensorflow-addons), Normalization, several mask-aware Pooling layers, Embeddings (including various position embeddings and HybridEmbedding), common activation functions, Metrics, and more.
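To make "mask-aware Pooling" concrete, here is a minimal sketch of masked mean pooling written directly against the Keras API; the class name and implementation details are hypothetical and are not taken from this project's source.

```python
import tensorflow as tf

class MaskedGlobalAveragePooling(tf.keras.layers.Layer):
    """Mean-pool over the time axis while ignoring padded positions.

    Purely illustrative; not necessarily the layer shipped by tf2bert.
    """

    def call(self, inputs, mask=None):
        # inputs: (batch, seq_len, hidden); mask: (batch, seq_len) booleans,
        # typically propagated from an Embedding layer with mask_zero=True.
        if mask is None:
            return tf.reduce_mean(inputs, axis=1)
        mask = tf.cast(mask, inputs.dtype)[:, :, tf.newaxis]
        total = tf.reduce_sum(inputs * mask, axis=1)
        count = tf.reduce_sum(mask, axis=1) + 1e-12
        return total / count
```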
A quick summary of the characteristics of common PTMs:
Model | Characteristics |
---|---|
BERT | Stacked Transformer encoders; the classic trainable PositionEmbedding; MLM + NSP objectives; Byte Pair Encoding tokenizer; the Chinese version introduces WWM (Whole Word Masking) |
ALBERT | Factorized Embedding Parameterization; cross-layer parameter sharing (which can be viewed as a form of regularization); introduces Sentence Order Prediction (SOP) |
RoBERTa | Chinese WWM (Whole Word Masking) strategy; dynamic masking; Byte Pair Encoding tokenizer; drops NSP and keeps only MLM; larger corpora; longer training sequences |
ERNIE | Adds phrase-level and entity-level masking to the masking strategy, injecting entity prior knowledge into the model |
NEZHA | Switches to the classic relative PositionEmbedding; uses the LAMB optimizer to speed up training |
GPT | Stacked Transformer decoders; language-model objective; no LN after the embedding layers are summed |
GPT2 | More parameters and larger capacity; LN moved to the input of each sub-block; an additional LN after the final Attention block; segment input removed |
GPT2ML | Multilingual support; simplified and cleaned-up GPT2 training |
+LM | Applies a lower-triangular (causal) Mask, used for language modeling |
+UniLM | Builds a segment-based lower-triangular Mask so that BERT can handle Seq2Seq tasks. The idea: bidirectional Attention over the input part, unidirectional Attention over the output part (see the sketch below) |
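As a concrete illustration of the +UniLM row above, here is a minimal sketch of how such a segment-based mask can be derived from segment ids using plain TensorFlow; it is not the exact implementation used in this project.

```python
import tensorflow as tf

def unilm_mask(segment_ids):
    """UniLM-style attention mask from segment ids (0 = input, 1 = output).

    Position i may attend to position j iff j is in the input segment, or
    j is an output position no later than i. Returns (batch, seq, seq)
    with 1.0 = attend, 0.0 = blocked. Illustrative sketch only.
    """
    idxs = tf.cumsum(tf.cast(segment_ids, tf.float32), axis=1)  # (batch, seq)
    query = idxs[:, :, tf.newaxis]   # (batch, seq, 1), query positions
    key = idxs[:, tf.newaxis, :]     # (batch, 1, seq), key positions
    return tf.cast(key <= query, tf.float32)
```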
Why do pre-trained models work so well?
- Large-scale, high-quality unsupervised data
- Well-designed self-supervised (unsupervised) learning objectives
- The SelfAttention-based Transformer is a feature extractor with very little built-in inductive bias
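As a concrete example of the second point, below is a simplified, purely illustrative sketch of MLM-style dynamic masking (the on-the-fly corruption RoBERTa uses), following the 80/10/10 recipe from the original BERT paper; the helper name and signature are made up for illustration.

```python
import numpy as np

def dynamic_mlm_mask(token_ids, mask_id, vocab_size, mask_rate=0.15, rng=None):
    """Randomly corrupt ~15% of tokens for MLM training (illustrative sketch).

    Of the selected positions: 80% become [MASK], 10% a random token,
    10% stay unchanged. Returns (corrupted_ids, labels) where labels hold
    the original ids at selected positions and -1 elsewhere.
    """
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids)
    corrupted = token_ids.copy()
    labels = np.full_like(token_ids, -1)

    selected = rng.random(token_ids.shape) < mask_rate
    labels[selected] = token_ids[selected]

    dice = rng.random(token_ids.shape)
    corrupted[selected & (dice < 0.8)] = mask_id
    random_pos = selected & (dice >= 0.8) & (dice < 0.9)
    corrupted[random_pos] = rng.integers(0, vocab_size, size=int(random_pos.sum()))
    return corrupted, labels
```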
Dependencies: tensorflow 2.x and tensorflow-addons. Because the project is updated frequently, it is not published to pip; adding it to PYTHONPATH is recommended instead.
Clone the project into {your_path}:

```sh
git clone https://github.com/allenwind/tf2bert.git
```

When you need to update, simply run git pull to fetch the latest source, or delete the existing copy and git clone it again.
Open .bashrc and add the project path to the PYTHONPATH environment variable:

```sh
export PYTHONPATH={your_path}/tf2bert:$PYTHONPATH
```

Then reload the shell configuration:

```sh
source ~/.bashrc
```
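As an optional sanity check, you can verify the package is now visible on the path (this assumes the top-level package is importable as tf2bert, as in the example below):

```sh
python -c "import tf2bert; print(tf2bert.__file__)"
```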
A simple example:

```python
import numpy as np
from tf2bert.text.tokenizers import Tokenizer
from tf2bert.text.utils import load_sentences
from tf2bert.models import build_transformer

# paths to the pre-trained Chinese BERT-base checkpoint
config_path = "bert/chinese_L-12_H-768_A-12/bert_config.json"
checkpoint_path = "bert/chinese_L-12_H-768_A-12/bert_model.ckpt"
token_dict_path = "bert/chinese_L-12_H-768_A-12/vocab.txt"

tokenizer = Tokenizer(token_dict_path)
# build a BERT encoder and load the checkpoint weights
model = build_transformer(
    model="bert+encoder",
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    verbose=True
)

for sentence in load_sentences():
    token_ids, segment_ids = tokenizer.encode(sentence)
    token_ids = np.array([token_ids])
    segment_ids = np.array([segment_ids])
    features = model.predict([token_ids, segment_ids])
    print(sentence)
    print(features.shape)
    print(features)
```
More examples can be found in the nlptasks and tests directories.
BERT/RoBERTa:
- brightmart's roberta: https://github.com/brightmart/roberta_zh
- ymcui's roberta: https://github.com/ymcui/Chinese-BERT-wwm
- Google's bert: https://github.com/google-research/bert
ALBERT:
- brightmart's albert: https://github.com/brightmart/albert_zh
- Google's original albert: https://github.com/google-research/ALBERT
NEZHA:
GPT/GPT2/GPT2ML:
XLNet:
- ymcui's XLNet: https://github.com/ymcui/Chinese-XLNet
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Pre-Training with Whole Word Masking for Chinese BERT
RoBERTa: A Robustly Optimized BERT Pretraining Approach
NEZHA: Neural Contextualized Representation for Chinese Language Understanding
Unified Language Model Pre-training for Natural Language Understanding and Generation
Are Pre-trained Convolutions Better than Pre-trained Transformers?
Synthesizer: Rethinking Self-Attention in Transformer Models