tf2bert

This project embraces Python 3 and TensorFlow 2.x, together with Attention, Transformers, and PTMs (pre-trained models).

Introduction

The framework's directory structure mirrors that of TensorFlow 2.x and contains the following sub-packages:

  • layers
  • math
  • models
  • text
  • activations
  • callbacks
  • initializers
  • losses
  • metrics
  • optimizers
  • preprocessing
  • utils

The project depends only on tensorflow 2.x and tensorflow-addons. It currently supports BERT, RoBERTa, ALBERT, NEZHA, GPT, and other models, and provides a CRF layer (which relies on tensorflow-addons), Normalization, several mask-aware Pooling layers, Embeddings (including several position Embeddings and HybridEmbedding), common activation functions, Metrics, and more.
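
Because the layout mirrors tf.keras, imports follow the same sub-package pattern. A minimal sketch: Tokenizer and build_transformer appear in the usage example below, while the CRF import path and name are an assumption based on the feature list above, not the confirmed API.

# The CRF name/path is an assumption; the other two imports appear in the example below.
from tf2bert.layers import CRF                 # CRF layer (needs tensorflow-addons)
from tf2bert.text.tokenizers import Tokenizer  # text processing utilities
from tf2bert.models import build_transformer   # builds BERT/RoBERTa/NEZHA/GPT models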

Model overview

A brief summary of the distinguishing features of common PTMs:

  • BERT: stacked Transformer Encoder layers; the classic trainable PositionEmbedding; MLM + NSP pre-training; a Byte Pair Encoding tokenizer; the Chinese version introduces WWM (Whole Word Masking)
  • ALBERT: Factorized Embedding Parameterization; cross-layer parameter sharing (which can be seen as a form of regularization); Sentence Order Prediction (SOP)
  • RoBERTa: the Chinese WWM (Whole Word Masking) strategy; dynamic masking; a Byte Pair Encoding tokenizer; drops NSP and keeps MLM; larger datasets; longer text sequences
  • ERNIE: masking extended to phrase-level and entity-level masks, injecting entity-related prior knowledge into the model
  • NEZHA: switches to the classic relative PositionEmbedding; uses the LAMB optimizer to speed up training
  • GPT: stacked Transformer Decoder layers; a language model; no LN after summing the Embedding layers
  • GPT2: more parameters and larger capacity; LN moved to the input of each sub-block; an extra LN added after the final Attention block; segment input removed
  • GPT2ML: multilingual support; simplified and cleaned-up GPT2 training
  • +LM: a lower-triangular Mask for language modeling
  • +UniLM: a Segment-based lower-triangular Mask that lets BERT handle Seq2Seq tasks; the input part uses bidirectional Attention while the output part uses unidirectional Attention (sketched below)
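
To make the +UniLM entry concrete, the sketch below builds that Seq2Seq attention mask from segment ids with NumPy. It only illustrates the masking rule; the function name and the NumPy formulation are assumptions, not tf2bert's internal implementation.

import numpy as np

def unilm_attention_mask(segment_ids):
    """UniLM-style Seq2Seq mask: every position may attend to the whole
    input part (segment 0), while the output part (segment 1) only attends
    to itself and earlier positions. Illustrative sketch only."""
    s = np.asarray(segment_ids)
    idx = np.arange(len(s))
    # can_attend[j, i] == True means query position j may attend to key position i
    can_attend = (s[None, :] == 0) | (idx[None, :] <= idx[:, None])
    return can_attend.astype("float32")

# "[CLS] x1 x2 [SEP]" as input (segment 0), "y1 y2 [SEP]" as output (segment 1)
print(unilm_attention_mask([0, 0, 0, 0, 1, 1, 1]))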

Why do pre-trained models work?

  • Large-scale, high-quality unlabeled data
  • Well-designed self-supervised (unsupervised) learning objectives
  • The SelfAttention-based Transformer, a feature extractor with little built-in inductive bias

Usage

Dependencies: tensorflow 2.x and tensorflow-addons. Because the project is updated frequently it is not packaged for pip; adding it to PYTHONPATH is recommended instead.

Clone the project into {your_path}:

git clone https://github.com/allenwind/tf2bert.git

To update, simply run git pull to fetch the latest source, or delete the local copy and git clone it again.

Open ~/.bashrc and add the project path to the PYTHONPATH environment variable:

export PYTHONPATH={your_path}/tf2bert:$PYTHONPATH

Then reload the configuration:

source ~/.bashrc

A simple example:

import numpy as np
from tf2bert.text.tokenizers import Tokenizer
from tf2bert.text.utils import load_sentences
from tf2bert.models import build_transformer

# Paths to the pre-trained Chinese BERT weights (chinese_L-12_H-768_A-12)
config_path = "bert/chinese_L-12_H-768_A-12/bert_config.json"
checkpoint_path = "bert/chinese_L-12_H-768_A-12/bert_model.ckpt"
token_dict_path = "bert/chinese_L-12_H-768_A-12/vocab.txt"

tokenizer = Tokenizer(token_dict_path)
# Build a BERT encoder and load the checkpoint weights
model = build_transformer(
    model="bert+encoder",
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    verbose=True
)

# Encode each sentence and extract its contextual features
for sentence in load_sentences():
    token_ids, segment_ids = tokenizer.encode(sentence)
    token_ids = np.array([token_ids])
    segment_ids = np.array([segment_ids])
    features = model.predict([token_ids, segment_ids])
    print(sentence)
    print(features.shape)
    print(features)
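
The model returned above behaves like a regular Keras model, so a task head can be stacked on its output. A minimal fine-tuning sketch under that assumption; the [CLS]-pooling choice, the two-class head, and the learning rate are illustrative, not part of tf2bert's documented API.

import tensorflow as tf

# Reuse the encoder built above and pool the [CLS] position.
cls_vector = tf.keras.layers.Lambda(lambda x: x[:, 0])(model.output)
outputs = tf.keras.layers.Dense(2, activation="softmax")(cls_vector)
classifier = tf.keras.Model(model.inputs, outputs)

classifier.compile(
    optimizer=tf.keras.optimizers.Adam(2e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
# classifier.fit(x=[batch_token_ids, batch_segment_ids], y=labels, epochs=3)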

For more examples, see the code under the nlptasks and tests directories.

Weight downloads

BERT/RoBERTa:

ALBERT:

NEZHA:

GPT/GPT2/GPT2ML:

XLNet:

Related links

Attention Is All You Need

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Pre-Training with Whole Word Masking for Chinese BERT

RoBERTa: A Robustly Optimized BERT Pretraining Approach

NEZHA: Neural Contextualized Representation for Chinese Language Understanding

Unified Language Model Pre-training for Natural Language Understanding and Generation

Are Pre-trained Convolutions Better than Pre-trained Transformers?

Synthesizer: Rethinking Self-Attention in Transformer Models

MLP-Mixer: An all-MLP Architecture for Vision
