Add LayoutLM Model #7064
Conversation
Codecov Report
@@ Coverage Diff @@
## master #7064 +/- ##
==========================================
- Coverage 80.32% 80.23% -0.10%
==========================================
Files 174 177 +3
Lines 33446 33878 +432
==========================================
+ Hits 26867 27181 +314
- Misses 6579 6697 +118
Continue to review full report at Codecov.
The PR is overall in a good shape! Thanks for your work!
@sgugger @patrickvonplaten Interested in this?
if bbox is None:
    bbox = torch.zeros(tuple(list(input_shape) + [4]), dtype=torch.long, device=device)

extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
I think you may find torch.view() better here.
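A quick sketch of the reviewer's suggestion (not part of the PR itself, and the shapes here are illustrative): the chained unsqueeze() calls and a single view() produce the same [batch, 1, 1, seq_len] broadcast shape for the attention mask.

```python
import torch

attention_mask = torch.ones(2, 16, dtype=torch.long)  # [batch, seq_len]

# Chained unsqueeze, as written in the PR:
a = attention_mask.unsqueeze(1).unsqueeze(2)

# Equivalent single view, as the reviewer suggests:
b = attention_mask.view(attention_mask.shape[0], 1, 1, attention_mask.shape[1])

assert a.shape == b.shape == (2, 1, 1, 16)
assert torch.equal(a, b)
```

Both forms are zero-copy reshapes of the same data; view() just expresses the target shape in one call.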
if head_mask.dim() == 1:
    head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
    head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1)
elif head_mask.dim() == 2:
    head_mask = head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)
same here
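The same view() idea applies to the head-mask broadcast above. A sketch (not from the PR; num_hidden_layers=12 and num_heads=8 are made-up values for illustration):

```python
import torch

num_hidden_layers, num_heads = 12, 8
head_mask = torch.ones(num_heads)  # dim() == 1: one mask value per head

# Chained unsqueeze, as written in the PR:
a = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
a = a.expand(num_hidden_layers, -1, -1, -1, -1)

# Single view, then the same expand:
b = head_mask.view(1, 1, -1, 1, 1).expand(num_hidden_layers, -1, -1, -1, -1)

assert a.shape == b.shape == (num_hidden_layers, 1, num_heads, 1, 1)
assert torch.equal(a, b)
```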
Thanks a lot for adding this model! Before we can merge, there is a tiny bit of work to do to make the modeling file independent from modeling_bert.py. Could you copy the classes you use from that file to the modeling_layoutlm file (and rename them from BertXxx to LayoutLMXxx)? You can look at the files for GPT and GPT-2 as examples. The idea is that this way, any researcher can directly tweak the model by copying its file without having to worry about extra stuff.
Otherwise, all looks good to me apart from a few nits.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Hi, I have moved and renamed the classes from modeling_bert to modeling_layoutlm and the code is ready for review and merge. @JetRunner @sgugger
Thanks for the copied code. This all looks good to me and ready to merge, with just one nit.
This is exceptionally clean! Great job!
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@JetRunner @sgugger @LysandreJik The new suggestions have been adopted.
Great, thanks a lot @liminghao1630
* first version
* finish test docs readme model/config/tokenization class
* apply make style and make quality
* fix layoutlm GitHub link
* fix conflict in index.rst and add layoutlm to pretrained_models.rst
* fix bug in test_parents_and_children_in_mappings
* reformat modeling_auto.py and tokenization_auto.py
* fix bug in test_modeling_layoutlm.py
* Update docs/source/model_doc/layoutlm.rst (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>, applied twice)
* remove inh, add tokenizer fast, and update some doc
* copy and rename necessary class from modeling_bert to modeling_layoutlm
* Update src/transformers/configuration_layoutlm.py (Co-authored-by: Lysandre Debut <lysandre@huggingface.co>, applied four times)
* Update src/transformers/modeling_layoutlm.py (Co-authored-by: Lysandre Debut <lysandre@huggingface.co>, applied three times)
* add mish to activations.py, import ACT2FN and import logging from utils

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
This reverts commit 9beb08c.
Introduction
This pull request implements the LayoutLM model, as defined in the paper:
LayoutLM is a simple but effective pre-training method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding. LayoutLM achieves state-of-the-art (SOTA) results on multiple datasets.
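LayoutLM's distinguishing input is a per-token bounding box scaled to a 0-1000 virtual grid, independent of the page's pixel size. A minimal sketch of that scaling (the helper name is hypothetical, not part of the PR's API):

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# On a 1000x2000 px page, x-coordinates keep their value and
# y-coordinates are halved:
print(normalize_bbox((100, 200, 300, 400), 1000, 2000))  # -> (100, 100, 300, 200)
```

In the model itself, these four coordinates are looked up in learned position embeddings and added to the usual token embeddings.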
Typical workflow for including a model
Here is an overview of the general workflow:
Let's detail what should be done at each step.
Adding model/configuration/tokenization classes
Here is the workflow for adding model/configuration/tokenization classes:

1. Copy the python files from the present folder to the main folder and rename them, replacing xxx with your model name.
2. Edit the files to replace XXX (with various casing) with your model name.
3. Copy-paste or create a simple configuration class for your model in the configuration_... file.
4. Copy-paste or create the code for your model in the modeling_... files (PyTorch and TF 2.0).
5. Copy-paste or create a tokenizer class for your model in the tokenization_... file.

Adding conversion scripts
Here is the workflow for the conversion scripts:

1. Copy the conversion script (convert_...) from the present folder to the main folder.

Adding tests:
Here is the workflow for adding the tests:

1. Copy the python files from the tests sub-folder of the present folder to the tests subfolder of the main folder and rename them, replacing xxx with your model name.
2. Edit the files to replace XXX (with various casing) with your model name.

Documenting your model:
Here is the workflow for documentation:

1. Make sure that XXX_START_DOCSTRING contains an introduction to the model you're adding and a link to the original article, and that XXX_INPUTS_DOCSTRING contains all the inputs of your model.
2. Create a new page xxx.rst in the folder docs/source/model_doc and add this file in docs/source/index.rst.

Make sure to check you have no sphinx warnings when building the documentation locally and follow our documentation guide.
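The XXX_START_DOCSTRING / XXX_INPUTS_DOCSTRING constants are typically stitched onto the model classes with a decorator. A toy sketch of the idea (this is a simplified stand-in, not transformers' actual helper):

```python
LAYOUTLM_START_DOCSTRING = "LayoutLM model: see the original article for details. "
LAYOUTLM_INPUTS_DOCSTRING = "Args: input_ids, bbox, attention_mask, token_type_ids."

def add_start_docstrings(*docstr):
    """Prepend the given docstring fragments to a class's own docstring."""
    def decorator(cls):
        cls.__doc__ = "".join(docstr) + (cls.__doc__ or "")
        return cls
    return decorator

@add_start_docstrings(LAYOUTLM_START_DOCSTRING)
class LayoutLMModel:
    """The bare LayoutLM transformer outputting raw hidden states."""

print(LayoutLMModel.__doc__.startswith("LayoutLM model"))  # -> True
```

This keeps the shared boilerplate (intro, input descriptions) in one place so every model class in the file documents its inputs consistently.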
Final steps
You can then finish the addition step by adding imports for your classes in the common files:
1. Add imports for your classes in __init__.py.
2. Add your configuration in configuration_auto.py.
3. Add your model in modeling_auto.py and modeling_tf_auto.py.
4. Add your tokenizer in tokenization_auto.py.
5. Add a link to your conversion script in the main conversion utility (in commands/convert.py).
6. Add your model in the convert_pytorch_checkpoint_to_tf2.py file.
7. Add a mention of your model in README.md and the documentation itself in docs/source/index.rst and docs/source/pretrained_models.rst.
8. Upload your pretrained weights following the model sharing documentation.
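The auto-class entries in steps 2-4 boil down to a mapping from a model-type string to a class, so the Auto* factories can dispatch on it. A toy sketch of the pattern (not transformers' real internals; the class and function names are illustrative):

```python
class BertConfig: ...
class LayoutLMConfig: ...

# configuration_auto.py keeps a mapping like this; adding a model means
# adding one entry.
CONFIG_MAPPING = {
    "bert": BertConfig,
    "layoutlm": LayoutLMConfig,  # the new entry this checklist asks for
}

def auto_config(model_type):
    """Instantiate the config class registered for a model-type string."""
    try:
        return CONFIG_MAPPING[model_type]()
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type}")

print(type(auto_config("layoutlm")).__name__)  # -> LayoutLMConfig
```

modeling_auto.py and tokenization_auto.py follow the same shape, mapping the same keys to model and tokenizer classes respectively.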