---
tags:
  - transformers
  - ml
---

# LayoutLMv3

The third version of LayoutLM, introduced by Huang et al. Compared to the second version, v3 simplifies the architecture to a single Transformer, avoiding the need for a CNN-based visual encoder.

## Architecture

### Embeddings

The Transformer receives text and image tokens. Text tokens are obtained by OCR and are accompanied by 1D positional embeddings and 2D layout embeddings. The layout embeddings encode the bounding box of the text segment in which the token was found, so tokens in the same segment share the same 2D layout embeddings.
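A minimal sketch of how such text embeddings could be composed, assuming illustrative sizes (vocabulary, hidden dimension, coordinate range) and a simple sum of word, 1D positional, and bounding-box embeddings; the exact layout-embedding parameterization in the paper may differ:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the paper's exact hyperparameters.
vocab_size, hidden, max_len, coord_size = 30522, 768, 512, 1024

word_emb = nn.Embedding(vocab_size, hidden)   # semantic embedding (masked by MLM)
pos_1d_emb = nn.Embedding(max_len, hidden)    # 1D position in the token sequence
x_emb = nn.Embedding(coord_size, hidden)      # 2D layout: x-coordinates of the segment box
y_emb = nn.Embedding(coord_size, hidden)      # 2D layout: y-coordinates of the segment box

def embed_text(token_ids, bboxes):
    """token_ids: (B, seq); bboxes: (B, seq, 4) as (x0, y0, x1, y1) in [0, coord_size)."""
    pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
    layout = (x_emb(bboxes[..., 0]) + y_emb(bboxes[..., 1])
              + x_emb(bboxes[..., 2]) + y_emb(bboxes[..., 3]))
    return word_emb(token_ids) + pos_1d_emb(pos_ids) + layout

# Two tokens from the same OCR segment share a box, hence the same layout embedding.
tokens = torch.tensor([[101, 2023, 2003]])
boxes = torch.tensor([[[0, 0, 0, 0], [48, 84, 73, 96], [48, 84, 73, 96]]])
print(embed_text(tokens, boxes).shape)  # torch.Size([1, 3, 768])
```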

Image tokens are flattened patches of the resized image, projected by a linear layer. This design comes from the Vision Transformer (ViT) and ViLT. (TODO)
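A sketch of the ViT-style patchification under assumed sizes (224×224 input, 16×16 patches, hidden size 768), flattening each patch and passing it through a linear layer:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # resized page image
patch, hidden = 16, 768

proj = nn.Linear(3 * patch * patch, hidden)

# (B, 3, 224, 224) -> (B, 3, 14, 14, 16, 16) -> (B, 196, 3*16*16) -> (B, 196, hidden)
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
image_tokens = proj(patches)
print(image_tokens.shape)  # torch.Size([1, 196, 768])
```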

### Self-attention

v3 uses the same spatial-aware self-attention as v2.
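The idea, roughly, is to add learnable biases for relative 1D and 2D positions to the attention scores. A hedged sketch, with an invented clamping-to-buckets scheme and illustrative bucket/head counts (the paper uses a more elaborate relative-position bucketing):

```python
import torch
import torch.nn as nn

num_heads, rel_1d_buckets, rel_2d_buckets = 12, 32, 64

bias_1d = nn.Embedding(rel_1d_buckets, num_heads)  # bias for relative sequence distance
bias_x = nn.Embedding(rel_2d_buckets, num_heads)   # bias for relative x-distance of boxes
bias_y = nn.Embedding(rel_2d_buckets, num_heads)   # bias for relative y-distance of boxes

def spatial_attention_scores(q, k, pos_1d, box_x, box_y):
    """q, k: (B, heads, seq, dim); pos_1d, box_x, box_y: (B, seq) integer positions."""
    scores = q @ k.transpose(-1, -2) / q.size(-1) ** 0.5

    def bucket(rel, n):  # shift and clamp relative distances into a fixed index range
        return (rel + n // 2).clamp(0, n - 1)

    rel_1d = bucket(pos_1d[:, :, None] - pos_1d[:, None, :], rel_1d_buckets)
    rel_x = bucket(box_x[:, :, None] - box_x[:, None, :], rel_2d_buckets)
    rel_y = bucket(box_y[:, :, None] - box_y[:, None, :], rel_2d_buckets)

    # (B, seq, seq, heads) -> (B, heads, seq, seq), added to the raw attention scores
    bias = bias_1d(rel_1d) + bias_x(rel_x) + bias_y(rel_y)
    return scores + bias.permute(0, 3, 1, 2)
```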

## Pretraining

The model is pretrained with three losses: MLM, MIM, and WPA.

The adjusted Masked Language Modelling (MLM) loss masks out the semantic part of the text embedding (the 1D positional and 2D layout embeddings are left unmasked). The goal is to predict the masked text tokens.
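A small sketch of that masking step under assumed values (mask token id and masking ratio are illustrative): only the token ids are replaced, while positions and boxes are passed through unchanged.

```python
import torch

mask_token_id, mask_prob = 103, 0.3   # illustrative values

def mask_text(token_ids):
    """token_ids: (B, seq). Returns masked inputs and MLM labels (-100 = not predicted)."""
    labels = token_ids.clone()
    masked = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    labels[~masked] = -100                          # loss only on masked positions
    inputs = token_ids.masked_fill(masked, mask_token_id)
    return inputs, labels                           # bounding boxes stay untouched
```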

Masked Image Modelling (MIM) is the mirror image of MLM for the image modality. The loss masks out some image tokens and trains the model to predict a discrete, lower-dimensional representation of each masked token. (TODO: connection to DALL-E through the mentioned paper)
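A hedged sketch of such an MIM head, assuming the targets are discrete codes from some image tokenizer (codebook size is illustrative) and that only masked patches contribute to the loss:

```python
import torch
import torch.nn as nn

hidden, codebook_size = 768, 8192
mim_head = nn.Linear(hidden, codebook_size)

def mim_loss(hidden_states, target_codes, patch_masked):
    """hidden_states: (B, num_patches, hidden); target_codes: (B, num_patches) code ids;
    patch_masked: (B, num_patches) bool mask of which image tokens were masked."""
    logits = mim_head(hidden_states[patch_masked])  # predict codes only for masked patches
    return nn.functional.cross_entropy(logits, target_codes[patch_masked])
```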

The Word-Patch Alignment (WPA) loss forces the model to align the two modalities: for each unmasked text token, the model predicts whether its corresponding image patch was masked by MIM.
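A sketch of how the WPA targets could be built, assuming a precomputed mapping from each text token to the patch covering its box (the helper and label convention here are hypothetical):

```python
import torch

def wpa_labels(text_masked, patch_masked, token_to_patch):
    """text_masked: (seq,) bool MLM mask; patch_masked: (num_patches,) bool MIM mask;
    token_to_patch: (seq,) index of the patch containing each token's box."""
    aligned = ~patch_masked[token_to_patch]   # True if the token's patch survived MIM
    labels = aligned.long()                   # 1 = aligned, 0 = unaligned
    labels[text_masked] = -100                # masked text tokens are excluded from WPA
    return labels

text_masked = torch.tensor([False, True, False, False])
patch_masked = torch.tensor([True, False, False])
token_to_patch = torch.tensor([0, 1, 1, 2])
print(wpa_labels(text_masked, patch_masked, token_to_patch))  # tensor([  0, -100, 1, 1])
```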