
[models] Vit: fix intermediate size scale and unify TF to PT #1063

Merged · 3 commits into mindee:main · Sep 19, 2022

Conversation

@felixdittrich92 (Contributor) commented Sep 16, 2022

This PR:

  • fix the intermediate size scaling: previously the PFF (MLP) hidden dim was always 768 (the same as d_model), but the base model uses 3072, i.e. a ×4 scale
  • unify the TF implementation with PT

Any feedback is welcome 🤗
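A quick back-of-the-envelope count shows how much the ×4 scale matters for the PFF (MLP) block. This is an illustrative sketch, not doctr's actual code; `linear_params` is a hypothetical helper counting a dense layer's weights plus biases:

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Parameter count of a dense layer: weights + biases."""
    return n_in * n_out + n_out

d_model = 768

# Before the fix: the MLP hidden dim stayed at d_model (768)
mlp_before = linear_params(d_model, d_model) + linear_params(d_model, d_model)

# After the fix: hidden dim scaled x4 to 3072, as in ViT-Base
hidden = 4 * d_model
mlp_after = linear_params(d_model, hidden) + linear_params(hidden, d_model)

print(mlp_before)  # 1181184 params per MLP block
print(mlp_after)   # 4722432 params per MLP block
```

Repeated across the 12 encoder layers, this accounts for the bulk of the gap to ViT-Base's ~86M parameters.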

PT: (@frgfm thanks for torch-scan 👍 )

__________________________________________________________________________________________
Layer                        Type                  Output Shape              Param #
==========================================================================================
visiontransformer            VisionTransformer     (-1, 126)                 0
├─0                          PatchEmbedding        (-1, 65, 768)             37,632
├─1                          EncoderBlock          (-1, 65, 768)             85,022,208
├─2                          ClassifierHead        (-1, 126)                 96,894
==========================================================================================
Trainable params: 85,207,422
Non-trainable params: 0
Total params: 85,207,422
------------------------------------------------------------------------------------------
Model size (params + buffers): 325.04 Mb
Framework & CUDA overhead: 1575.00 Mb
Total RAM usage: 1900.04 Mb
------------------------------------------------------------------------------------------
Floating Point Operations on forward: 6.25 MFLOPs
Multiply-Accumulations on forward: 405.06 kMACs
Direct memory accesses on forward: 102.08 MDMAs
__________________________________________________________________________________________

TF:

_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 patch_embedding (PatchEmbedding)    (1, 65, 768)             88,320

 encoder_block (EncoderBlock)        (1, 65, 768)             85,022,208

 classifier_head (ClassifierHead)    (1, 126)                 96,894


=================================================================
Total params: 85,207,422
Trainable params: 85,207,422
Non-trainable params: 0

As you can see, the models are nearly identical; the only structural difference is the patch embedding (PT uses a linear projection, TF a Conv2D projection).
Compared with timm's implementation, our PT model uses ~6.5 GB VRAM vs ~7 GB for timm's.
The TF model uses ~15 GB VRAM. @frgfm do you know any reason why? 😅
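On the patch-embedding difference: a Conv2D with kernel size = stride = patch size computes the same projection as unfolding the patches and applying a Linear layer, so the parameter counts match. A sketch with the numbers from this PR (32×32×3 input, 4×4 patches, d_model = 768; pure arithmetic, not doctr code):

```python
patch, channels, d_model = 4, 3, 768

# PT-style: flatten each 4x4x3 patch into a 48-dim vector, project with Linear
linear_proj = (patch * patch * channels) * d_model + d_model  # weights + bias

# TF-style: Conv2D(filters=768, kernel_size=4, strides=4) over the image
conv_proj = (patch * patch * channels) * d_model + d_model    # identical count

print(linear_proj, conv_proj)  # 37632 37632, matching the PT summary
```

The TF PatchEmbedding count of 88,320 would then be this projection plus the positional embeddings and class token folded into the same layer: 37,632 + 65 × 768 + 768 = 88,320 (my reading of the two summaries, not verified against the code).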

For comparison, timm's implementation:

__________________________________________________________________________________________
Layer                        Type                  Output Shape              Param #        
==========================================================================================
visiontransformer            VisionTransformer     (-1, 126)                 0              
├─patch_embed                PatchEmbed            (-1, 64, 768)             37,632         
├─pos_drop                   Dropout               (-1, 65, 768)             0              
├─blocks                     Sequential            (-1, 65, 768)             85,054,464     
├─norm                       LayerNorm             (-1, 65, 768)             1,536          
├─fc_norm                    Identity              (-1, 768)                 0              
├─head                       Linear                (-1, 126)                 96,894         
==========================================================================================
Trainable params: 85,241,214
Non-trainable params: 0
Total params: 85,241,214
------------------------------------------------------------------------------------------
Model size (params + buffers): 325.17 Mb
Framework & CUDA overhead: 1575.00 Mb
Total RAM usage: 1900.17 Mb
------------------------------------------------------------------------------------------
Floating Point Operations on forward: 10.71 MFLOPs
Multiply-Accumulations on forward: 2.66 MMACs
Direct memory accesses on forward: 108.10 MDMAs

Training with this PR (the TF setup is mostly identical):

(doctr-dev) felix@felix-GS66-Stealth-11UH:~/Desktop/doctr$ python3 /home/felix/Desktop/doctr/references/classification/train_pytorch.py vit_b
2022-09-16 10:03:43.296373: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Namespace(amp=False, arch='vit_b', batch_size=64, device=None, epochs=10, export_onnx=False, find_lr=False, font='FreeMono.ttf,FreeSans.ttf,FreeSerif.ttf', input_size=32, lr=0.001, name=None, pretrained=False, push_to_hub=False, resume=None, sched='cosine', show_samples=False, test_only=False, train_samples=1000, val_samples=20, vocab='french', wb=False, weight_decay=0, workers=None)
Validation set loaded in 0.2844s (2520 samples in 40 batches)
Train set loaded in 0.2508s (126000 samples in 1968 batches)
Validation loss decreased inf --> 0.991241: saving state...                                                                                                     
Epoch 1/10 - Validation loss: 0.991241 (Acc: 70.79%)
Validation loss decreased 0.991241 --> 0.758742: saving state...                                                                                                
Epoch 2/10 - Validation loss: 0.758742 (Acc: 77.86%)
Epoch 3/10 - Validation loss: 1.20299 (Acc: 71.55%)                                                                                                             
Validation loss decreased 0.758742 --> 0.347141: saving state...                                                                                                
Epoch 4/10 - Validation loss: 0.347141 (Acc: 86.87%)
Validation loss decreased 0.347141 --> 0.308255: saving state...                                                                                                
Epoch 5/10 - Validation loss: 0.308255 (Acc: 88.69%)
Validation loss decreased 0.308255 --> 0.277491: saving state...                                                                                                
Epoch 6/10 - Validation loss: 0.277491 (Acc: 89.88%)
Validation loss decreased 0.277491 --> 0.153586: saving state...                                                                                                
Epoch 7/10 - Validation loss: 0.153586 (Acc: 94.96%)
Validation loss decreased 0.153586 --> 0.0993369: saving state...                                                                                               
Epoch 8/10 - Validation loss: 0.0993369 (Acc: 96.79%)
Validation loss decreased 0.0993369 --> 0.0867528: saving state...                                                                                              
Epoch 9/10 - Validation loss: 0.0867528 (Acc: 97.06%)
Validation loss decreased 0.0867528 --> 0.0744964: saving state...                                                                                              
Epoch 10/10 - Validation loss: 0.0744964 (Acc: 97.90%)
(doctr-dev-tf) felix@felix-GS66-Stealth-11UH:~/Desktop/doctr$ python3 /home/felix/Desktop/doctr/references/classification/train_tensorflow.py vit_b
Namespace(amp=False, arch='vit_b', batch_size=64, epochs=10, export_onnx=False, find_lr=False, font='FreeMono.ttf,FreeSans.ttf,FreeSerif.ttf', input_size=32, lr=0.001, name=None, pretrained=False, push_to_hub=False, resume=None, show_samples=False, test_only=False, train_samples=1000, val_samples=20, vocab='french', wb=False, workers=None)
Validation set loaded in 1.145s (2520 samples in 40 batches)
Train set loaded in 1.148s (126000 samples in 1968 batches)
Validation loss decreased inf --> 0.142181: saving state...                                                                                                     
Epoch 1/10 - Validation loss: 0.142181 (Acc: 95.83%)
Validation loss decreased 0.142181 --> 0.0494551: saving state...                                                                                               
Epoch 2/10 - Validation loss: 0.0494551 (Acc: 98.21%)
Validation loss decreased 0.0494551 --> 0.0102294: saving state...                                                                                              
Epoch 3/10 - Validation loss: 0.0102294 (Acc: 99.44%)

@felixdittrich92 felixdittrich92 self-assigned this Sep 16, 2022
@felixdittrich92 felixdittrich92 added this to the 0.6.0 milestone Sep 16, 2022
@felixdittrich92 felixdittrich92 added labels Sep 16, 2022: type: bug (Something isn't working), module: models (Related to doctr.models), framework: pytorch (Related to PyTorch backend), framework: tensorflow (Related to TensorFlow backend), topic: character classification (Related to the task of character classification)

codecov bot commented Sep 16, 2022

Codecov Report

Merging #1063 (717c4ba) into main (a95baaa) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1063      +/-   ##
==========================================
+ Coverage   95.16%   95.17%   +0.01%     
==========================================
  Files         141      141              
  Lines        5827     5821       -6     
==========================================
- Hits         5545     5540       -5     
+ Misses        282      281       -1     
Flag Coverage Δ
unittests 95.17% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
doctr/models/classification/vit/pytorch.py 100.00% <ø> (ø)
doctr/models/modules/transformer/pytorch.py 100.00% <ø> (ø)
doctr/models/modules/transformer/tensorflow.py 99.03% <ø> (ø)
doctr/models/modules/vision_transformer/pytorch.py 100.00% <ø> (ø)
doctr/models/classification/vit/tensorflow.py 100.00% <100.00%> (+2.17%) ⬆️
doctr/transforms/modules/base.py 94.59% <0.00%> (ø)


@frgfm (Collaborator) left a comment

Great work Felix 👏
One comment related to the ViT PRs, and another on a docstring typo!

Regarding TF: judging from the graph, the patch embedding is not memory-efficient (it's the only structural difference).

Review threads:
- doctr/models/classification/vit/tensorflow.py (resolved)
- doctr/models/classification/vit/pytorch.py (outdated, resolved)
- doctr/models/classification/vit/tensorflow.py (outdated, resolved)
@odulcy-mindee (Collaborator) left a comment

Thanks @felixdittrich92 ! 👍

@felixdittrich92 felixdittrich92 merged commit 4e763da into mindee:main Sep 19, 2022
@felixdittrich92 felixdittrich92 deleted the vit-bug branch September 19, 2022 06:51
@frgfm (Collaborator) left a comment

Thanks Felix 🙏

@felixdittrich92 felixdittrich92 mentioned this pull request Sep 26, 2022
85 tasks