
Not getting prediction correctly using the model trained on the custom dataset (similar format as CORD-V2 dataset) #297

Open
SiriusPoint opened this issue Apr 16, 2024 · 6 comments

Comments

@SiriusPoint

I have trained the Donut model on a custom dataset that follows the same format as the CORD-v2 dataset. Each image has multiple values per line, and each document has around 23 to 24 lines. I used "naver-clova-ix/donut-base" as the base model.
I am using 149 documents in total, split as follows:
training = 119 images
validation = 22 images
testing = 8 images

I have created three metadata.jsonl files, one each for train, validation, and test. Below is a sample entry from the metadata.jsonl file of the training dataset:

{"file_name": "IOB_Bank_31_image_0.jpg", "ground_truth": "{\"gt_parse\": {\"bank_stmt_entries\": [{\"TXN_DATE\": \"02-11-2023\", \"TXN_DESC\": \"SB Int: 10-2023:0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"93.00\", \"BALANCE_AMT\": \"10901.92\"}, {\"TXN_DATE\": \"09-12-2023\", \"TXN_DESC\": \"CHRGS- SMS ALERT\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"1.06\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10900.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"Debit Card AMC-2\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"295.00\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10605.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"SB Int: 01-2024: 0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"75,00\", \"BALANCE_AMT\": \"10680.86\"}]}}"}
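For readers checking their own files against this format: in the sample above, `ground_truth` is itself a JSON string nested inside the outer JSON object, which is why its quotes appear escaped. The sketch below shows one way such a line could be written and read back using only the standard library. The helper names `make_metadata_line` and `read_metadata_line` are hypothetical and are not part of Donut or the referenced notebook.

```python
import json

def make_metadata_line(file_name, entries):
    """Build one metadata.jsonl line; ground_truth is serialized as a JSON string."""
    gt = {"gt_parse": {"bank_stmt_entries": entries}}
    return json.dumps({"file_name": file_name, "ground_truth": json.dumps(gt)})

def read_metadata_line(line):
    """Parse one metadata.jsonl line back into (file_name, gt_parse dict)."""
    record = json.loads(line)
    return record["file_name"], json.loads(record["ground_truth"])["gt_parse"]

# One entry copied from the sample above; Python None becomes JSON null.
entries = [{
    "TXN_DATE": "02-11-2023",
    "TXN_DESC": "SB Int: 10-2023:0",
    "CHEQUE_REF_NO": None,
    "WITHDRAWAL_AMT": None,
    "DEPOSIT_AMT": "93.00",
    "BALANCE_AMT": "10901.92",
}]

line = make_metadata_line("IOB_Bank_31_image_0.jpg", entries)
name, parsed = read_metadata_line(line)
print(line)
```

Round-tripping every line of a metadata.jsonl file this way is a cheap check that each `ground_truth` string is valid JSON before training.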

I trained the model for 30 epochs; the resulting loss and val_edit_distance values are:

loss = 0.03544
val_edit_distance = 0.3443

The following config parameters were used for training:

  • "max_epochs":30,
  • "val_check_interval":0.2, # how many times we want to validate during an epoch
  • "check_val_every_n_epoch":1,
  • "gradient_clip_val":1.0,
  • "num_training_samples_per_epoch": 119,
  • "lr":3e-5,
  • "train_batch_sizes": [2],
  • "val_batch_sizes": [1],
  • "num_nodes": 1,
  • "warmup_steps": 180, # 800/8*30/10, 10%
  • "result_path": "/content/drive/MyDrive/universal-bank-statement-reader/processed-dataset/result",
  • "verbose": True,

When I run prediction on the test dataset, I get the following output (from print statements I added at specific locations):

seq ==>: <s_bank-stmt>署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.4
3.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.
seq after token2json ==>: {'text_sequence': '署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43
.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.'}
ground_truth after json load ==>: {'gt_parse': {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}}
ground_truth ==>: {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}
evaluator ==>: <donut.util.JSONParseEvaluator object at 0x7d697edbfc10>
score ==>: 0
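The `text_sequence` key in the output above suggests that no paired field tags such as `<s_TXN_DATE>...</s_TXN_DATE>` were found in the decoded sequence. As far as I understand it, Donut's `token2json` falls back to that key in this case, but treat that as an assumption. The sketch below is a standard-library diagnostic, not Donut's own code: the hypothetical helper `field_tags` lists the start tags that have a matching end tag, and the two input strings are shortened illustrations rather than real model output.

```python
import re

def field_tags(sequence):
    """Return names of <s_NAME> start tags that also have a </s_NAME> end tag."""
    names = re.findall(r"<s_(.*?)>", sequence)
    return [n for n in names if f"</s_{n}>" in sequence]

# Shortened illustration of a well-formed target sequence (assumed layout).
good = (
    "<s_bank_stmt_entries><s_TXN_DATE>02-11-2023</s_TXN_DATE>"
    "<s_DEPOSIT_AMT>93.00</s_DEPOSIT_AMT></s_bank_stmt_entries>"
)
# Shortened illustration of the degenerate output shown above.
bad = "<s_bank-stmt>署署署12-323310-3012.43.43.43"

print(field_tags(good))  # ['bank_stmt_entries', 'TXN_DATE', 'DEPOSIT_AMT']
print(field_tags(bad))   # []
```

An empty list for a prediction means there is nothing for the evaluator to match against the ground truth, which is consistent with a score of 0.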

I used the following notebook as a reference:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb

Please help me identify and resolve the issue, and let me know if you need more information.

Thank you in advance

@SiriusPoint changed the title from "Not getting prediction correctly using the model trained on the custom database" to "Not getting prediction correctly using the model trained on the custom dataset (similar format as CORD-V2 dataset)" on Apr 17, 2024
@CarlosSerrano88

@SiriusPoint any updates? I have the same problem

@SiriusPoint
Author

@CarlosSerrano88, Not yet. I am trying but not getting appropriate results.

@banditgoose

The transformers implementation of Donut seems to have broken saving and loading at some point. Try transformers==4.26.1 and see if that works.

@dreamlychina

any updates? I have the same problem

+1

@CarlosSerrano88

With transformers==4.25.1 it works perfectly!

@dgarlor

dgarlor commented Jul 12, 2024

I have the same problem with the latest version of transformers. After going back to 4.40.1, the saved model works again.
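Since the comments in this thread point at the installed transformers version, it may help to check which version an environment actually has before retraining. A minimal sketch using only the standard library; the set below lists the versions mentioned in this thread (4.26.1 was a suggestion, 4.25.1 and 4.40.1 were reported as working), and the names `MENTIONED_IN_THREAD` and `installed_version` are hypothetical.

```python
from importlib import metadata

# Versions mentioned by commenters in this thread; not an official compatibility list.
MENTIONED_IN_THREAD = {"4.25.1", "4.26.1", "4.40.1"}

def installed_version(package="transformers"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version()
if version is None:
    print("transformers is not installed")
elif version in MENTIONED_IN_THREAD:
    print(f"transformers {version}: one of the versions mentioned in this thread")
else:
    print(f"transformers {version}: not among the versions mentioned in this thread")
```

Running the same check in both the training and the inference environment can rule out a version mismatch between saving and loading the model.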
