Not getting prediction correctly using the model trained on the custom dataset (similar format as CORD-V2 dataset) #297

SiriusPoint · 2024-04-16T19:46:49Z

I have trained the Donut model on a custom dataset that follows the same format as the CORD-v2 dataset. Each image has multiple values per line, and each document has around 23 to 24 lines. I used "naver-clova-ix/donut-base" as the base model.
I am using 149 documents in total, split as follows:
training = 119 images
validation = 22 images
testing = 8 images

I have created three metadata.jsonl files, one each for train, validation, and test. Below is a sample entry from the metadata.jsonl file of the training dataset:

{"file_name": "IOB_Bank_31_image_0.jpg", "ground_truth": "{\"gt_parse\": {\"bank_stmt_entries\": [{\"TXN_DATE\": \"02-11-2023\", \"TXN_DESC\": \"SB Int: 10-2023:0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"93.00\", \"BALANCE_AMT\": \"10901.92\"}, {\"TXN_DATE\": \"09-12-2023\", \"TXN_DESC\": \"CHRGS- SMS ALERT\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"1.06\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10900.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"Debit Card AMC-2\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"295.00\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10605.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"SB Int: 01-2024: 0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"75,00\", \"BALANCE_AMT\": \"10680.86\"}]}}"}
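As a sanity check on this format, note that `ground_truth` is itself a JSON string nested inside the JSON line, so it has to be decoded twice. The following is a minimal sketch only: `check_metadata_line` is a hypothetical helper (not part of Donut), and the sample reuses the first entry from the line above.

```python
import json


def check_metadata_line(line: str) -> dict:
    """Decode one metadata.jsonl line and return its gt_parse dict.

    Hypothetical helper for illustration; not part of the Donut codebase.
    """
    record = json.loads(line)  # first decode: the JSONL record
    for key in ("file_name", "ground_truth"):
        if key not in record:
            raise ValueError(f"missing key: {key}")
    ground_truth = json.loads(record["ground_truth"])  # second decode: nested string
    if "gt_parse" not in ground_truth:
        raise ValueError("ground_truth has no gt_parse key")
    return ground_truth["gt_parse"]


# Rebuild a one-entry sample in the same double-encoded shape as the file.
entry = {
    "TXN_DATE": "02-11-2023",
    "TXN_DESC": "SB Int: 10-2023:0",
    "CHEQUE_REF_NO": None,
    "WITHDRAWAL_AMT": None,
    "DEPOSIT_AMT": "93.00",
    "BALANCE_AMT": "10901.92",
}
sample_line = json.dumps({
    "file_name": "IOB_Bank_31_image_0.jpg",
    "ground_truth": json.dumps({"gt_parse": {"bank_stmt_entries": [entry]}}),
})

gt_parse = check_metadata_line(sample_line)
# Collect the fields that are null in the first entry.
null_fields = [k for k, v in gt_parse["bank_stmt_entries"][0].items() if v is None]
print(len(gt_parse["bank_stmt_entries"]), null_fields)
```

Listing the null fields can be useful here, since how null values end up in the target token sequence depends on the preprocessing code, which is not shown in this issue.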

I trained the model for 30 epochs. The resulting values for loss and val_edit_distance are:

loss = 0.03544
val_edit_distance = 0.3443

The following config parameters were used for training:

"max_epochs":30,
"val_check_interval":0.2, # how many times we want to validate during an epoch
"check_val_every_n_epoch":1,
"gradient_clip_val":1.0,
"num_training_samples_per_epoch": 119,
"lr":3e-5,
"train_batch_sizes": [2],
"val_batch_sizes": [1],
"num_nodes": 1,
"warmup_steps": 180, # 800/8*30/10, 10%
"result_path": "/content/drive/MyDrive/universal-bank-statement-reader/processed-dataset/result",
"verbose": True,

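The inline comment on `warmup_steps` uses 800 samples and a batch size of 8, which do not match the 119 training images and batch size of 2 listed here. As a rough check, the sketch below recomputes a 10% warmup from the values in this config. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; `warmup_steps_for` is a hypothetical helper.

```python
import math


def warmup_steps_for(num_samples: int, batch_size: int, max_epochs: int,
                     warmup_fraction: float = 0.1) -> int:
    """Return warmup steps as a fraction of the total optimizer steps.

    Assumes one optimizer step per batch (single device, no gradient
    accumulation) and that the last partial batch is not dropped.
    """
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * max_epochs
    return round(total_steps * warmup_fraction)


# Values from the config above: 119 samples, batch size 2, 30 epochs.
this_run = warmup_steps_for(119, 2, 30)
# Values from the inline comment: 800 samples, batch size 8, 30 epochs.
from_comment = warmup_steps_for(800, 8, 30)
print(this_run, from_comment)
```

Under these assumptions the configured value of 180 is consistent with a 10% warmup for this dataset, even though the comment describes a different dataset size.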
When I run prediction on the test dataset, I get the following output (it comes from print statements I added at specific locations):

seq ==>: <s_bank-stmt>署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.4
3.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.

seq after token2json ==>: {'text_sequence': '署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43
.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.'}

ground_truth after json load ==>: {'gt_parse': {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}}

ground_truth ==>: {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}

evaluator ==>: <donut.util.JSONParseEvaluator object at 0x7d697edbfc10>

score ==>: 0
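In the output above, the result after token2json contains only a `text_sequence` key, meaning no field tags were recovered from the generated sequence. A minimal sketch for flagging such predictions across a test set could look like the following; `is_unparsed_prediction` is a hypothetical helper, and the sample strings are shortened stand-ins for the real output.

```python
def is_unparsed_prediction(pred: dict) -> bool:
    """Return True when the parsed prediction holds only raw text.

    Hypothetical helper: it relies on the behaviour visible in the output
    above, where an unparseable sequence comes back under 'text_sequence'.
    """
    return set(pred.keys()) == {"text_sequence"}


# Shortened stand-in for the degenerate prediction shown above.
bad_pred = {"text_sequence": "12-323310-3012-510-30.43.43.43"}
# A prediction with the expected top-level field from the ground truth.
good_pred = {"bank_stmt_entries": [{"TXN_DATE": "02-11-2023"}]}

flags = [is_unparsed_prediction(p) for p in (bad_pred, good_pred)]
print(flags)
```

Counting how many test predictions are flagged this way helps separate "the model emits no structure at all" from "the structure is there but the values are wrong".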

I used the following notebook as a reference:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb

Please help me identify and resolve the issue, and let me know if you need more information.

Thank you in advance


CarlosSerrano88 · 2024-06-16T08:18:03Z

@SiriusPoint any updates? I have the same problem

SiriusPoint · 2024-06-18T04:20:43Z

@CarlosSerrano88, Not yet. I am trying but not getting appropriate results.

banditgoose · 2024-06-24T23:19:52Z

The transformers implementation of Donut seems to have broken saving and loading at some point. Try transformers==4.26.1 and see if that works.

dreamlychina · 2024-06-27T08:07:08Z

any updates? I have the same problem

+1

CarlosSerrano88 · 2024-06-27T08:38:45Z

With transformers==4.25.1 it works perfectly!

dgarlor · 2024-07-12T12:20:57Z

I have the same problem with the latest version of transformers. After going back to 4.40.1, the saved model works again.
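Since the versions reported as working in this thread differ (4.26.1, 4.25.1, 4.40.1), it may help to print the installed version before training and again before inference. This is a small sketch using only the standard library; `installed_version` and `matches_pin` are hypothetical helpers, and the pin used below is just one of the versions mentioned above.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric parts, e.g. '4.26.1' -> (4, 26, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def matches_pin(installed: str, pinned: str) -> bool:
    """Compare two versions on their numeric release parts only."""
    return release_tuple(installed) == release_tuple(pinned)


PINNED = "4.26.1"  # one of the versions suggested in this thread
current = installed_version("transformers")
if current is None:
    print("transformers is not installed")
else:
    print(f"transformers {current}, matches pin: {matches_pin(current, PINNED)}")
```

Running the same check in both the training and the inference environment makes it easier to tell whether a model was saved with one version and loaded with another.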

SiriusPoint changed the title ~~Not getting prediction correctly using the model trained on the custom database~~ Not getting prediction correctly using the model trained on the custom dataset (similar format as CORD-V2 dataset) Apr 17, 2024
