Configure sroie_data_preprocessing.py for expand CLASS_LIST #27

Open
kerberosargos opened this issue Jun 3, 2024 · 9 comments

@kerberosargos
kerberosargos commented Jun 3, 2024

Hello again. I need help expanding CLASS_LIST. Thank you in advance for your support.

I have configured
SROIE_CLASS_LIST = ["others", "company", "address", "document_number", "date_time", "total", "tax"]

A sample box file is as follows:

182,70,435,70,435,110,182,110,BURGER KING,company
97,112,512,112,512,155,97,155,EKUR İNŞAAT SANAYİ VE TİCARET A.Ş.,other
42,152,570,152,570,200,42,200,MEVLANA MH.Ç.MEHMET CD. NO:33/A MARMARAPARK,address
70,194,544,194,544,242,70,242,AVM. 3F02 ESENLER/İST. TİC. SİC. NO:300241,address
95,238,522,242,522,291,94,287,BOĞAZİÇİ KURUMLAR V.D.330 005 3911,other
44,312,177,312,177,360,44,360,13/05/2024,date_time
390,315,570,315,570,362,390,362,FİŞ NO: 000132,document_number
47,360,192,360,192,407,47,407,SAAT: 21:17,date_time
60,435,265,435,265,482,60,482,1 2TB+K.IC+K.PAT,other
307,440,350,440,350,482,307,482,%10,other
447,437,542,437,542,482,447,482,*119,99,other
87,482,242,482,242,527,87,527,2 TAVUKBRGER,other
472,485,540,485,540,527,472,527,*0,00,other
307,487,350,487,350,527,307,527,%10,other
115,530,347,530,347,575,115,575,1 + Peynir Ekle %10,other
460,530,537,530,537,572,460,572,*10,00,other
470,575,535,575,535,617,470,617,*0,00,other
142,577,347,577,347,620,142,620,+ DomatesEkle %10,other
120,618,278,623,277,667,118,662,1 + TursuEkle,other
305,620,347,620,347,662,305,662,%10,other
467,620,532,620,532,660,467,660,*0,00,other
467,662,530,662,530,702,467,702,*0,00,other
120,665,347,665,347,707,120,707,1 + Sogan Ekle %10,other
97,705,245,705,245,747,97,747,1 KUCUKAYRAN,other
465,705,530,705,530,745,465,745,*0,00,other
305,707,347,707,347,745,305,745,%10,other
465,745,527,745,527,787,465,787,*9,00,other
97,747,232,747,232,790,97,790,1 O.PATATES,other
305,750,345,750,345,787,305,787,%10,other
462,787,527,787,527,827,462,827,*0,00,other
100,790,200,790,200,832,100,832,1 KETCAP,other
305,790,345,790,345,827,305,827,%10,other
462,827,525,827,525,867,462,867,*0,00,other
100,832,210,832,210,872,100,872,1 MAYONEZ,other
305,832,345,832,345,870,305,870,%10,other
102,870,245,870,245,912,102,912,1 ISTENMIYOR,other
460,870,522,870,522,910,460,910,*0,00,other
305,872,345,872,345,910,305,910,%10,other
460,910,522,910,522,947,460,947,*0,00,other
102,912,245,912,245,952,102,952,1 ISTENMIYOR,other
302,912,345,912,345,950,302,950,%10,other
447,987,520,987,520,1027,447,1027,*12,64,tax
72,990,210,990,210,1030,72,1030,TOPKDV,other
435,1025,517,1025,517,1065,435,1065,*138,99,total
75,1027,209,1027,209,1070,75,1070,TOPLAM,other
432,1102,517,1102,517,1142,432,1142,*138,99,other
75,1107,137,1107,137,1145,75,1145,NAKİT,other
75,1145,290,1145,290,1182,75,1182,POS:3 RefNum:30122,other
119,1204,487,1200,487,1242,120,1246,Sipariş Numarası:,other
250,1242,342,1242,342,1282,250,1282,3122,other
74,1307,234,1304,235,1346,75,1349,Kasiyer: 25620,other
74,1352,537,1347,537,1379,75,1385,*********************************************************************************************************************************,other
77,1387,504,1385,505,1427,77,1430,Asagidaki web sitesinde anket doldurun.,other
75,1427,375,1427,375,1467,75,1467,King boy secim bedava alin.,other
77,1470,367,1470,367,1507,77,1507,www.burgerkingdeneyimi.com,other
77,1508,334,1505,335,1547,77,1550,Sifre: 2182851100391240,other
77,1550,245,1550,245,1590,77,1590,Doğrulama Kodu:,other
77,1589,477,1589,477,1629,77,1629,Sifre ve dogrulama kodu alindigindan,other
79,1631,504,1626,505,1668,80,1673,itibaren 15 gun icinde kullanilmalidir,other
77,1672,477,1668,477,1709,77,1712,Sartlar ve icerik web sayfasindadir.,other
79,1717,544,1709,545,1739,80,1747,*********************************************************************************************************************************,other
80,1770,272,1770,272,1807,80,1807,KASİYER:KASİYER 2,other
417,1820,542,1820,542,1852,417,1852,EKÜ NO:0003,other
84,1826,209,1823,210,1857,85,1860,Z NO:001532,other
209,1879,389,1879,389,1904,209,1904,NF 3E 20040058,other
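Each line in the sample above carries eight corner coordinates, the text, and a trailing class label, while the text itself can contain commas (for example "*119,99"). A loader that treats everything after the eighth comma as text would keep the label inside the text. The following is a minimal sketch of a parser for this layout; the function name parse_box_line is hypothetical and is not part of the project's code.

```python
from typing import List, Tuple


def parse_box_line(line: str) -> Tuple[List[int], str, str]:
    """Split one box-file line into (coordinates, text, label).

    Assumes the layout shown in the sample: 8 integers, text, label.
    """
    # The first 8 comma-separated fields are the quadrilateral corners.
    parts = line.rstrip("\n").split(",", 8)
    coords = [int(value) for value in parts[:8]]
    # The remainder is "<text>,<label>". The text may itself contain
    # commas, so split only on the last comma.
    text, label = parts[8].rsplit(",", 1)
    return coords, text, label


coords, text, label = parse_box_line(
    "447,437,542,437,542,482,447,482,*119,99,other"
)
print(coords, text, label)
```

Splitting from the right keeps amounts such as "*119,99" intact, which a plain split on every comma would not.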

A sample key file is as follows:

{
    "company": "BURGER KING",
    "address": "MEVLANA MH.Ç.MEHMET CD. NO:33/A MARMARAPARK AVM. 3F02 ESENLER/İST. TİC. SİC. NO:300241",
    "document_number": "FİŞ NO: 000132",
    "date_time": "13/05/2024 SAAT: 21:17",
    "total": "*138,99",
    "tax": "*12,64"
}

Given the data above, how should I modify the following code?

I also do not want to use regex to fix the data pattern. I would like to match on the raw text only.

    total_float = re.search(r"([-+]?[0-9]*\.?[0-9]+)", key_info["total"])
    for index, row in gt_dataframe.iterrows():
        # default value
        gt_dataframe.loc[index, "pos_neg"] = 2

        # retrieve 'company' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[0].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 1
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'address' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[2].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 3
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'date' in gt_dataframe
        tab_date = re.findall(
            date_regex,
            row["text"],
        )
        for date in tab_date:
            if date[0] == key_info["date"]:
                gt_dataframe.loc[index, "data_class"] = 2
                gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'total' in gt_dataframe
        tab_floats = re.findall(r"([-+]?[0-9]*\.?[0-9]+)", row["text"])
        if total_float:
            for float_ in tab_floats:
                if float(total_float.group(0)) == float(float_):
                    gt_dataframe.loc[index, "data_class"] = 4
                    gt_dataframe.loc[index, "pos_neg"] = 1

    return gt_dataframe, image_shape

@kerberosargos
Author

Hello again. My parameters, changed to match the expanded SROIE_CLASS_LIST, are as follows:

TAG_TO_IDX: {'O': 0, 'B-company': 1, 'B-address': 2, 'B-document_number': 3, 'B-date_time': 4, 'B-total': 5, 'B-tax': 6}

TAG_TO_IDX_BIO: {'O': 0, 'B-company': 1, 'I-company': 2, 'B-address': 3, 'I-address': 4, 'B-document_number': 5, 'I-document_number': 6, 'B-date_time': 7, 'I-date_time': 8, 'B-total': 9, 'I-total': 10, 'B-tax': 11, 'I-tax': 12}
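Both mappings above follow a regular pattern, so they can be derived from the class list instead of being typed by hand. This is a minimal sketch under that assumption; build_tag_maps is a hypothetical helper, not a function from the project.

```python
from typing import Dict, List, Tuple


def build_tag_maps(class_list: List[str]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Derive the plain and BIO tag maps from a class list.

    Assumes index 0 of class_list is the background class ("others").
    """
    tag_to_idx = {"O": 0}
    tag_to_idx_bio = {"O": 0}
    for i, name in enumerate(class_list[1:], start=1):
        tag_to_idx[f"B-{name}"] = i
        # Each field takes two consecutive BIO slots: B- then I-.
        tag_to_idx_bio[f"B-{name}"] = 2 * i - 1
        tag_to_idx_bio[f"I-{name}"] = 2 * i
    return tag_to_idx, tag_to_idx_bio


SROIE_CLASS_LIST = [
    "others", "company", "address", "document_number",
    "date_time", "total", "tax",
]
TAG_TO_IDX, TAG_TO_IDX_BIO = build_tag_maps(SROIE_CLASS_LIST)
print(TAG_TO_IDX)
print(TAG_TO_IDX_BIO)
```

Generating the maps keeps them in step with the class list if another field is added later.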

After that, I changed the processing code as follows:

    # total_float = re.search(r"([-+]?[0-9]*\.?[0-9]+)", key_info["total"])
    for index, row in gt_dataframe.iterrows():
        # default value
        gt_dataframe.loc[index, "pos_neg"] = 2

        # retrieve 'company' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[0].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 1
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'address' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[1].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 2
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'document_number' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[2].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 3
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'date_time' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[3].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 4
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'total' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[4].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 5
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'tax' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[5].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 6
            gt_dataframe.loc[index, "pos_neg"] = 1



        # # retrieve 'date' in gt_dataframe
        # tab_date = re.findall(
        #     date_regex,
        #     row["text"],
        # )
        # for date in tab_date:
        #     if date[0] == key_info["date"]:
        #         gt_dataframe.loc[index, "data_class"] = 2
        #         gt_dataframe.loc[index, "pos_neg"] = 1

        # # retrieve 'total' in gt_dataframe
        # tab_floats = re.findall(r"([-+]?[0-9]*\.?[0-9]+)", row["text"])
        # if total_float:
        #     for float_ in tab_floats:
        #         if float(total_float.group(0)) == float(float_):
        #             gt_dataframe.loc[index, "data_class"] = 4
        #             gt_dataframe.loc[index, "pos_neg"] = 1

    return gt_dataframe, image_shape

My result, created in the train_processed\ocr_result directory, is as follows:

,left,top,right,bot,text,data_class,pos_neg
0,182,70,261,110,BURGER,2,1
1,275,70,434,110,"KING,company",2,1
2,97,112,138,155,EKUR,0,2
3,148,112,210,155,İNŞAAT,0,2
4,220,112,282,155,SANAYİ,0,2
5,292,112,312,155,VE,0,2
6,323,112,395,155,TİCARET,0,2
7,406,112,509,155,"A.Ş.,other",0,2
8,42,152,114,200,MEVLANA,0,2
9,124,152,237,200,MH.Ç.MEHMET,0,2
10,248,152,279,200,CD.,0,2
11,289,152,361,200,NO:33/A,4,1
12,371,152,567,200,"MARMARAPARK,address",0,2
13,70,194,107,242,AVM.,0,2
14,117,194,154,242,3F02,0,2
15,164,194,277,242,ESENLER/İST.,0,2
16,287,194,324,242,TİC.,0,2
17,334,194,371,242,SİC.,0,2
18,381,194,542,242,"NO:300241,address",3,1
19,95,238,180,291,BOĞAZİÇİ,0,2
20,191,238,276,291,KURUMLAR,0,2
21,287,238,361,291,V.D.330,0,2
22,372,238,404,291,005,0,2
23,414,238,520,291,"3911,other",0,2
24,44,312,177,360,"13/05/2024,date_time",5,1
25,390,315,408,362,FİŞ,4,1
26,414,315,432,362,NO:,4,1
27,438,315,570,362,"000132,document_number",4,1
28,47,360,81,407,SAAT:,5,1
29,88,360,191,407,"21:17,date_time",5,1
30,60,435,69,482,1,0,2
31,78,435,264,482,"2TB+K.IC+K.PAT,other",0,2
32,307,440,350,482,"%10,other",0,2
33,447,437,542,482,"*119,99,other",6,1
34,87,482,95,527,2,0,2
35,104,482,241,527,"TAVUKBRGER,other",0,2
36,472,485,540,527,"*0,00,other",0,2
37,307,487,350,527,"%10,other",0,2
38,115,530,124,575,1,0,2
39,133,530,142,575,+,0,2
40,151,530,206,575,Peynir,0,2
41,215,530,252,575,Ekle,0,2
42,261,530,344,575,"%10,other",0,2
43,460,530,537,572,"*10,00,other",0,2
44,470,575,535,617,"*0,00,other",0,2
45,142,577,150,620,+,0,2
46,159,577,257,620,DomatesEkle,0,2
47,265,577,345,620,"%10,other",0,2
48,120,618,128,667,1,0,2
49,136,618,144,667,+,0,2
50,152,618,275,667,"TursuEkle,other",0,2
51,305,620,347,662,"%10,other",0,2
52,467,620,532,660,"*0,00,other",0,2
53,467,662,530,702,"*0,00,other",0,2
54,120,665,129,707,1,0,2
55,138,665,147,707,+,0,2
56,156,665,203,707,Sogan,0,2
57,212,665,249,707,Ekle,0,2
58,259,665,344,707,"%10,other",0,2
59,97,705,105,747,1,0,2
60,113,705,244,747,"KUCUKAYRAN,other",0,2
61,465,705,530,745,"*0,00,other",0,2
62,305,707,347,745,"%10,other",0,2
63,465,745,527,787,"*9,00,other",0,2
64,97,747,104,790,1,0,2
65,112,747,231,790,"O.PATATES,other",0,2
66,305,750,345,787,"%10,other",0,2
67,462,787,527,827,"*0,00,other",0,2
68,100,790,107,832,1,0,2
69,114,790,199,832,"KETCAP,other",0,2
70,305,790,345,827,"%10,other",0,2
71,462,827,525,867,"*0,00,other",0,2
72,100,832,107,872,1,0,2
73,114,832,209,872,"MAYONEZ,other",0,2
74,305,832,345,870,"%10,other",0,2
75,102,870,109,912,1,0,2
76,117,870,244,912,"ISTENMIYOR,other",0,2
77,460,870,522,910,"*0,00,other",0,2
78,305,872,345,910,"%10,other",0,2
79,460,910,522,947,"*0,00,other",0,2
80,102,912,109,952,1,0,2
81,117,912,244,952,"ISTENMIYOR,other",0,2
82,302,912,345,950,"%10,other",0,2
83,447,987,520,1027,"*12,64,tax",0,2
84,72,990,210,1030,"TOPKDV,other",0,2
85,435,1025,517,1065,"*138,99,total",6,1
86,75,1027,209,1070,"TOPLAM,other",0,2
87,432,1102,517,1142,"*138,99,other",6,1
88,75,1107,137,1145,"NAKİT,other",0,2
89,75,1145,119,1182,POS:3,0,2
90,128,1145,289,1182,"RefNum:30122,other",0,2
91,119,1204,231,1242,Sipariş,0,2
92,247,1204,487,1242,"Numarası:,other",0,2
93,250,1242,342,1282,"3122,other",0,2
94,74,1307,138,1346,Kasiyer:,0,2
95,146,1307,234,1346,"25620,other",0,2
96,74,1352,537,1379,"*********************************************************************************************************************************,other",0,2
97,77,1387,162,1427,Asagidaki,0,2
98,172,1387,200,1427,web,0,2
99,210,1387,295,1427,sitesinde,0,2
100,305,1387,352,1427,anket,0,2
101,362,1387,504,1427,"doldurun.,other",0,2
102,75,1427,111,1467,King,2,1
103,120,1427,147,1467,boy,0,2
104,156,1427,201,1467,secim,0,2
105,210,1427,264,1467,bedava,0,2
106,273,1427,373,1467,"alin.,other",0,2
107,77,1470,367,1507,"www.burgerkingdeneyimi.com,other",0,2
108,77,1508,130,1547,Sifre:,0,2
109,139,1508,334,1547,"2182851100391240,other",0,2
110,77,1550,149,1590,Doğrulama,0,2
111,157,1550,245,1590,"Kodu:,other",0,2
112,77,1589,124,1629,Sifre,0,2
113,134,1589,153,1629,ve,0,2
114,162,1589,247,1629,dogrulama,0,2
115,257,1589,295,1629,kodu,0,2
116,304,1589,475,1629,"alindigindan,other",0,2
117,79,1631,156,1668,itibaren,0,2
118,166,1631,185,1668,15,0,2
119,195,1631,224,1668,gun,0,2
120,233,1631,291,1668,icinde,0,2
121,300,1631,503,1668,"kullanilmalidir,other",0,2
122,77,1672,143,1709,Sartlar,0,2
123,153,1672,172,1709,ve,0,2
124,181,1672,238,1709,icerik,0,2
125,247,1672,275,1709,web,0,2
126,285,1672,475,1709,"sayfasindadir.,other",0,2
127,79,1717,545,1739,"*********************************************************************************************************************************,other",0,2
128,80,1770,205,1807,KASİYER:KASİYER,0,2
129,213,1770,271,1807,"2,other",0,2
130,417,1820,439,1852,EKÜ,0,2
131,446,1820,541,1852,"NO:0003,other",0,2
132,84,1826,91,1857,Z,0,2
133,98,1826,209,1857,"NO:001532,other",0,2
134,209,1879,227,1904,NF,0,2
135,236,1879,254,1904,3E,0,2
136,263,1879,389,1904,"20040058,other",0,2

I think not everything is okay. Am I wrong?

Thank you in advance

@ZeningLin
Owner

Sorry for my delayed response. Are you currently working on a custom dataset or simply expanding the category types of SROIE?

@ZeningLin
Owner

If you are making modifications to the SROIE dataset, one approach is to retrieve the OCR content of the key fields by string similarity. This may return multiple results. To determine the desired one, you can rely on the coordinates. For instance, a field such as "tax" might have an adjacent string that contains the keyword "TAX".

It is worth noting that the accuracy of the matched labels can significantly impact the final performance. If it is feasible, I highly recommend considering manual labeling of the OCR results for better performance.
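The coordinate-based disambiguation described above can be sketched as follows: when several rows match a key value, keep the one whose vertical center is closest to a row containing a keyword. This is only an illustration; the helper names are hypothetical, and the keyword "TOPLAM" and the row values are taken from the sample receipt in this thread.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def vertical_center(row: Row) -> float:
    """Vertical midpoint of a row's bounding box."""
    return (row["top"] + row["bot"]) / 2


def pick_nearest_to_keyword(
    candidates: List[Row], rows: List[Row], keyword: str
) -> Optional[Row]:
    """Return the candidate closest (vertically) to any row containing keyword."""
    anchors = [row for row in rows if keyword in str(row["text"])]
    if not anchors or not candidates:
        return None
    return min(
        candidates,
        key=lambda cand: min(
            abs(vertical_center(cand) - vertical_center(anchor))
            for anchor in anchors
        ),
    )


rows = [
    {"top": 1025, "bot": 1065, "text": "*138,99"},
    {"top": 1027, "bot": 1070, "text": "TOPLAM"},
    {"top": 1102, "bot": 1142, "text": "*138,99"},
    {"top": 1107, "bot": 1145, "text": "NAKİT"},
]
candidates = [row for row in rows if row["text"] == "*138,99"]
print(pick_nearest_to_keyword(candidates, rows, "TOPLAM"))
```

This assumes the keyword sits on the same printed line as the value; a layout where the label is above the amount would need a different distance measure.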

@kerberosargos
Author

Thank you for your interest. I am working on my own dataset, not the original SROIE dataset.

@kerberosargos
Author

kerberosargos commented Jun 4, 2024

Actually, I have built my own dataset on SROIE's dataset structure.

I mean I have an image, a box txt file, and a JSON key txt file. Everything is okay on my side: the bbox coordinates and the OCR result text are correct.

Now I am trying to convert my custom SROIE-structured dataset for your model by using the sroie_data_preprocessing.py file.

Based on this introduction:

  1. How should I modify def ground_truth_extraction( for my expanded SROIE_CLASS_LIST in pipeline/sroie_data_preprocessing.py?

  2. Must I set the same function's split_word parameter to True for a better result?

@ZeningLin
Owner

  1. def ground_truth_extraction provides rules for matching the key fields in SROIE (company, date, address, total). For your custom dataset, you may directly use the similarity matching strategy in my code to retrieve company, address, document_number, and date_time (your modified code for these categories is correct). For fields like total and tax, the best solution is to use a regular expression.
  2. In my experience, a larger granularity (line-level or paragraph-level) may lead to better results, but it varies across datasets. You may try both line-level and word-level annotations to find which works best.
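For the regex route on total and tax, note that the pattern used in the code earlier in this thread, `[-+]?[0-9]*\.?[0-9]+`, accepts only a dot as the decimal separator, so it reads "138" from "*138,99". The following is a minimal sketch of an amount parser that also accepts a comma; AMOUNT_RE and parse_amount are hypothetical names, and the sketch does not handle thousands separators such as "1.234,56".

```python
import re
from typing import Optional

# Optional sign, digits, then an optional "." or "," followed by digits.
AMOUNT_RE = re.compile(r"[-+]?\d+(?:[.,]\d+)?")


def parse_amount(text: str) -> Optional[float]:
    """Return the first number found in text, or None if there is none."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    # Normalize a decimal comma to a dot before converting.
    return float(match.group(0).replace(",", "."))


key_total = parse_amount("*138,99")
for row_text in ["*119,99", "*138,99", "TOPLAM"]:
    print(row_text, parse_amount(row_text) == key_total)
```

Comparing parsed values rather than raw strings lets "*138,99" in the key file match a row even when the surrounding characters differ.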

@kerberosargos
Author

Thank you again. I do not want to use regex for matching. Could that be a problem for getting the best result? And is the following code modification correct?

SROIE_CLASS_LIST = ["others", "company", "address", "document_number", "date_time", "total", "tax"]

TAG_TO_IDX: {'O': 0, 'B-company': 1, 'B-address': 2, 'B-document_number': 3, 'B-date_time': 4, 'B-total': 5, 'B-tax': 6}

TAG_TO_IDX_BIO: {'O': 0, 'B-company': 1, 'I-company': 2, 'B-address': 3, 'I-address': 4, 'B-document_number': 5, 'I-document_number': 6, 'B-date_time': 7, 'I-date_time': 8, 'B-total': 9, 'I-total': 10, 'B-tax': 11, 'I-tax': 12}

In def ground_truth_extraction,

    for index, row in gt_dataframe.iterrows():
        # default value
        gt_dataframe.loc[index, "pos_neg"] = 2

        # retrieve 'company' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[0].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 1
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'address' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[1].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 2
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'document_number' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[2].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 3
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'date_time' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[3].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 4
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'total' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[4].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 5
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'tax' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[5].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 6
            gt_dataframe.loc[index, "pos_neg"] = 1

    return gt_dataframe, image_shape

@ZeningLin
Owner

I think your code can handle the case well. You may set a different cosine_sim_treshold for each category to obtain the optimal result.
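A per-category threshold can be sketched as a loop over the class list with a dictionary of thresholds. The sketch below is an illustration only: it uses a plain word-count cosine similarity in place of the project's count_vectorizer, looks up key values by class name rather than by position, and keeps the best-scoring class per row instead of the last match. The names cosine and label_rows and the threshold values are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def label_rows(
    texts: List[str],
    key_info: Dict[str, str],
    class_list: List[str],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,
) -> List[int]:
    """Assign each text the best class whose similarity exceeds its threshold."""
    labels = []
    for text in texts:
        row_vec = Counter(text.lower().split())
        best_class, best_score = 0, 0.0  # class 0 is the background
        for class_idx, name in enumerate(class_list):
            if class_idx == 0 or name not in key_info:
                continue
            key_vec = Counter(key_info[name].lower().split())
            score = cosine(key_vec, row_vec)
            limit = thresholds.get(name, default_threshold)
            if score > limit and score > best_score:
                best_class, best_score = class_idx, score
        labels.append(best_class)
    return labels


classes = ["others", "company", "address", "document_number",
           "date_time", "total", "tax"]
keys = {"company": "BURGER KING", "total": "*138,99", "tax": "*12,64"}
rows = ["BURGER KING", "King boy secim bedava alin.", "*12,64", "TOPLAM"]
print(label_rows(rows, keys, classes, {"company": 0.8}))
```

With a stricter threshold for company, a row that only shares the word "King" stays in the background class while the exact company line is still matched.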

@kerberosargos
Author

Thank you so much again for your great project and support. I will try it.
