I need help about customize entities of SROIE dataset #26

Open
kerberosargos opened this issue May 30, 2024 · 6 comments
Comments

@kerberosargos

Hello, and thank you in advance for your support.

I would like to extend the SROIE entity set using my own dataset. Is that possible? For example, I would like to change this array:

SROIE_CLASS_LIST = ["others", "company", "date", "address", "total"]

to something like:

SROIE_CLASS_LIST = ["others", "company", "date", "time", "address", "total", "tax", "sub_total"]

and so on.

kerberosargos changed the title from "I need help about SROIE dataset" to "I need help about customize entities of SROIE dataset" on May 30, 2024
@ZeningLin
Owner

Yes, it is possible. The main changes are the number of categories and the corresponding mappings. Update SROIE_CLASS_LIST, TAG_TO_IDX, and TAG_TO_IDX_BIO in train_SROIE.py and eval_SROIE.py to match your custom entity types, then change the num_classes term in the YAML config file. You may also need to adjust the postprocessing rules in eval_SROIE.py accordingly.

@kerberosargos
Author

Thank you very much for the quick answer. However, I did not understand how to modify the B- and I- tags. Could you modify the mappings below for me, based on my expanded example?

SROIE_CLASS_LIST = ["others", "company", "date", "address", "total"]

TAG_TO_IDX = {
    "O": 0,
    "B-company": 1,
    "B-date": 2,
    "B-address": 3,
    "B-total": 4,
}

TAG_TO_IDX_BIO = {
    "O": 0,
    "B-company": 1,
    "I-company": 2,
    "B-date": 3,
    "I-date": 4,
    "B-address": 5,
    "I-address": 6,
    "B-total": 7,
    "I-total": 8,
}

@kerberosargos
Author

And one more question.

Do I have to provide entity files like the following to train on SROIE's entities?

{
    "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
    "date": "25/12/2018",
    "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
    "total": "9.00"
} 

Or can I use only the box and script files, without the entity files?

1,83,41,331,41,331,78,83,78,TAN WOON YANN,other
1,109,171,330,171,330,191,109,191,MR D.I.Y. (M) SDN BHD,company
1,122,190,325,190,325,213,122,213,(CO. RFG : 860671-D),other
1,47,208,391,208,391,233,47,233,LOT 1851-A & 1851-B, JALAN KPB 6,,address
1,62,235,381,235,381,254,62,254,KAWASAN PERINDUSTRIAN BALAKONG,,address
1,70,256,384,256,384,275,70,275,43300 SERI KEMBANGAN, SELANGOR,address
1,125,275,318,275,318,297,125,297,(TESCO PUTRA NILAI),other
1,177,295,266,295,266,317,177,317,-INVOICE-,other
1,12,337,402,337,402,362,12,362,KILAT AUTO ECO WASH & SHINE ES1000 1L,other
1,20,360,160,360,160,383,20,383,WA45 /2A - 12,other
1,16,382,156,382,156,402,16,402,9555916500133,other

@ZeningLin
Owner

ZeningLin commented May 30, 2024

For example, if your entity types are [others, type1, type2, type3], the corresponding IDX maps should be:

TAG_TO_IDX = {
    "O": 0,    # Remember to keep the background type (others, or O tag) as the first term
    "B-type1": 1,
    "B-type2": 2,
    "B-type3": 3,
}

TAG_TO_IDX_BIO = {
    "O": 0,   # Remember to keep the background type (others, or O tag) as the first term
    "B-type1": 1,
    "I-type1": 2,
    "B-type2": 3,
    "I-type2": 4,
    "B-type3": 5,
    "I-type3": 6,
}

You can also use the following code to generate the corresponding mappings:

SROIE_CLASS_LIST = ["others", "company", "date", "time", "address", "total", "tax", "sub_total"]

TAG_TO_IDX_ = ["O"]
TAG_TO_IDX_BIO_ = ["O"]
for cls_type in SROIE_CLASS_LIST[1:]:
    TAG_TO_IDX_.append(f"B-{cls_type}")
    TAG_TO_IDX_BIO_.append(f"B-{cls_type}")
    TAG_TO_IDX_BIO_.append(f"I-{cls_type}")

TAG_TO_IDX = {s: i for i, s in enumerate(TAG_TO_IDX_)}
TAG_TO_IDX_BIO = {s: i for i, s in enumerate(TAG_TO_IDX_BIO_)}
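
As a quick sanity check, the same construction can be wrapped in a function and verified against the expected sizes. This is only a sketch: `build_tag_maps` is a hypothetical helper name, not something taken from the repository.

```python
def build_tag_maps(class_list):
    """Build (TAG_TO_IDX, TAG_TO_IDX_BIO) from a class list whose first
    entry is the background class. Hypothetical helper, not repo code."""
    tags = ["O"]
    bio_tags = ["O"]
    for cls_type in class_list[1:]:
        tags.append(f"B-{cls_type}")
        bio_tags.append(f"B-{cls_type}")
        bio_tags.append(f"I-{cls_type}")
    tag_to_idx = {t: i for i, t in enumerate(tags)}
    tag_to_idx_bio = {t: i for i, t in enumerate(bio_tags)}
    return tag_to_idx, tag_to_idx_bio


SROIE_CLASS_LIST = ["others", "company", "date", "time", "address", "total", "tax", "sub_total"]
TAG_TO_IDX, TAG_TO_IDX_BIO = build_tag_maps(SROIE_CLASS_LIST)

# One tag per class in TAG_TO_IDX; "O" plus a B-/I- pair per entity class in TAG_TO_IDX_BIO.
assert len(TAG_TO_IDX) == len(SROIE_CLASS_LIST)
assert len(TAG_TO_IDX_BIO) == 2 * len(SROIE_CLASS_LIST) - 1
```

With eight classes this gives 8 entries in TAG_TO_IDX and 15 in TAG_TO_IDX_BIO. Which of these sizes num_classes in the YAML config should match depends on how the training script uses it, so check that before changing the value.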

@ZeningLin
Owner

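For the training phase, only the latter format is required, that is, the annotation file whose lines contain the box coordinates, the text, and a class label. The code parses these annotations directly and generates the corresponding BIO tags.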
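
To illustrate what such parsing could look like, here is a rough sketch. It is not the repository's actual parser: `parse_annotation_line` and `segment_to_bio` are hypothetical names, and the whitespace tokenization, the B-/I- assignment per line, and the treatment of both "other" and "others" as background are all assumptions. Because the text itself may contain commas, the sketch takes the first nine fields as the index and box coordinates and the last field as the label.

```python
def parse_annotation_line(line):
    """Split 'idx,x1,y1,x2,y2,x3,y3,x4,y4,text,label' into its parts.
    The text may contain commas, so the line is split from both ends."""
    head, label = line.rstrip("\n").rsplit(",", 1)
    fields = head.split(",", 9)  # 9 numeric fields, then the remaining text
    index = int(fields[0])
    box = [int(v) for v in fields[1:9]]
    text = fields[9]
    return index, box, text, label.strip()


def segment_to_bio(text, label, background=("other", "others")):
    """Tag whitespace tokens of one segment: first token B-, the rest I-
    (assumed scheme); background segments get "O" for every token."""
    tokens = text.split()
    if label in background:
        return [(tok, "O") for tok in tokens]
    return [(tok, ("B-" if i == 0 else "I-") + label) for i, tok in enumerate(tokens)]


line = "1,47,208,391,208,391,233,47,233,LOT 1851-A & 1851-B, JALAN KPB 6,,address"
index, box, text, label = parse_annotation_line(line)
tagged = segment_to_bio(text, label)
```

Each tag string produced this way could then be looked up in TAG_TO_IDX_BIO to get the integer label used for training.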

@kerberosargos
Author

I will try it. Thank you very much for your support and effort. Have a nice day.
