
adding an Arabic vocab file #514

Merged: 5 commits, Oct 4, 2021

Conversation

@mzeidhassan (Contributor) commented Sep 29, 2021

I have added the Arabic alphabet plus some Farsi letters, since they may appear in many Arabic texts. I also added "Hindi" (Arabic-Indic) numerals and Arabic diacritics. Arabic uses connected (cursive) shaping, so characters are not isolated as in Latin-based scripts. I hope this won't be an issue. Please let me know if you have any questions.

@charlesmindee charlesmindee self-assigned this Sep 30, 2021
@charlesmindee charlesmindee added type: enhancement Improvement module: datasets Related to doctr.datasets topic: text recognition Related to the task of text recognition labels Sep 30, 2021
@charlesmindee (Collaborator) left a comment

Hi, thanks for the PR!

Could you please move this to doctr/datasets/vocabs.py?

And as you mentioned, you are aggregating Hindi, Arabic, and Farsi characters if I understood correctly. Would it be possible to split those entries in the dictionary right there (below ancient Greek)?

import string
from typing import Dict

VOCABS: Dict[str, str] = {
    'digits': string.digits,
    'ascii_letters': string.ascii_letters,
    'punctuation': string.punctuation,
    'currency': '£€¥¢฿',
    'ancient_greek': 'αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ',
}

For instance: 'hindi_letters', 'arabic_diacritics', etc. And try to be as exhaustive as possible for each sub-class. Then you can add your full vocab below the dictionary (you can follow the example of the European vocabs):

VOCABS['arabic'] = VOCABS['hindi_letters'] + VOCABS['farsi'] + VOCABS['arabic_diacritics'] + ...
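A minimal sketch of what the split could look like. The key names and the exact character sets below are assumptions for illustration; the entries actually merged into doctr/datasets/vocabs.py may differ.

```python
import string

# Hypothetical sub-vocab entries (names and character sets are illustrative):
VOCABS = {
    'digits': string.digits,
    # Arabic letters, including hamza forms and ta marbuta
    'arabic_letters': 'ءآأؤإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىي',
    # Tanwin, short vowels, shadda, sukun (combining marks)
    'arabic_diacritics': 'ًٌٍَُِّْ',
    # Extra letters used in Farsi and in Farsi loanwords
    'persian_letters': 'پچژگ',
    # Arabic-Indic ("Hindi") digits
    'hindi_numerals': '٠١٢٣٤٥٦٧٨٩',
}

# Full vocab assembled from the sub-entries, mirroring the pattern above:
VOCABS['arabic'] = (
    VOCABS['digits'] + VOCABS['hindi_numerals'] + VOCABS['arabic_letters']
    + VOCABS['arabic_diacritics'] + VOCABS['persian_letters']
)
```

Keeping each script and character class in its own entry lets other vocabs (e.g. a Farsi-only one) reuse the pieces without duplicating characters.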

Thanks for that! 🙏

@mzeidhassan (Author) commented Sep 30, 2021

Thanks @charlesmindee for your help and guidance, I appreciate it. I am new to pushing PRs, so I hope I am doing it right.

Please find the changes I made here: 3faad13

Instead of creating a new item "arabic_numbers", I am using "VOCABS['digits']".

Please let me know if there is anything else that I need to do at my end.

Thanks,
Mohamed

@fg-mindee (Contributor) left a comment

Thanks for the edits! Just a small modification and it looks like we're good to merge!

arabic_vocab: review comment (resolved)
@charlesmindee (Collaborator) left a comment

Thanks for the edit! You only need to split the last line in two to pass the flake8 check (code style enforcement; each line must be under 120 characters) and we are good to merge (all the other tests pass)!
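The split flake8 asks for can be done with parenthesized line continuation. A quick sketch (the dict contents here are truncated stand-ins, not the real vocab):

```python
# Toy stand-in for the real vocab dict; contents are illustrative only.
VOCABS = {
    'digits': '0123456789',
    'arabic_letters': 'ءاب',      # truncated for brevity
    'arabic_diacritics': 'ًٌ',     # truncated for brevity
    'persian_letters': 'پچژگ',
}

# One long concatenation can exceed flake8's 120-character limit;
# wrapping the expression in parentheses lets it span multiple lines:
VOCABS['arabic'] = (
    VOCABS['digits'] + VOCABS['arabic_letters']
    + VOCABS['arabic_diacritics'] + VOCABS['persian_letters']
)
```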

@charlesmindee

closes #490


codecov bot commented Oct 1, 2021

Codecov Report

Merging #514 (9fe9594) into main (14b376e) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main     #514   +/-   ##
=======================================
  Coverage   95.38%   95.38%           
=======================================
  Files         109      109           
  Lines        4183     4184    +1     
=======================================
+ Hits         3990     3991    +1     
  Misses        193      193           
Flag        Coverage Δ
unittests   95.38% <100.00%> (+<0.01%) ⬆️


Impacted Files                                  Coverage Δ
doctr/datasets/vocabs.py                        100.00% <100.00%> (ø)
doctr/models/recognition/master/tensorflow.py   96.47% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 14b376e...9fe9594.

@fg-mindee (Contributor) left a comment

Thanks again, I added some suggestions to fix flake8 and improve naming 👌

doctr/datasets/vocabs.py: 4 review comments (resolved)
@charlesmindee (Collaborator) left a comment

Thanks for the updates!

@charlesmindee charlesmindee requested review from fg-mindee and removed request for fg-mindee October 4, 2021 08:20
@fg-mindee (Contributor) left a comment

Thanks for the edits, looks good to me!

@fg-mindee fg-mindee merged commit 40678ac into mindee:main Oct 4, 2021