The tokenize() function shouldn't split sentences on abbreviations like Dr. Fahad, Mr. Wayne etc #1

burhanharoon · 2022-03-21T16:54:42Z

Right now the tokenize() function splits whenever a '.' character is found. Most of the time that is the correct way to split text into sentences, but sometimes an abbreviation like Dr., Mr., Mrs., etc. appears in the middle of a sentence, and the sentence gets split right there. I want to enhance the regex so that it does not split sentences on abbreviations.
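One possible approach, as a minimal sketch rather than the project's actual tokenize() implementation: split on whitespace that follows sentence-ending punctuation, but use negative lookbehinds to skip periods that belong to a known abbreviation. The function name `split_sentences` and the `ABBREVIATIONS` list are assumptions made for illustration, and Python is used here only as an example language.

```python
import re

# Hypothetical abbreviation list (an assumption); extend it as needed.
ABBREVIATIONS = ("Dr", "Mr", "Mrs", "Ms", "Prof")

# Python's re module requires fixed-width lookbehinds, so build one
# negative lookbehind per abbreviation instead of a single alternation.
_NOT_ABBREVIATION = "".join(
    rf"(?<!\b{re.escape(abbr)}\.)" for abbr in ABBREVIATIONS
)

# A boundary is whitespace preceded by ., ! or ?, unless that period
# terminates one of the abbreviations above.
_SENTENCE_BOUNDARY = re.compile(rf"(?<=[.!?]){_NOT_ABBREVIATION}\s+")


def split_sentences(text):
    """Split text into sentences without breaking on listed abbreviations."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Fahad met Mr. Wayne. They talked."):
        print(sentence)
```

A known limitation of this sketch: a sentence that genuinely ends with one of the listed abbreviations will not be split at that point, so the list should stay limited to titles that are normally followed by a name.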

DaudAhmad0303 · 2022-03-29T05:41:22Z

Please assign this issue to me

burhanharoon · 2022-10-18T13:55:39Z

@DaudAhmad0303 Do you still want to work on it?

burhanharoon added the enhancement, help wanted, and good first issue labels Mar 21, 2022

burhanharoon added the hacktoberfest label Oct 18, 2022
