
Use "regex" library from PyPI to build GLOBAL_SENTENCE_TERMINATORS regex #4

Open
wants to merge 1 commit into master

Conversation

divec commented Oct 22, 2023

Fixes #3

divec (Author) commented Oct 22, 2023

This is analogous to sentencex-js/pull/3 — both remove the terminators codepoint list module, and instead use an external library that tracks the Unicode Character Database.

santhoshtr (Member)

Hi @divec,
Sorry for the delay in review. Holidays.

All the tests pass with this pull request, but I noticed a performance regression. Running accuracy.py in the benchmarks folder of this repo gives the following measurements:

  • Using the re package and terminators.py (master version): 0.75 seconds (average over 100 runs)
  • This pull request, which uses the regex package: 1.83 seconds (average over 100 runs)

So that is more than twice as slow as the master version. Since the original issue is only with the sentence terminator regex and not with other regex usage, I tried applying a minimal patch:

diff --git a/sentencex/base.py b/sentencex/base.py
index a2218e6..e2c0bde 100644
--- a/sentencex/base.py
+++ b/sentencex/base.py
@@ -1,4 +1,5 @@
 import re
+import regex
 from typing import Dict, Iterator, Tuple
 
 from .terminators import GLOBAL_SENTENCE_TERMINATORS
@@ -44,7 +45,7 @@ class Language(object, metaclass=Languages):
 
     language = "base"
     abbreviations: set = set()
-    GLOBAL_SENTENCE_BOUNDARY_REGEX = re.compile(r"[%s]+" % "".join(GLOBAL_SENTENCE_TERMINATORS))
+    GLOBAL_SENTENCE_BOUNDARY_REGEX = regex.compile(r"\p{Sentence_Terminal}+")
     quotes_regx_str = r"|".join([f"{left}(\n|.)*?{right}" for left, right in quote_pairs.items()])
     quotes_regex = re.compile(r"%s+" % quotes_regx_str)
     parens_regex = re.compile(r"([\((<{\[])(?:\\\1|.)*?[\)\]})]")

With that I got 1.37 seconds, which is better, but still a performance regression.
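For anyone reproducing the numbers, a rough standalone micro-benchmark along these lines (illustrative only: the sample text, the tiny stand-in character class and the run count are arbitrary, and this is not the benchmarks/accuracy.py script) isolates the cost of the two compiled patterns:

```python
# Illustrative micro-benchmark only (NOT benchmarks/accuracy.py): it times the
# explicit-character-class style used on master against the
# \p{Sentence_Terminal} property class from the PyPI "regex" package.
import re
import timeit

import regex

explicit = re.compile("[.!?\u3002\u0964\u061F]+")          # small stand-in for the full list
property_based = regex.compile(r"\p{Sentence_Terminal}+")  # what the patch above uses

text = "Dr. Smith went to Washington. It rained! Really? Yes." * 200

for name, pattern in (("re, explicit class", explicit),
                      ("regex, \\p{Sentence_Terminal}", property_based)):
    elapsed = timeit.timeit(lambda: pattern.findall(text), number=1_000)
    print(f"{name}: {elapsed:.3f} s for 1000 runs")
```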

Maintaining an in-library collection of these characters helped us build a JavaScript counterpart, so keeping this library dependency-free was an intentional decision. What if we fix the Amharic issue and the comment issue you pointed out, and keep the zero-dependency approach as it is?

santhoshtr (Member)

The terminators collection was originally sourced from the wikinlp tools repo. I re-ran the unichars program and I don't see these Amharic/Ethiopic characters in it. That must be a discrepancy in the list.

https://gist.github.com/santhoshtr/af560fde46b7ee56ed14fd3f6614de01

However, sometimes we need additional characters. For example, horizontal ellipsis is not in the list.

divec (Author) commented Nov 1, 2023

Thanks @santhoshtr. If we maintain the list of terminators within sentencex, then we probably also need to maintain the list of close punctuation and spaces in order to expose boundary information:

Value   Represents
SATerm  (STerm | ATerm)
STerm   Sentence_Terminal = Yes
ATerm   U+002E ( . ) FULL STOP
        U+2024 ( ․ ) ONE DOT LEADER
        U+FE52 ( ﹒ ) SMALL FULL STOP
        U+FF0E ( ． ) FULLWIDTH FULL STOP
Close   General_Category = Open_Punctuation, or
        General_Category = Close_Punctuation, or
        Line_Break = Quotation,
        and not U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH,
        and ATerm = No, and STerm = No
Sp      U+0085 NEXT LINE (NEL)
        U+2028 LINE SEPARATOR
        U+2029 PARAGRAPH SEPARATOR

The master copy of these lists is available at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/SentenceBreakProperty.txt, which we pull into UnicodeJS using an extraction script that might be useful to borrow from, if you decide that maintaining these lists within sentencex is worthwhile for the performance boost.
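For concreteness, here is a minimal sketch of such an extraction (a hypothetical standalone helper, not the actual UnicodeJS script) that collects every codepoint carrying a given Sentence_Break value, e.g. STerm, ATerm, Close or Sp, from SentenceBreakProperty.txt:

```python
# Hypothetical helper sketch: regenerate a codepoint list for one Sentence_Break
# property value straight from the UCD data file, so in-library lists could be
# kept in sync without adding a runtime dependency.
import urllib.request
from typing import List

UCD_URL = "https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/SentenceBreakProperty.txt"


def codepoints_with_value(text: str, value: str) -> List[int]:
    """Collect all codepoints whose Sentence_Break value equals `value`."""
    points: List[int] = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        cp_field, prop = (part.strip() for part in line.split(";", 1))
        if prop != value:
            continue
        if ".." in cp_field:  # a range like "0700..0702"
            start, end = (int(cp, 16) for cp in cp_field.split(".."))
            points.extend(range(start, end + 1))
        else:
            points.append(int(cp_field, 16))
    return points


if __name__ == "__main__":
    data = urllib.request.urlopen(UCD_URL).read().decode("utf-8")
    sterm = codepoints_with_value(data, "STerm")
    print(", ".join(f"U+{cp:04X}" for cp in sterm))
```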

divec (Author) commented Nov 2, 2023

> The terminators collection was originally sourced from the wikinlp tools repo. I re-ran the unichars program and I don't see these Amharic/Ethiopic characters in it. That must be a discrepancy in the list.
>
> https://gist.github.com/santhoshtr/af560fde46b7ee56ed14fd3f6614de01
>
> However, sometimes we need additional characters. For example, horizontal ellipsis is not in the list.

Ah ok, understood. I wonder if the wikinlp version got partially corrupted due to encoding errors? The new version in the commit you added looks a lot less strange!

I note the new code in the commit adds U+3002 IDEOGRAPHIC FULL STOP manually; a related character that seems to be missing is U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP. I wonder why these are missing from the Perl script output. Maybe it omits ideographic characters to save memory or something?

U+2026 HORIZONTAL ELLIPSIS is not actually listed as STerm in the Unicode Character Database. Did we make a heuristic decision to treat it as a sentence terminator nevertheless? If we really want to include it, then the manual entry should actually use "\u2026" in the string (right now the string is equal to "\u002E\u002E\u002E" i.e. three Basic Latin full stops).
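A quick illustrative check of both points, assuming the regex package from this pull request is installed:

```python
# Quick check, for illustration only: the manual "..." entry is three ASCII
# full stops rather than U+2026, and U+2026 does not carry the
# Sentence_Terminal property in the UCD.
import regex

assert "\u002E\u002E\u002E" == "..."   # three Basic Latin FULL STOPs
assert "\u2026" != "..."               # a single HORIZONTAL ELLIPSIS character

print(regex.match(r"\p{Sentence_Terminal}", "."))       # matches
print(regex.match(r"\p{Sentence_Terminal}", "\u2026"))  # None: not an STerm character
```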

Successfully merging this pull request may close these issues.

terminators.py contains codepoints which are not sentence terminators