Improve _HTMLWordTruncator. Fix #2863 #2864

ImBearChild · 2021-04-04T14:10:59Z

Use more than one Unicode block in _word_regex, so that the word count behaves properly with CJK, Cyrillic, and additional Latin characters when generating summaries.
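
As an illustration of the idea (a minimal sketch, not the code in this pull request): single-width letters are grouped into runs that count as one word each, while every CJK character counts as a word on its own. The names SBC, DBC, word_regex and count_words, and the choice of CJK ranges, are assumptions made for this example only.

```python
import re

# Assumed single-width ranges: ASCII alphanumerics, extended Latin, Cyrillic.
SBC = r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]"
# Assumed CJK ranges for this sketch only: unified ideographs plus kana.
DBC = r"[\u4E00-\u9FFF]|[\u3040-\u30FF]"

# A word is either a run of SBC characters (allowing inner - and ')
# or exactly one DBC character.
word_regex = re.compile(rf"(?:{SBC})(?:{SBC}|-|')*|{DBC}")


def count_words(text):
    """Count words, treating each CJK character as one word."""
    return len(word_regex.findall(text))


for sample in ["Hello world", "Привет мир", "你好世界"]:
    print(sample, count_words(sample))
```

With a pattern limited to one Latin block, the Cyrillic and CJK samples would not be counted as words at all, which is the behavior the description above sets out to change.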

Pull Request Checklist

Resolves: #2863

Ensured tests pass and (if applicable) updated functional test output
Conformed to code style guidelines by running appropriate linting tools
Added tests for changed code (Not necessary)
Updated documentation for changed code (Not necessary)

avaris · 2021-06-12T02:42:01Z

It would be nice to add a test case for this with non-latin text. Also, I'm a bit puzzled by the shortening of summaries in the current test cases. I don't see any apparent reason for that, but I need to look into it a bit further.

ImBearChild · 2021-06-17T06:57:35Z

Sorry to bother you, but have you found the reason yet? 🤣

avaris · 2021-09-24T19:58:10Z

@ImBearChild, sorry I didn't really have the chance to look at it in detail before. Now that I have looked into it, I think the issue is the SBC block:

SBC=r"[\u0030-\u024f]|[\u0400-\u04FF]",

especially the first part. It is too permissive: it includes punctuation, brackets ([ ] { }), mathematical signs, and a lot of control characters. Those are counted as words in some cases, which changes the test cases. I think we need to divide it a bit more. Based on the code points and without making it way too complicated, I think this would be more appropriate:

SBC=r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]"
#         ASCII  |Extended Latin | Cyrillic

With this, the test cases remain the same.
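
A quick way to see the difference between the two patterns (a small sketch; the variable names permissive and narrowed and the sample characters are chosen here for illustration):

```python
import re

# The original SBC pattern and the narrowed one proposed above.
permissive = re.compile(r"[\u0030-\u024f]|[\u0400-\u04FF]")
narrowed = re.compile(r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]")

letters = ["a", "7", "é", "Ж"]
# Punctuation, brackets, a math sign and a C1 control character,
# all of which fall between U+0030 and U+024F.
non_letters = [":", "[", "}", "=", "\x85"]

for ch in letters + non_letters:
    print(
        repr(ch),
        "permissive:", bool(permissive.fullmatch(ch)),
        "narrowed:", bool(narrowed.fullmatch(ch)),
    )
```

Both patterns accept the letters and digits, but only the permissive one accepts the non-letter characters.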

ImBearChild · 2021-09-25T03:19:33Z

Well, should I close this pull request and open a new one with your improved code? I think I also have to check the DBC part to make sure no punctuation is included. I have no idea how to modify an open pull request.

justinmayer · 2021-09-25T06:20:20Z

Hi NianQing. 👋 Thank you for asking first, instead of just closing and creating a new PR. Many folks do that without asking first, which is a shame because it is totally unnecessary. So thanks again for asking!

Pull requests are designed to be modified, so it is fairly straightforward to make follow-up changes. Any new commits that are pushed to your forked repo's better-word-count branch will appear in this pull request. The flow looks something like this:

Use the same feature branch (better-word-count) to make changes to code, tests, documentation, etc.
Use git add to select and stage the changed files.
Make a new commit via git commit.
Push the new commit to your forked repo's better-word-count branch, via the same manner as you did the first time.

Following are some related links from our documentation that you may find helpful:

I hope this was helpful. How else might I assist you?

ImBearChild · 2021-09-25T08:14:33Z

Thank you for the instructions! Now it's time to check the new code. 😄

ImBearChild · 2021-09-27T16:16:43Z

That's strange. I've run all the tests on my machine and no failures occurred. 🤔

avaris · 2021-09-27T23:25:56Z

There is currently an issue with feedgenerator and tests are likely failing because of that. I don't think your changes are the cause. Speaking of tests, can you add one or two tests for truncate_html_words with some CJK text here? Just to make sure we cover this for any future changes.

ImBearChild · 2021-09-28T08:33:30Z

Oh, I have found where the problem is. I used wrong regular expressions like "([\u20000–\u2A6DF])|" instead of "([\U00020000-\U0002A6DF])|". The wrong ones actually behave like ([\u2000–\u2A6D])|, so extra punctuation is included. I have now fixed this and added some test cases.
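
The escape behavior can be checked directly. In Python's re module, \u consumes exactly four hex digits and \U exactly eight, so a fifth digit after \u becomes a separate literal character. The sketch below uses a plain hyphen in both patterns to isolate that effect; the names wrong and right are just for this example.

```python
import re

# "\u20000" is read as U+2000 followed by a literal "0", and "\u2A6DF" as
# U+2A6D followed by a literal "F". The class is therefore U+2000, the range
# "0" through U+2A6D, and "F".
wrong = re.compile(r"[\u20000-\u2A6DF]")

# "\U" takes eight hex digits and can address code points above U+FFFF,
# here CJK Unified Ideographs Extension B.
right = re.compile(r"[\U00020000-\U0002A6DF]")

ext_b_char = "\U00020000"  # first code point of Extension B

print("wrong matches '?':", bool(wrong.fullmatch("?")))
print("wrong matches Extension B:", bool(wrong.fullmatch(ext_b_char)))
print("right matches '?':", bool(right.fullmatch("?")))
print("right matches Extension B:", bool(right.fullmatch(ext_b_char)))
```

The four-digit form ends up matching ordinary punctuation such as "?" while missing the Extension B characters it was meant to cover.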

avaris · 2021-09-28 (approved these changes)

Looks good. Thank you!

ImBearChild · 2021-09-29T06:48:29Z

Sorry, I've just found another typo in a regular expression: I used – (an en dash) instead of - (a hyphen) in one of them. It's fixed now. Sorry about that...
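
For anyone wondering why this kind of typo matters: inside a character class, only the ASCII hyphen forms a range, while an en dash is just another literal member of the set. A small sketch with a made-up a-to-z class (not the pattern from this pull request):

```python
import re

# En dash (U+2013): the class is the three literal characters a, – and z.
typo = re.compile("[a–z]")
# Hyphen-minus (U+002D): the class is the range a through z.
fixed = re.compile("[a-z]")

for ch in ["a", "m", "z", "–"]:
    print(repr(ch), "typo:", bool(typo.fullmatch(ch)), "fixed:", bool(fixed.fullmatch(ch)))
```

The pattern still compiles without error, so the mistake only shows up as unexpected matches.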

justinmayer · 2021-09-29T10:44:16Z

Merged manually via 22192c1. Many thanks to @ImBearChild for the nice enhancement and to @avaris for reviewing. ✨

justinmayer requested a review from avaris June 7, 2021 16:21

ImBearChild force-pushed the better-word-count branch 2 times, most recently from 45aacfd to 963c552 on September 25, 2021 08:10

ImBearChild force-pushed the better-word-count branch 3 times, most recently from 21ef383 to 3af8382 on September 25, 2021 08:21

ImBearChild force-pushed the better-word-count branch from 3af8382 to 2417f85 on September 28, 2021 07:59

avaris approved these changes Sep 28, 2021

Improve _HTMLWordTruncator. Fix getpelican#2863

73aca20

ImBearChild force-pushed the better-word-count branch from 2417f85 to 73aca20 on September 29, 2021 06:46

justinmayer closed this Sep 29, 2021

justinmayer added the manual merge label Oct 4, 2021
