Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve _HTMLWordTruncator. Fix #2863 #2864

Closed
wants to merge 1 commit into from

Conversation

ImBearChild
Copy link
Contributor

@ImBearChild ImBearChild commented Apr 4, 2021

Use more than one unicode block in _word_regex, making word count function behave properly with CJK, cyrillic and more latin characters when generating summary.

Pull Request Checklist

Resolves: #2863

  • Ensured tests pass and (if applicable) updated functional test output
  • Conformed to code style guidelines by running appropriate linting tools
  • Added tests for changed code (Not necessary)
  • Updated documentation for changed code (Not necessary)

@justinmayer justinmayer requested a review from avaris June 7, 2021 16:21
@avaris
Copy link
Member

avaris commented Jun 12, 2021

It would be nice to add a test case for this with non-latin text. Also, I'm a bit puzzled by the shortening of summaries in the current test cases. I don't see any apparent reason for that, but I need to look into it a bit further.

@ImBearChild
Copy link
Contributor Author

It would be nice to add a test case for this with non-latin text. Also, I'm a bit puzzled by the shortening of summaries in the current test cases. I don't see any apparent reason for that, but I need to look into it a bit further.

Sorry to bother you, but have you found the reason now? 🤣

@avaris
Copy link
Member

avaris commented Sep 24, 2021

@ImBearChild, sorry I didn't really have the chance to look at it in detail before. Now that I have looked into it, I think the issue is the SBC block:

SBC=r"[\u0030-\u024f]|[\u0400-\u04FF]",

especially the first part. It is too permissive. It includes punctuations, braces ([ ( { } ) ]), mathematical signs and as well as a lot of control characters. Those are counted as words in some cases and changes the test cases. I think we need to divide it a bit more. Based on the codepoints and without making it way too complicated, I think this would be more appropriate:

SBC=r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]"
#         ASCII  |Extended Latin | Cyrillic

with this test cases remain same.

@ImBearChild
Copy link
Contributor Author

ImBearChild commented Sep 25, 2021

@ImBearChild, sorry I didn't really have the chance to look at it in detail before. Now that I have looked into it, I think the issue is the SBC block:

SBC=r"[\u0030-\u024f]|[\u0400-\u04FF]",

especially the first part. It is too permissive. It includes punctuations, braces ([ ( { } ) ]), mathematical signs and as well as a lot of control characters. Those are counted as words in some cases and changes the test cases. I think we need to divide it a bit more. Based on the codepoints and without making it way too complicated, I think this would be more appropriate:

SBC=r"[0-9a-zA-Z]|[\u00C0-\u024f]|[\u0400-\u04FF]"
#         ASCII  |Extended Latin | Cyrillic

with this test cases remain same.

Well, should I close this pull request and open a new one with your improved code? And I think I have to check DBC
part to make sure there's no punctuation included. I have no idea of how to modify a open pull request.

@justinmayer
Copy link
Member

Hi NianQing. 👋  Thank you for asking first, instead of just closing and creating a new PR. Many folks do that without asking first, which is a shame because it is totally unnecessary. So thanks again for asking!

Pull requests are designed to be modified, so it is fairly straightforward to make follow-up changes. Any new commits that are pushed to your forked repo's better-word-count branch will appear in this pull request. The flow looks something like this:

  1. Use the same feature branch (better-word-count) to make changes to code, tests, documentation, etc.
  2. Use git add to select and stage the changed files.
  3. Make a new commit via git commit.
  4. Push the new commit to your forked repo's better-word-count branch, via the same manner as you did the first time.

Following are some related links from our documentation that you may find helpful:

I hope this was helpful. How else might I assist you?

@ImBearChild ImBearChild force-pushed the better-word-count branch 2 times, most recently from 45aacfd to 963c552 Compare September 25, 2021 08:10
@ImBearChild
Copy link
Contributor Author

Thank you for your instruction! And now it's time to check these new code.😄

@ImBearChild ImBearChild force-pushed the better-word-count branch 3 times, most recently from 21ef383 to 3af8382 Compare September 25, 2021 08:21
@ImBearChild
Copy link
Contributor Author

That's strange, I've run all the tests on my machine and no failure happen. 🤔

@avaris
Copy link
Member

avaris commented Sep 27, 2021

There is currently an issue with feedgenerator and tests are likely failing because of that. I don't think your changes are the cause. Speaking of tests, can you add one or two tests for truncate_html_words with some CJK text here? Just to make sure we cover this for any future changes.

@ImBearChild
Copy link
Contributor Author

There is currently an issue with feedgenerator and tests are likely failing because of that.

Oh, I have found where the problem is. I used wrong regular expressions like"([\u20000–\u2A6DF])|" instead of "([\U00020000-\U0002A6DF])|". The wrong ones will actually behave like ([\u2000–\u2A6D])|, so extra punctuation is included. Now I fixed this and added some test case.

Copy link
Member

@avaris avaris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you!

Use more than one unicode block in _word_regex, making word count
function behave properly with CJK, cyrillic and more latin characters
when generating summary.
@ImBearChild
Copy link
Contributor Author

Looks good. Thank you

Sorry, I've just found another typo in regular expression. I used instead of - in one regular expression. And now it's fixed. Sorry for that...

@justinmayer
Copy link
Member

Merged manually via 22192c1. Many thanks to @ImBearChild for the nice enhancement and to @avaris for reviewing. ✨

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Setting SUMMARY_MAX_LENGTH doesn't apply to CJK content
3 participants