Preprocess text: first word in custom stopwords list is ignored #1028

Closed
wvdvegte opened this issue Dec 6, 2023 · 9 comments · Fixed by #1072

@wvdvegte

wvdvegte commented Dec 6, 2023

Describe the bug
In custom .txt (UTF-8) stopwords files, the first word is ignored as a stopword by Preprocess Text, i.e., it is not filtered out.

To Reproduce
Create a custom stopwords .txt file in UTF-8 encoding (in my case, I used MS Word), consisting of words separated by returns, and load it in Preprocess text. The first word will not be filtered out but the rest will. Leaving the first line empty solves the problem, but it's not the obvious thing to do.

Expected behavior
All custom stopwords should be filtered out.

Orange version:
3.36.2 (I don't know if it's the native Silicon version or the Intel version)

Text add-on version:
1.15.0

Operating system:
Mac OS 14.1.2 (23B92)

@ajdapretnar
Collaborator

This is an editor issue. When I use Sublime text, the file contains word1\nword2. When I use TextEdit (OSX), the file contains '{\\rtf1\\ansi\\ansicpg1252\\cocoartf2636\n\\cocoatextscaling0\\cocoaplatform0{\\fonttbl\\f0\\fswiss\\fcharset0 Helvetica;}\n{\\colortbl;\\red255\\green255\\blue255;}\n{\\*\\expandedcolortbl;;}\n\\paperw11900\\paperh16840\\margl1440\\margr1440\\vieww11520\\viewh8400\\viewkind0\n\\pard\\tx566\\tx1133\\tx1700\\tx2267\\tx2834\\tx3401\\tx3968\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural\\partightenfactor0\n\n\\f0\\fs24 \\cf0 of\\\nsystem}'.
I think MS Word does the same. You could test with:

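# Read the stopwords file and show its raw contents, including hidden characters.
with open("path/to/file.txt", encoding="utf-8") as f:
    contents = f.read()
print(repr(contents))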

See what you get.
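
A slightly more explicit variant of the check above, as a sketch only: opening the file in binary mode shows the raw leading bytes, so editor-added markup such as the `{\rtf1` header from the TextEdit example becomes easy to spot. The helper name `looks_like_rtf` and the sample file are made up for illustration.

```python
import os
import tempfile


def looks_like_rtf(path):
    """Return True if the file starts with the RTF header bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"{\\rtf"


# Hypothetical sample file, written only for this demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"{\\rtf1\\ansi of\\\nsystem}")
    sample_path = tmp.name

is_rtf = looks_like_rtf(sample_path)
print(is_rtf)
os.remove(sample_path)
```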

@ajdapretnar
Collaborator

@markotoplak Is there a way we could sanitize this internally?

@janezd
Contributor

janezd commented Jul 9, 2024

@ajdapretnar, I guess you are saving text as rich text format (rtf), not plain text.

@wvdvegte probably has a different problem.

@ajdapretnar
Collaborator

I thought the first row was not being filtered because, in RTF, additional parameters get treated as text. So instead of a plain "orange" one would get "{fancyparam:15}orange", and thus the word would not be filtered.

@wvdvegte
Author

I was indeed referring to the use of plain text (TXT), not RTF.

@ajdapretnar
Collaborator

@wvdvegte Could you perhaps send the stopword list? I cannot replicate the issue, so perhaps there's something about the file that is the problem. Thanks!

@wvdvegte
Author

I didn't manage to dig up what I was working on when I reported this in December 2023, but when I try to reproduce the problem now, none of the custom stopwords get filtered out:
stopword filtering.zip

@ajdapretnar
Collaborator

Thank you! Now I've finally managed to reproduce the issue.
As I suspected, it's the editor. The string reads: '\ufeffpig\ncow\nchicken\nhorse\n'. The first character is a BOM (byte order mark), which is apparently typical for Windows. We can solve this by reading the file with encoding='utf-8-sig'.
Will prepare and test the fix.
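
To illustrate the idea, here is a minimal sketch (not the actual patch for the add-on; `read_stopwords` is a hypothetical helper). It writes the string quoted above to a temporary file and reads it back with both encodings, assuming one stopword per line.

```python
import os
import tempfile


def read_stopwords(path, encoding="utf-8-sig"):
    """Read one stopword per line, skipping empty lines."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Recreate a file that starts with a BOM, like the one in the report.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("\ufeffpig\ncow\nchicken\nhorse\n".encode("utf-8"))
    path = tmp.name

with_bom = read_stopwords(path, encoding="utf-8")         # BOM stays glued to the first word
without_bom = read_stopwords(path, encoding="utf-8-sig")  # BOM is dropped while decoding
os.remove(path)

print(with_bom[0] == "pig")     # False
print(without_bom[0] == "pig")  # True
```

Since 'utf-8-sig' also reads files that have no BOM, using it should not change behavior for stopword files saved by other editors.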

@wvdvegte
Author

Typical for Microsoft, perhaps? I created the text file using Word for Mac ...
