Infinite Loop Issue #18

Closed
ninoseki opened this issue Mar 8, 2024 · 4 comments

ninoseki commented Mar 8, 2024

Hello, first of all, thank you for creating this library.

I found an infinite loop issue that I would like to report.

When taking this file as an input,

Rtf_Parser("/path/to/file").parse_file()

starts infinite looping like

Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
Missing Control Word
...
@fleetingbytes
Owner

Hello @ninoseki, thank you for using rtfparse and for this bug report. According to the log file ~/rtfparse/rtfparse.debug.log, the error occurs while reading the control word htmlrtf on line 622 of the source file (control words are defined on page 7 of the Rich Text Format (RTF) Specification, Version 1.9.1). That line contains the byte sequence {\htmlrtf0Start with $200 credit. This is valid RTF, but rtfparse fails to find the end of the control word here.

The control word "htmlrtf" has a one-digit parameter "0", and the parameter is delimited by a character other than an ASCII digit, here "S". This "S" indicates that the control word ended at the previous byte. rtfparse does not recognize this correctly. I will fix this.
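
To illustrate the delimiting rule described above, here is a minimal sketch of a control word reader. The pattern `CONTROL_WORD` and the function `read_control_word` are hypothetical names made up for this example and are not rtfparse's actual implementation. The sketch assumes a control word is a backslash, ASCII letters, an optional signed numeric parameter, and at most one space that is consumed as the delimiter; any other following byte simply ends the control word without being consumed.

```python
import re

# Hypothetical pattern for illustration only (not taken from rtfparse):
# backslash, ASCII letters, optional signed numeric parameter,
# and an optional single space that belongs to the control word.
CONTROL_WORD = re.compile(
    rb"\\(?P<name>[a-zA-Z]{1,32})(?P<param>-?\d{1,10})?(?P<space> )?"
)


def read_control_word(data: bytes, pos: int = 0):
    """Return (name, parameter, end_position), or None if no control word starts at pos."""
    match = CONTROL_WORD.match(data, pos)
    if match is None:
        return None
    raw_param = match.group("param")
    param = int(raw_param.decode("ascii")) if raw_param is not None else None
    return match.group("name"), param, match.end()


# The byte sequence from line 622 of the reported file; position 1 skips the "{".
data = rb"{\htmlrtf0Start with $200 credit."
name, param, end = read_control_word(data, 1)
print(name, param, data[end:])  # the "S" is left for the text that follows
```

With the workaround input `{\htmlrtf0 Start with $200 credit.`, the same function consumes the space as part of the control word, so the remaining text starts at "Start" in both cases.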

Meanwhile, a workaround for your document would be to add a space between the 0 and the S: {\htmlrtf0 Start with $200 credit.

@fleetingbytes
Owner

Note to self:
A potential fix in re_patterns.py:

nothing = named_regex_group("nothing", group(rb""))
...
delimiter = named_regex_group("delimiter", rb"|".join((space, newline, other, nothing, rb"$")))

Needs testing.
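
The idea behind an empty "nothing" alternative can be tried out in isolation. The following is a self-contained sketch, not the code in re_patterns.py: the helper `named_group` and the patterns `word`, `strict`, and `lenient` are assumptions made up for this example. It shows that a delimiter group which requires a space or newline fails on `\htmlrtf0Start`, while adding an empty alternative as the last option lets the delimiter match zero bytes.

```python
import re


def named_group(name: bytes, pattern: bytes) -> bytes:
    """Wrap a pattern in a named capturing group (hypothetical helper)."""
    return rb"(?P<" + name + rb">" + pattern + rb")"


space = rb" "
newline = rb"\r?\n"
nothing = rb""  # empty alternative: matches zero bytes

# Control word name plus optional numeric parameter (assumed shape).
word = rb"\\(?P<name>[a-zA-Z]+)(?P<param>-?\d+)?"

# The empty alternative comes last so a real space is still consumed when present.
strict = named_group(b"delimiter", rb"|".join((space, newline)))
lenient = named_group(b"delimiter", rb"|".join((space, newline, nothing)))

data = rb"\htmlrtf0Start with $200 credit."

strict_match = re.match(word + strict, data)
lenient_match = re.match(word + lenient, data)

print(strict_match)                      # no match: "S" is neither space nor newline
print(lenient_match.group("delimiter"))  # empty bytes: the control word ends before "S"
```

Because regex alternation tries options left to right, placing the empty alternative last keeps the existing behavior for inputs that do have a space or newline after the control word.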

fleetingbytes added a commit that referenced this issue Mar 11, 2024
fleetingbytes added a commit that referenced this issue Mar 11, 2024
@fleetingbytes
Owner

@ninoseki Try the new rtfparse 0.9.0 (it's on PyPI). The issue should be fixed there. If you use rtfparse programmatically, please note that some things in the API were renamed. If you only run rtfparse from the CLI, not much has changed, except that it now uses --decapsulate-html instead of --de-encapsulate-html.

@ninoseki
Author

Thanks!
