Speedup line_offset property #1392

eanorige · 2023-10-20T16:21:28Z

When using mkdocs, I noticed huge runtimes and tracked them down to HTMLExtractor.line_offset taking nearly all of the time. Before this patch, a run of mkdocs could take 500+ seconds; with this patch, it takes ~10 seconds.

Flamegraph before optimization:

Flamegraph after optimization:

Note: It may be the case that the full cache of start-positions of all lines isn't needed, and just a single entry is sufficient; I didn't explore this optimization.

* Replace dynamic regex with string find operation
* Add cache of where each line starts so we don't have quadratic behavior identifying line numbers when importing large chunks of html
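
A minimal sketch of the two ideas in this commit message, not the code from the patch: line starts are located with `str.find` instead of a regex rebuilt on every call, and each start position is cached so that later lookups resume from the last known line rather than rescanning from the top. The class name `LineStartCache` and its `offset` method are hypothetical, and falling back to EOF for line numbers past the end is an arbitrary choice made only for this sketch.

```python
class LineStartCache:
    """Hypothetical helper: maps 1-based line numbers to character offsets."""

    def __init__(self, text: str) -> None:
        self.text = text
        # starts[i] is the offset where line i + 1 begins; line 1 starts at 0.
        self.starts = [0]

    def offset(self, lineno: int) -> int:
        # Extend the cache only as far as needed, resuming from the last
        # known line start instead of rescanning from the top of the text.
        while len(self.starts) < lineno:
            lf_pos = self.text.find("\n", self.starts[-1])
            if lf_pos == -1:
                # No further newline: lineno is past the last line.
                # Falling back to EOF is an arbitrary choice for this sketch.
                return len(self.text)
            self.starts.append(lf_pos + 1)
        return self.starts[max(lineno, 1) - 1]


cache = LineStartCache("one\ntwo\nthree")
print(cache.offset(1), cache.offset(2), cache.offset(3))
```

Each newline is scanned at most once across all calls, so looking up every line of an n-character input costs O(n) in total instead of O(n) per lookup.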

Forgot to remove old implementation

eanorige · 2023-10-23T05:30:29Z

Thanks for triggering CI. Based on its feedback, I have

* added the auto-requested changelog entry
* removed the extra blank line accidentally introduced
* reworded a comment.

In rewording this comment, I realize that there is a slight change in behavior of this function when self.lineno is past the last line; the old implementation would return the position of the last '\n' in the input, while this implementation returns the length of the input (i.e. EOF). If the exact backwards compatible behavior is needed, lf_pos = len(self.rawdata) should be changed to = self.rawdata.rfind('\n'). Let me know what you think.
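
To make the difference concrete, the two candidate fallback values can be compared on a small input that does not end with a newline. This is only an illustration on a literal string; `rawdata` is a local variable standing in for `self.rawdata`, and the two expressions are the ones named in the paragraph above.

```python
rawdata = "line 1\nline 2\nline 3"  # stand-in for self.rawdata, no trailing newline

eof_pos = len(rawdata)             # fallback described for this patch: end of input
last_lf_pos = rawdata.rfind("\n")  # fallback described for the old code: last newline

print(eof_pos, last_lf_pos)
```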

facelessuser · 2023-10-23T12:33:25Z

@waylan I am not as familiar with the HTML parsing that you implemented. Can a buffer in this situation change on you, invalidating the cached offset?

> If the exact backwards compatible behavior is needed, lf_pos = len(self.rawdata) should be changed to = self.rawdata.rfind('\n'). Let me know what you think.

Generally, I would say that staying backwards compatible gives this a better chance of getting merged quickly. When something backwards incompatible is introduced, we need to be even more cautious to ensure it doesn't inadvertently break some other behavior and cost us more debugging time later.

While the speedup is certainly nice, I would definitely want to ensure this hasn't introduced some unexpected parsing behavior that is going to break 3rd party plugins in ways that aren't immediately obvious.

facelessuser · 2023-10-23T12:37:51Z

I guess if the change could be somehow identified as inadvertent behavior, then it would make sense to not retain backwards compatibility.

waylan · 2023-10-23T14:00:50Z

My memory is very hazy about this. That said, the code that this PR replaces is discussed in #1066 and/or #1068 and was committed in #1069. It appears that we added tests to cover the various edge cases.

> the old implementation would return the position of the last '\n' in the input, while this implementation returns the length of the input (i.e. EOF). If the exact backwards compatible behavior is needed, lf_pos = len(self.rawdata) should be changed to = self.rawdata.rfind('\n'). Let me know what you think.

Yes, we want the exact same behavior. It is possible that a file may not end with a newline and that would have been taken into consideration with the existing implementation. Therefore, we would need the method to always return the position of a \n exactly. Returning any other position would be an error. Perhaps we need to add a test for that.

waylan · 2023-10-23T14:50:26Z

> can a buffer in this situation change on you, invalidating the cached offset?

In short, yes. See #1066 (comment) (and the few that follow) for an explanation. As a reminder, we are using the standard library HTML parser to extract HTML from non-HTML and need to hack it to make it work with any non-HTML which contains angle brackets.

> Therefore, we would need the method to always return the position of a \n exactly.

Actually, what I should have said is that the method must always return the beginning of a line. EOF is only the beginning of an (empty) line if the buffer ends with a newline. If it does not end with a newline, then EOF is not the beginning of a line and it would be an error to return that position. In fact, looking at the example given in the above linked comment, it is possible that returning EOF could fail to address the purpose of this hack altogether.
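
The rule stated here can be written as a small predicate, which may help when reasoning about the EOF case. This is a sketch only; `is_line_start` is a hypothetical name and is not part of the parser.

```python
def is_line_start(text: str, pos: int) -> bool:
    """Return True if pos is offset 0 or immediately follows a newline."""
    if pos < 0 or pos > len(text):
        return False
    return pos == 0 or text[pos - 1] == "\n"


print(is_line_start("foo\n", 4))  # EOF after a trailing newline
print(is_line_start("foo", 3))    # EOF without a trailing newline
```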

facelessuser · 2023-10-23T14:56:53Z

> In short, yes. See #1066 (comment) (and the few that follow) for an explanation. As a reminder, we are using the standard library HTML parser to extract HTML from non-HTML and need to hack it to make it work with any non-HTML which contains angle brackets.

This poses a problem then. If the cache can be invalidated without us knowing, then caching is likely a bad idea.

waylan · 2023-10-23T18:13:47Z

Sorry, I think I misread your question. No, the buffer doesn't change. However, the parser jumps too far ahead and we need to back it up. What is being cached is the last position. The new position we back it up to should always be after the last cached position. And now I'm wondering if I am wrong about what to do if no more newlines are found.

eanorige · 2023-10-23T18:54:57Z

> Actually, what I should have said is that the method always must return the beginning of a line.

The old behavior is actually a bit more complex than that. Here's a sketch of the current (before my patch) implementation:

* When there are no newlines in the input file or lineno is <= 1, return 0
* Else if lineno is a valid line number, return the position of the first character *after* the newline
* Else return the position of the last newline.

I suspect that there's a bug in the existing code and that we should return rfind('\n')+1 in the last case to be consistent with the first two.
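
The three cases can be sketched as a standalone function. This is a reconstruction from the description above rather than a copy of the existing method: the function name `old_line_offset` and its arguments are hypothetical, and the regex is an assumed stand-in for the "dynamic regex" mentioned in the commit message.

```python
import re


def old_line_offset(rawdata: str, lineno: int) -> int:
    # Case 1: no newlines in the input, or lineno <= 1.
    if lineno <= 1 or "\n" not in rawdata:
        return 0
    # Case 2: a valid line number. Match lineno - 1 complete lines; the
    # match ends on the first character after the newline.
    m = re.match(r"(?:[^\n]*\n){%d}" % (lineno - 1), rawdata)
    if m:
        return m.end()
    # Case 3: lineno is past the last line. Returns the position of the
    # last newline itself, not the character after it.
    return rawdata.rfind("\n")


text = "a\nbc\nd"
print([old_line_offset(text, n) for n in (1, 2, 3, 4)])
```

In this sketch, the value for line 4 points at a newline character, while the values for lines 2 and 3 point just past one, which is the inconsistency described above.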

ed: remove caching invalidation discussion

waylan · 2023-10-24T19:55:28Z

> I suspect that there's a bug in the existing code and that we should return rfind('\n')+1 in the last case to be consistent with the first two.

What is curious is why no tests are failing, unless the character at the +1 position is insignificant whitespace or something. I wondered if maybe the tests weren't triggering that code path, but the coverage report says they are. I guess I need to take a closer look at the tests. One or two might need to be tweaked to be sure they fail if we are off by one character.

waylan · 2023-10-26T18:21:54Z

I just took another look at this. Specifically, the behavior when no newlines are found. I confirmed that we have no tests for this edge case. In fact, the existing code has a # pragma: no cover comment on that code block. If I remove the comment, that line shows up in the coverage report as missing.

I don't recall if I added the comment because I couldn't come up with a scenario which triggered that code path, or because I was focusing on other things and expected to come back to it later. In any event, we should probably try to come up with a test for this. If we do, then that should dictate how it should behave. If not, then I'm not going to be too concerned about it.

waylan · 2023-10-26T18:45:45Z

So, the concern (how the HTML parser could get into the state where the line number is greater than the total number of lines) is described in this comment. However, among various similar tests, the following test exists:

markdown/tests/test_syntax/blocks/test_html_blocks.py

Lines 1119 to 1139 in 4f0b91a

    
    def test_raw_processing_instruction_code_span(self):
        self.assertMarkdownRenders(
            self.dedent(
                """
                `<?php`
                <div>
                foo
                </div>
                """
            ),
            self.dedent(
                """
                <p><code>&lt;?php</code></p>
                <div>
                foo
                </div>
                """
            )
        )

I just tried many other variations and I cannot trigger that condition. I expect that it would take a very unusual edge case (with very poorly formed HTML) to ever reach it. Therefore, I'm not going to sweat it.

eanorige added 2 commits October 20, 2023 09:10

Speedup line_offset property

4a2a7dc

* Replace dynamic regex with string find operation
* Add cache of where each line starts so we don't have quadratic behavior identifying line numbers when importing large chunks of html

Update htmlparser.py

70b30e7

Forgot to remove old implementation

waylan added the needs-review label (Needs to be reviewed and/or approved.) Oct 22, 2023

eanorige added 3 commits October 22, 2023 22:04

Add changelog entry

e300ce4

Remove extra blank line causing CI failure.

f32969a

Improve comment to only use words

7d403f2

waylan approved these changes Oct 26, 2023


waylan merged commit 524e4da into Python-Markdown:master Oct 26, 2023
16 checks passed

waylan mentioned this pull request Oct 30, 2023

Improve type annotations (add more and fix wrong ones) #1394

Merged
