Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing ‘* * * * * * … a’ takes quadratic time #284

Closed
andersk opened this issue Mar 11, 2019 · 21 comments
Closed

Parsing ‘* * * * * * … a’ takes quadratic time #284

andersk opened this issue Mar 11, 2019 · 21 comments

Comments

@andersk
Copy link

andersk commented Mar 11, 2019

$ python -c 'print("* "*10000 + "a")' | time cmark > /dev/null
1.21user 0.00system 0:01.23elapsed 98%CPU (0avgtext+0avgdata 6048maxresident)k
0inputs+0outputs (0major+1188minor)pagefaults 0swaps
$ python -c 'print("* "*20000 + "a")' | time cmark > /dev/null
7.55user 0.00system 0:07.59elapsed 99%CPU (0avgtext+0avgdata 9968maxresident)k
0inputs+0outputs (0major+2245minor)pagefaults 0swaps
$ python -c 'print("* "*40000 + "a")' | time cmark > /dev/null
41.23user 0.01system 0:41.44elapsed 99%CPU (0avgtext+0avgdata 18848maxresident)k
0inputs+0outputs (0major+4410minor)pagefaults 0swaps

Related: jgm/commonmark-hs#2, mity/md4c#66.

@mity
Copy link
Contributor

mity commented Mar 11, 2019

Got the culprit in MD4C. I bet it will be the same in Cmark:

The line is scanned as thematic break until the final a. When the thematic break fails, it interprets the 1st * as a list item mark, and tries to see the rest of the line as thematic break nested in the list item, with the same result.

@jgm
Copy link
Member

jgm commented Mar 11, 2019

@mity good tip. Here's the difference it makes if we remove the check for thematic breaks entirely:

n with thematic breaks without thematic breaks
10k 0.58 0.04
20k 3.71 0.17
40k 24.7 0.65

@jgm
Copy link
Member

jgm commented Mar 11, 2019

The question is how to fix this. (@mity, what did you do?)
After a list start is parsed, there's always a question whether the remainder is a thematic break.
You could have something like

- 1. * 1. * 1. * * 1. * * *

I suppose that, before parsing any list items, one could parse from the end of the line to determine if it might end with a thematic break, and store that information in state. That would help with cases that end in a, like the original one, but maybe not with cases like the above, where it actually does end with a thematic break.

@mity
Copy link
Contributor

mity commented Mar 11, 2019

I fixed by remembering the offset of character (KILLER_OFF) which caused the thematic break to fail. In subsequent tests for thematic break, if CURRENT_OFF < KILLER_OFF, I know it cannot be thematic break because the breaking character is still there.

@mity
Copy link
Contributor

mity commented Mar 11, 2019

Note < and not <= because it saves crazy cases like e.g.:

 * * * - - -

jgm added a commit that referenced this issue Mar 17, 2019
Keep track of the last position where a thematic break
failed to match on a line, to avoid rescanning unnecessarily.

See #284.
@jgm
Copy link
Member

jgm commented Mar 17, 2019

@mity I tried to implement your strategy in the issue284 branch.
I can confirm with printf that it is only running the thematic break scanner once per line. But this doesn't seem to affect the timings.

Is it possible that something else is to blame, perhaps in the HTML writer for nested lists?
(EDIT: It doesn't seem to be the writer; I can comment out print_document without much effect.)

@jgm
Copy link
Member

jgm commented Mar 17, 2019

I'm seeing only a slight performance difference in this case with the custom thematic break scanner (< 10%, for a pathological input that should emphasize the difference).

@jgm
Copy link
Member

jgm commented Mar 17, 2019

As noted in the linked issue, I suspect the culprit is actually finalize.

@jgm
Copy link
Member

jgm commented Mar 17, 2019

Changing while (item) in finalize to while (false && item) seems to make the quadratic behavior go away.

@jgm
Copy link
Member

jgm commented Mar 17, 2019

The loop in ends_with_blank_line (called by finalize for lists) gets called n^2 times for our sample input. This must be the cause of the quadratic time behavior.

@jgm
Copy link
Member

jgm commented Mar 17, 2019

We should be able to memoize this: once a list item has been checked to see if it ends in a blank line, we shouldn't need to recurse into its subitems again. This should simply be recorded in the node.

jgm added a commit that referenced this issue Mar 17, 2019
to avoid unnecessary repetition.  Once we settle
whether a list item ends in a blank line, we don't
need to revisit this in considering parent list items.

See #284.
@jgm
Copy link
Member

jgm commented Mar 17, 2019

Hm, that was a good theory, but it doesn't seem to be the issue here: we still have quadratic behavior even after fixing ends_with_blank_line.

@mity
Copy link
Contributor

mity commented Mar 17, 2019

I do not know enough about cmark internals to comment. But I have just tried this in the current master/HEAD:

$ echo ' * * * - - -' | ./src/cmark
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>

While in MD4C:

$ echo ' * * * - - -' | md2html
<ul>
<li><ul>
<li><ul>
<li><hr>
</li>
</ul>
</li>
</ul>
</li>
</ul>

Which means cmark failed to recognize there is thematic break nested in the list.

@mity
Copy link
Contributor

mity commented Mar 17, 2019

So likely there is more then one bug.

@mity
Copy link
Contributor

mity commented Mar 17, 2019

So as I see it:

  1. cmark fails to recognize the thematic break nested in the list.
  2. if you fix it, then you need my fix for it to not make it O(n^2).
  3. there is yet another O(n^2) bug, perhaps in finalize(), whatever it does.

@mity
Copy link
Contributor

mity commented Mar 17, 2019

Yes that point 3 is ends_with_blank_line() called from finalize().

I guess, it is to detect loose versus tight list, right? In MD4C this is done on the fly during line analysis. I maintain a stack of currently "opened" container blocks. Every list is initially marked as tight one. When I encounter a blank line, I simply consult the stack and if its top is a list, then I set the flag making it a loose one.

EDIT: It is little bit more complicated: Blank line sets a helper flag. Non-blank line which still belongs to the same list does the check whether the flag is set. This distinguishes blank lines inside/between list items versus after the list.

See https://github.com/mity/md4c/blob/master/md4c/md4c.c#L5917

@jgm
Copy link
Member

jgm commented Mar 17, 2019

I found the issue I was chasing earlier. Fix soon.

But, about your case: why isn't cmark's output correct? - can be a bullet list marker too.

@mity
Copy link
Contributor

mity commented Mar 17, 2019

Consistency: The priority of the interpretation should be the same inside the list as on the top level. So as long as this

 - - -

is thematic break, it should be thematic break when inside the list.

@jgm jgm closed this as completed in e5a65e0 Mar 17, 2019
jgm added a commit that referenced this issue Mar 17, 2019
Keep track of the last position where a thematic break
failed to match on a line, to avoid rescanning unnecessarily.

See #284.
@jgm
Copy link
Member

jgm commented Mar 17, 2019

OK, agreed. The code I just pushed solves this problem -- we recognize the inner hrule.

It also solves the performance problem, through a combination of smarter 'blank-in-list' detection and a hand-rolled thematic break scanner with your thematic break kill position idea. Parsing time is now linear with input size for the * * * * a case.

@mity
Copy link
Contributor

mity commented Mar 17, 2019

And moral conclusion for me: Do not make bets without looking into source files...

@jgm
Copy link
Member

jgm commented Mar 18, 2019

Your bet was right, though: It's just that it was only one of two problems.
I'm glad to have fixed two problems with one bug report!

talum pushed a commit to github/cmark-gfm that referenced this issue Sep 14, 2021
to avoid unnecessary repetition.  Once we settle
whether a list item ends in a blank line, we don't
need to revisit this in considering parent list items.

See commonmark#284.
talum referenced this issue in github/cmark-gfm Sep 14, 2021
Use this to avoid unnecessary recursion in ends_with_blank_line.

Closes #284.
talum pushed a commit to github/cmark-gfm that referenced this issue Sep 14, 2021
Keep track of the last position where a thematic break
failed to match on a line, to avoid rescanning unnecessarily.

See commonmark#284.
taku0 added a commit to taku0/cmark that referenced this issue Aug 17, 2023
This flag was introduced by
commonmark#284, but we will not need it
once we update `S_ends_with_blank_line` to not use resursion in the next
commit.
taku0 added a commit to taku0/cmark that referenced this issue Aug 19, 2023
This flag was introduced by
commonmark#284, but we will not need it
once we update `S_ends_with_blank_line` to not use resursion in the next
commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants