Parsing ‘* * * * * * … a’ takes quadratic time #284

andersk · 2019-03-11T17:52:47Z

$ python -c 'print("* "*10000 + "a")' | time cmark > /dev/null
1.21user 0.00system 0:01.23elapsed 98%CPU (0avgtext+0avgdata 6048maxresident)k
0inputs+0outputs (0major+1188minor)pagefaults 0swaps
$ python -c 'print("* "*20000 + "a")' | time cmark > /dev/null
7.55user 0.00system 0:07.59elapsed 99%CPU (0avgtext+0avgdata 9968maxresident)k
0inputs+0outputs (0major+2245minor)pagefaults 0swaps
$ python -c 'print("* "*40000 + "a")' | time cmark > /dev/null
41.23user 0.01system 0:41.44elapsed 99%CPU (0avgtext+0avgdata 18848maxresident)k
0inputs+0outputs (0major+4410minor)pagefaults 0swaps
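
As a rough check on the growth rate, the three timings above can be turned into an empirical exponent. This is only a sketch: the helper name `growth_exponents` is made up for illustration, and three samples give a coarse estimate at best.

```python
import math

# Input sizes and "user" times in seconds, copied from the runs above.
runs = [(10000, 1.21), (20000, 7.55), (40000, 41.23)]


def growth_exponents(samples):
    """Estimate k in time ~ n**k from each pair of consecutive samples."""
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(samples, samples[1:])
    ]


exponents = growth_exponents(runs)
print([round(k, 2) for k in exponents])
```

Both estimates come out above 2, which is consistent with at least quadratic behavior on this input.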

Related: jgm/commonmark-hs#2, mity/md4c#66.

mity · 2019-03-11T18:47:39Z

Got the culprit in MD4C. I bet it will be the same in Cmark:

The line is scanned as thematic break until the final a. When the thematic break fails, it interprets the 1st * as a list item mark, and tries to see the rest of the line as thematic break nested in the list item, with the same result.
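
The rescanning described above can be illustrated with a toy model. This is a simplified sketch, not the MD4C or cmark code: `scan_thematic_break` and `chars_scanned_naive` are hypothetical helpers that only understand single-character bullet markers followed by one space, and they just count how many characters the thematic break check examines.

```python
def scan_thematic_break(line, start):
    """Return (matched, stop). `stop` is the offset of the first character
    that rules out a thematic break, or len(line) if the scan hit the end."""
    if start >= len(line) or line[start] not in "*-_":
        return False, start
    marker, count, i = line[start], 0, start
    while i < len(line):
        if line[i] == marker:
            count += 1
        elif line[i] not in " \t":
            return False, i
        i += 1
    return count >= 3, i


def chars_scanned_naive(line):
    """Retry the thematic break scan at every list nesting level."""
    pos, scanned = 0, 0
    while pos < len(line):
        matched, stop = scan_thematic_break(line, pos)
        # Count every character the scanner looked at, including the one
        # that made it give up.
        scanned += min(stop, len(line) - 1) - pos + 1
        if matched:
            break
        # Not a thematic break: treat "* " as a list marker and descend.
        if line[pos] in "*-+" and line[pos + 1:pos + 2] == " ":
            pos += 2
        else:
            break
    return scanned


print(chars_scanned_naive("* " * 100 + "a"))
print(chars_scanned_naive("* " * 200 + "a"))
```

In this model the count for n markers is (n + 1)^2, so doubling the number of markers roughly quadruples the work.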

jgm · 2019-03-11T20:00:41Z

@mity good tip. Here's the difference it makes if we remove the check for thematic breaks entirely:

| n | with thematic breaks | without thematic breaks |
|-----|------|------|
| 10k | 0.58 | 0.04 |
| 20k | 3.71 | 0.17 |
| 40k | 24.7 | 0.65 |

jgm · 2019-03-11T20:05:33Z

The question is how to fix this. (@mity, what did you do?)
After a list start is parsed, there's always a question whether the remainder is a thematic break.
You could have something like

- 1. * 1. * 1. * * 1. * * *

I suppose that, before parsing any list items, one could parse from the end of the line to determine if it might end with a thematic break, and store that information in state. That would help with cases that end in a, like the original one, but maybe not with cases like the above, where it actually does end with a thematic break.

mity · 2019-03-11T20:11:02Z

I fixed it by remembering the offset of the character (KILLER_OFF) that caused the thematic break to fail. In subsequent tests for a thematic break, if CURRENT_OFF < KILLER_OFF, I know it cannot be a thematic break because the breaking character is still ahead.

mity · 2019-03-11T20:11:56Z

Note that the test is < and not <=, because that saves crazy cases such as:

 * * * - - -
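
The kill-offset idea can be sketched in the same spirit. This is a toy model under stated assumptions, not MD4C's implementation: `parse_line`, `scan_thematic_break` and `kill_off` are hypothetical names, and only single-character bullet markers followed by one space are recognized.

```python
def scan_thematic_break(line, start):
    """Return (matched, stop). `stop` is the offset of the first character
    that rules out a thematic break, or len(line) if the scan hit the end."""
    if start >= len(line) or line[start] not in "*-_":
        return False, start
    marker, count, i = line[start], 0, start
    while i < len(line):
        if line[i] == marker:
            count += 1
        elif line[i] not in " \t":
            return False, i
        i += 1
    return count >= 3, i


def parse_line(line):
    """Return (blocks, scanned): the nesting found on one line and the
    number of characters examined by the thematic break scanner."""
    blocks, scanned, kill_off, pos = [], 0, -1, 0
    while pos < len(line):
        if line[pos] == " ":
            pos += 1
            continue
        # Skip the scan while the character that killed the previous
        # attempt is still ahead of us (strictly: pos < kill_off).
        if pos >= kill_off:
            matched, stop = scan_thematic_break(line, pos)
            scanned += min(stop, len(line) - 1) - pos + 1
            if matched:
                blocks.append("thematic_break")
                break
            kill_off = stop
        if line[pos] in "*-+" and line[pos + 1:pos + 2] == " ":
            blocks.append("list_item")
            pos += 2
        else:
            blocks.append("paragraph")
            break
    return blocks, scanned


print(parse_line(" * * * - - -"))
print(parse_line("* " * 1000 + "a")[1])
```

For ` * * * - - -` the first scan fails at the first `-`, so that offset becomes the kill offset. When parsing later reaches exactly that offset, the scan runs again and finds the nested thematic break; a `<=` comparison would have skipped it. For the pathological `* * * … a` line, the scanner in this model examines 2n + 2 characters instead of (n + 1)^2.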

jgm · 2019-03-17T18:25:09Z

@mity I tried to implement your strategy in the issue284 branch.
I can confirm with printf that it is only running the thematic break scanner once per line. But this doesn't seem to affect the timings.

Is it possible that something else is to blame, perhaps in the HTML writer for nested lists?
(EDIT: It doesn't seem to be the writer; I can comment out print_document without much effect.)

jgm · 2019-03-17T18:32:49Z

I'm seeing only a slight performance difference in this case with the custom thematic break scanner (< 10%, for a pathological input that should emphasize the difference).

jgm · 2019-03-17T18:33:47Z

As noted in the linked issue, I suspect the culprit is actually finalize.

jgm · 2019-03-17T18:39:51Z

Changing `while (item)` in `finalize` to `while (false && item)` seems to make the quadratic behavior go away.

jgm · 2019-03-17T18:52:07Z

The loop in ends_with_blank_line (called by finalize for lists) gets called n^2 times for our sample input. This must be the cause of the quadratic time behavior.

jgm · 2019-03-17T18:53:48Z

We should be able to memoize this: once a list item has been checked to see if it ends in a blank line, we shouldn't need to recurse into its subitems again. This should simply be recorded in the node.
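
A minimal sketch of this memoization, assuming a toy node type rather than cmark's actual `cmark_node`: the `Node` class, its `checked` field and the `count_visits` driver are all hypothetical, and the model only follows the chain of last children.

```python
class Node:
    def __init__(self, last_child=None, last_line_blank=False):
        self.last_child = last_child
        self.last_line_blank = last_line_blank
        self.checked = False  # hypothetical memo flag stored in the node


def ends_with_blank_line(node, memoize, counter):
    """Walk down the last-child chain; `counter[0]` counts node visits."""
    counter[0] += 1
    if memoize and node.checked:
        return node.last_line_blank
    if node.last_child is not None:
        result = ends_with_blank_line(node.last_child, memoize, counter)
    else:
        result = node.last_line_blank
    if memoize:
        # Record the answer so parent items never recurse past this node.
        node.checked = True
        node.last_line_blank = result
    return result


def count_visits(depth, memoize):
    """Finalize a chain of `depth` nested items, innermost first."""
    chain = [Node()]
    for _ in range(depth - 1):
        chain.append(Node(last_child=chain[-1]))
    counter = [0]
    for node in chain:
        ends_with_blank_line(node, memoize, counter)
    return counter[0]


print(count_visits(200, memoize=False))
print(count_visits(200, memoize=True))
```

Without the flag, the visit count in this model grows as depth * (depth + 1) / 2; with it, each call stops after at most two nodes, giving 2 * depth - 1 visits in total.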

jgm · 2019-03-17T19:07:50Z

Hm, that was a good theory, but it doesn't seem to be the issue here: we still have quadratic behavior even after fixing ends_with_blank_line.

mity · 2019-03-17T19:08:13Z

I do not know enough about cmark internals to comment. But I have just tried this in the current master/HEAD:

$ echo ' * * * - - -' | ./src/cmark
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li>
<ul>
<li></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>

While in MD4C:

$ echo ' * * * - - -' | md2html
<ul>
<li><ul>
<li><ul>
<li><hr>
</li>
</ul>
</li>
</ul>
</li>
</ul>

Which means cmark failed to recognize that there is a thematic break nested in the list.

mity · 2019-03-17T19:10:01Z

So likely there is more than one bug.

mity · 2019-03-17T19:13:15Z

So as I see it:

1. cmark fails to recognize the thematic break nested in the list.
2. If you fix that, then you need my fix to keep it from becoming O(n^2).
3. There is yet another O(n^2) bug, perhaps in finalize(), whatever it does.

mity · 2019-03-17T19:30:24Z

Yes that point 3 is ends_with_blank_line() called from finalize().

I guess it is there to detect loose versus tight lists, right? In MD4C this is done on the fly during line analysis. I maintain a stack of currently "opened" container blocks. Every list is initially marked as tight. When I encounter a blank line, I simply consult the stack, and if its top is a list, I set the flag making it loose.

EDIT: It is a little bit more complicated: a blank line sets a helper flag. A non-blank line which still belongs to the same list checks whether the flag is set. This distinguishes blank lines inside or between list items from blank lines after the list.

See https://github.com/mity/md4c/blob/master/md4c/md4c.c#L5917
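
The helper-flag approach described above can be sketched for a single flat list. This is an illustration under simplifying assumptions, not MD4C's code: `is_loose` is a hypothetical function, items start with `- `, continuation lines are indented by two spaces, and nesting is ignored.

```python
def is_loose(lines):
    """Decide loose vs tight in one pass, without revisiting earlier lines."""
    loose = False
    pending_blank = False  # helper flag set by a blank line inside the list
    in_list = False
    for line in lines:
        if line.strip() == "":
            if in_list:
                pending_blank = True
            continue
        belongs = line.startswith("- ") or (in_list and line.startswith("  "))
        if not belongs:
            if in_list:
                break  # the list ended; a pending blank came after it
            continue
        if in_list and pending_blank:
            # A blank line was followed by more list content.
            loose = True
        in_list = True
        pending_blank = False
    return loose


print(is_loose(["- a", "- b"]))
print(is_loose(["- a", "", "- b"]))
print(is_loose(["- a", "- b", "", "para"]))
```

Only the second input is reported as loose. In the third, the blank line sets the helper flag, but no further line of the list ever consults it, so the list stays tight.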

jgm · 2019-03-17T19:59:25Z

I found the issue I was chasing earlier. Fix soon.

But, about your case: why isn't cmark's output correct? `-` can be a bullet list marker too.

mity · 2019-03-17T20:02:45Z

Consistency: The priority of the interpretation should be the same inside the list as on the top level. So as long as this

 - - -

is a thematic break, it should be a thematic break when inside the list.

jgm · 2019-03-17T20:05:34Z

OK, agreed. The code I just pushed solves this problem -- we recognize the inner hrule.

It also solves the performance problem, through a combination of smarter 'blank-in-list' detection and a hand-rolled thematic break scanner with your thematic break kill position idea. Parsing time is now linear with input size for the * * * * a case.

mity · 2019-03-17T20:14:56Z

And moral conclusion for me: Do not make bets without looking into source files...

jgm · 2019-03-18T00:07:34Z

Your bet was right, though: It's just that it was only one of two problems.
I'm glad to have fixed two problems with one bug report!

andersk mentioned this issue Mar 11, 2019

Parsing ‘* * * * * * … a’ takes quadratic time mity/md4c#66

Closed

mity mentioned this issue Mar 11, 2019

Parsing ‘a***a***a***a***…’ takes quadratic time mity/md4c#63

Closed

jgm added a commit that referenced this issue Mar 17, 2019

Use hand-rolled scanner for thematic break.

a02f62f

Keep track of the last position where a thematic break failed to match on a line, to avoid rescanning unnecessarily. See #284.

jgm added a commit that referenced this issue Mar 17, 2019

In ends_with_blank_line, call S_set_last_line_blank...

b4138c5

to avoid unnecessary repetition. Once we settle whether a list item ends in a blank line, we don't need to revisit this in considering parent list items. See #284.

jgm closed this as completed in e5a65e0 Mar 17, 2019

jgm added a commit that referenced this issue Mar 17, 2019

Use hand-rolled scanner for thematic break.

a308cf9

Keep track of the last position where a thematic break failed to match on a line, to avoid rescanning unnecessarily. See #284.

jgm added a commit to commonmark/commonmark.js that referenced this issue Mar 18, 2019

More efficient checking for loose lists.

5181f25

This fixes a case like commonmark/cmark#284.

andersk mentioned this issue Mar 27, 2019

Parsing ‘* * * * * * … a’ takes quadratic time (master, new_algo) pulldown-cmark/pulldown-cmark#249

Closed

dependabot-preview bot mentioned this issue Aug 4, 2019

chore(deps-dev): update commonmark requirement from ^0.28.1 to ^0.29.0 jamesbyrne89/puppeteer#3

Closed

talum referenced this issue in github/cmark-gfm Sep 14, 2021

Add CMARK_NODE__LAST_LINE_CHECKED flag.

854010a

Use this to avoid unnecessary recursion in ends_with_blank_line. Closes #284.

talum pushed a commit to github/cmark-gfm that referenced this issue Sep 14, 2021

Use hand-rolled scanner for thematic break.

dcd36a4

Keep track of the last position where a thematic break failed to match on a line, to avoid rescanning unnecessarily. See commonmark#284.

taku0 added a commit to taku0/cmark that referenced this issue Aug 17, 2023

Remove CMARK_NODE__LAST_LINE_CHECKED flag

fb9375b

This flag was introduced by commonmark#284, but we will not need it once we update `S_ends_with_blank_line` to not use recursion in the next commit.

taku0 mentioned this issue Aug 17, 2023

Fix list tightness #479

Open

taku0 added a commit to taku0/cmark that referenced this issue Aug 19, 2023

Remove CMARK_NODE__LAST_LINE_CHECKED flag

1e73dea

This flag was introduced by commonmark#284, but we will not need it once we update `S_ends_with_blank_line` to not use recursion in the next commit.
