Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Parser Performance #8044

Closed
wants to merge 3 commits into from
Closed

Conversation

pento
Copy link
Member

@pento pento commented Jul 19, 2018

Description

The PHP parser has some performance issues on larger posts, I've been doing a bit of tweaking to get some more speed out of it.

The Block_List pre rule now does a simple search for strings that aren't <!--. Its a reasonable assumption that the vast majority of strings that do match this will be the start of a block, so we're probably going to the bs rule once it does match.

The Block_Balanced children rule has a similar simple search, for the same reason.

The Block_Name_Part rule explicitly defines a handful of block names. This feels a bit weird to be doing in the parser, but $( [a-z][a-z0-9_-]* ) is such a slow rule to match, that I think it's worth it.

I've also enabled pegjs' internal caching on the PHP parser, as this appear to give a decent boost, too.

How has this been tested?

Comparing the current parser with this PR, I get some pretty clear performance gains:

File Current This PR Improvement
demo-post.html 29.8ms 7.3ms 4x
shortcode-shortcomings.html 77.9ms 16.5ms 4x
redesigning-chrome-desktop.html 227.9ms 52.6ms 4x
web-at-maximum-fps.html 169.1ms 41.3ms 4x
early-adopting-the-future.html 270.8ms 62.9ms 4x
pygmalian-raw-html.html 421.9ms 254.6ms 1x
moby-dick-parsed.html 5402.3ms 1197.7ms 4x

@pento pento added [Status] In Progress Tracking issues with work in progress [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f [Type] Performance Related to performance efforts labels Jul 19, 2018
@pento pento self-assigned this Jul 19, 2018
@pento pento requested review from nylen and dmsnell July 19, 2018 06:12
@pento
Copy link
Member Author

pento commented Jul 19, 2018

I've been running the parser through xdebug, which has been super useful for finding problem areas to focus on.

There are still a few areas that could use improvement:

Something in Block_Balanced is super slow, over 50% of the time spent parsing is inside that rule (not even counting sub-rules).

About a third of the remaining time is spent inside the internal input_substr rule, which I haven't figured a good way to optimise. It's a pity that the pegjs model parses an array instead of char-by-char, making this function necessary.

I'd really like if it were possible to define a rule saying "slurp until you encounter string" that generated fast code. Ideally it would just loop over Gutenberg_PEG_Parser::$input until it encountered the string, instead of the weird function calls that it seems to like.

I got fairly epic performance improvements by hacking a fast check into Gutenberg_PEG_Parser::parse() to do most of the heavy lifting for the Block_List pre rule. It reduced pygmalian-raw-html.html to ~0.1s. I couldn't see a nice way to generate that, though.

@pento
Copy link
Member Author

pento commented Jul 19, 2018

Hacking this into ::parse() (and $prepre handling into peg_join_blocks()) shifts pygmalian-raw-html.html from a 1x improvement to a 8x improvement. (In raw numbers, ~420ms down to ~50ms.)

	$prepre = '';
	for ( $ii = 0; $ii < $this->input_length - 1; $ii++ ) {
		if ( '<' === $this->input[ $ii ] && '!' === $this->input[ $ii + 1 ] ) {
			break;
		}
	}

	if ( $ii === $this->input_length - 1 ) {
		$this->cleanup_state();
		return array( array( 'attrs' => array(), 'innerHTML' => implode( $this->input ) ) );
	}

	if ( $ii > 0 ) {
		$prepre = implode( array_slice( $this->input, 0, $ii ) );
		$this->input = array_slice( $this->input, $ii );
		$this->input_length -= $ii;
	}

@gziolo
Copy link
Member

gziolo commented Jul 19, 2018

I moved blocks/api/post.pegjs to its own package in #7664. It generates new JS file for parser out of updated grammar every time you run npm install, npm run dev or npm run build. Let's make sure it gets properly regenerated with the proposed changes.

@nylen
Copy link
Member

nylen commented Jul 19, 2018

It's a pity that the pegjs model parses an array instead of char-by-char, making this function necessary.

It sounds like this is the thing to fix here. phpegjs does this to avoid issues with the ways multibyte char handling differs in JS vs PHP, see #1775.

I think that when the grammar rules themselves do not contain multibyte chars (like the Gutenberg grammar), we should be able to work directly with a string instead. How can we test and prove that?

I've also enabled pegjs' internal caching on the PHP parser, as this appear to give a decent boost, too.

How does this change affect memory usage during parsing?

@dmsnell
Copy link
Member

dmsnell commented Jul 19, 2018

Thanks for working on this @pento.

I think that this will end up failing the Block_List won't it if we have a normal HTML comment somewhere before the first block?

I've got mixed feelings too about optimizing the grammar itself. I know we have need but I'd rather see us build a new implementation of the PHP parser by hand which conforms to the grammar. We do have ways of automating the tests and performance of alternate parsers.

Since the grammar isn't all that complicated I think we can make a good one by hand that is faster than the generated one. I'd love it if we can keep the intention and specification as clear as they can be.

@pento
Copy link
Member Author

pento commented Jul 20, 2018

I think that when the grammar rules themselves do not contain multibyte chars (like the Gutenberg grammar), we should be able to work directly with a string instead. How can we test and prove that?

Yah, that should be possible. UTF-8 encoding requires that only ASCII characters can have the 0xxxxxxx bit structure, all other characters (including single-byte UTF-8 characters) must be made of bytes that have variations of the 1xxxxxxx bit structure. So, it's not possible for an ASCII character to match any part of a UTF-8 character.

I think that this will end up failing the Block_List won't it if we have a normal HTML comment somewhere before the first block?

Maybe, I didn't test it that much. 🙂

I've got mixed feelings too about optimizing the grammar itself. I know we have need but I'd rather see us build a new implementation of the PHP parser by hand which conforms to the grammar.

This is another viable option if the PCRE parser doesn't work out.

I tend to agree that, rather than optimising the grammar, a better focus would be on improving phpegjs performance for cases that are valuable to us, though that gets a bit tricky with the current grammar. For example, $("<!--" .)* could pretty clearly be optimised to use strpos() to find the next occurrence of "<!--", grab everything that happened before it, and move the currPos forward. Optimising $(!Block .)*, on the other hand, which could benefit from exactly the same optimisation, would be significantly more involved.

Hand-coding a PHP parser is an option that would probably give us better memory usage than PCRE, but I don't know how well it would do on time. It would also be less than fun to maintain, if we were to make significant changes to the grammar.

Anyway, thanks for the thoughts, y'all. I'm going to close this for now, pending the outcome of the significantly more advanced experiments in PCRE parsing.

@pento pento closed this Jul 20, 2018
@pento pento deleted the try/improve-parser-performance branch July 20, 2018 03:26
dmsnell added a commit that referenced this pull request Jul 20, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Jul 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 24, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 27, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 29, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 30, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
mcsf pushed a commit that referenced this pull request Aug 31, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Sep 3, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Sep 5, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
gziolo pushed a commit that referenced this pull request Sep 6, 2018
* Parser: Propose new hand-coded PHP parser

For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file

* Fix issue with containing the nested innerHTML

* Also handle newlines as whitespace

* Use classes for some static typing

* add type hints

* remove needless comment

* space where space is due

* meaningless rename

* remove needless function call

* harmonize with spec parser

* don't forget freeform HTML before blocks

* account for oddity in spec-parser

* add some polish, fix a thing

* comment it

* add JS version too

* Change `.` to `[^]` because `/s` isn't well supported in JS

The `s` flag on the RegExp object informs the engine to treat a
dot character as a class that includes the newline character.
Without it newlines aren't considered in the dot.

Since this flag is new to Javascript and not well supported in
different browsers I have removed it in favor of an explicit class
of characters that _does_ include the newline, namely the open
exclusion of `[^]` which permits all input characters.

Hat-top to @Hywan for finding this.

* Move code into `/packages` directory, prepare for review

* take out names from RegExp pattern to not fail tests

* Fix bug in parser: store HTML soup in stack frames while parsing

Previously we were sending all "HTML soup" segments of HTML between
blocks to the output list before any blocks were processed. We should
have been tracking these segments during the parsing and only spit
them out when closing a block at the top level.

This change stores the index into the input document at which that
soup starts if it exists and then produces the freeform block when
adding a block to the output from the parse frame stack.

* fix whitespace

* fix oddity in spec

* match styles

* use class name filter on server-side parser class

* fix whitespace

* Document extensibility

* fix typo in example code

* Push failing parsing test

* fix lazy/greedy bug in parser regexp

* Docs: Fix typos, links, tweak style.

* update from PR feedback

* trim docs

* Load default block parser, replacing PEG-generated one

* Expand `?:` shorthand for PHP 5.2 compat

* add fixtures test for default parser

* spaces to tabs

* could we need no assoc?

* fill out return array

* put that assoc back in there

* isometrize

* rename and add 0

* Conditionally include the parser class

* Add docblocks

* Standardize the package configuration
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f [Status] In Progress Tracking issues with work in progress [Type] Performance Related to performance efforts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants