Parser: Propose new hand-coded parser #8083
Merged
43 commits
- bf4c1e1 Parser: Propose new hand-coded PHP parser (dmsnell)
- bb7ff54 Fix issue with containing the nested innerHTML (dmsnell)
- a905f58 Also handle newlines as whitespace (dmsnell)
- 1bebf95 Use classes for some static typing (dmsnell)
- e0256b3 add type hints (dmsnell)
- 070b4f2 remove needless comment (dmsnell)
- 987b6e6 space where space is due (dmsnell)
- 92c110d meaningless rename (dmsnell)
- 21132d3 remove needless function call (dmsnell)
- 474eab3 harmonize with spec parser (dmsnell)
- 4501e9a don't forget freeform HTML before blocks (dmsnell)
- 6ed9e50 account for oddity in spec-parser (dmsnell)
- 029feb0 add some polish, fix a thing (dmsnell)
- 5230045 comment it (dmsnell)
- 760ad75 add JS version too (dmsnell)
- ce42f86 Change `.` to `[^]` because `/s` isn't well supported in JS (dmsnell)
- 3ed3424 Move code into `/packages` directory, prepare for review (dmsnell)
- a448817 take out names from RegExp pattern to not fail tests (dmsnell)
- ed917f3 Fix bug in parser: store HTML soup in stack frames while parsing (dmsnell)
- b440a86 fix whitespace (dmsnell)
- 76c8d50 fix oddity in spec (dmsnell)
- e9bd804 match styles (dmsnell)
- 1e91266 use class name filter on server-side parser class (dmsnell)
- e80a6d9 fix whitespace (dmsnell)
- 45d7c7b Document extensibility (dmsnell)
- 1b7592a fix typo in example code (dmsnell)
- c60b95d Push failing parsing test (mcsf)
- 10a2097 fix lazy/greedy bug in parser regexp (dmsnell)
- 6a232a4 Docs: Fix typos, links, tweak style. (mcsf)
- cb13b54 update from PR feedback (dmsnell)
- ce1864f trim docs (dmsnell)
- f5b97a6 Load default block parser, replacing PEG-generated one (mcsf)
- 9c72d5e Expand `?:` shorthand for PHP 5.2 compat (mcsf)
- a57e448 add fixtures test for default parser (dmsnell)
- 20e6131 spaces to tabs (dmsnell)
- 0bd5e71 could we need no assoc? (dmsnell)
- 08015d7 fill out return array (dmsnell)
- 1004cbe put that assoc back in there (dmsnell)
- 3dc74fd isometrize (dmsnell)
- 22f10de rename and add 0 (dmsnell)
- a41a995 Conditionally include the parser class (jorgefilipecosta)
- fe98a4a Add docblocks (dmsnell)
- 9463906 Standardize the package configuration (gziolo)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# Extending the Parser | ||
|
||
When the editor is interacting with blocks, these are stored in memory as data structures comprising a few basic properties and attributes. Upon saving a working post we serialize these data structures into a specific HTML structure and save the resultant string into the `post_content` property of the post in the WordPress database. When we load that post back into the editor we have to make the reverse transformation to build those data structures from the serialized format in HTML. | ||
|
||
The process of loading the serialized HTML into the editor is performed by the _block parser_. The formal specification for this transformation is encoded in the parsing expression grammar (PEG) inside the `@wordpress/block-serialization-spec-parser` package. The editor provides a default parser implementation of this grammar but there may be various reasons for replacing that implementation with a custom implementation. We can inject our own custom parser implementation through the appropriate filter. | ||
|
||
## Server-side parser | ||
|
||
Plugins have access to the parser if they want to process posts in their structured form instead of a plain HTML-as-string representation. | ||
|
||
## Client-side parser | ||
|
||
The editor uses the client-side parser while interactively working in a post. The plain HTML-as-string representation is sent to the browser by the backend and then the editor performs the first parse to initialize itself. | ||
|
||
## Filters | ||
|
||
To replace the server-side parser, use the `block_parser_class` filter. The filter transforms the string class name of a parser class. This class is expected to expose a `parse` method. | ||
|
||
_Example:_ | ||
|
||
```php | ||
class EmptyParser { | ||
public function parse( $post_content ) { | ||
// return an empty document | ||
return array(); | ||
} | ||
} | ||
|
||
function my_plugin_select_empty_parser( $prev_parser_class ) { | ||
return 'EmptyParser'; | ||
} | ||
|
||
add_filter( 'block_parser_class', 'my_plugin_select_empty_parser', 10, 1 ); | ||
``` | ||
|
||
> **Note**: At the present time it's not possible to replace the client-side parser. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
package-lock=false |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
## 1.0.0 | ||
|
||
- Initial release. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,126 @@ | ||
# Block Serialization Default Parser | ||
|
||
This library contains the default block serialization parser implementations for WordPress documents. It provides native PHP and JavaScript parsers that implement the specification from `@wordpress/block-serialization-spec-parser` and that normally operate on the document stored in `post_content`. | ||
|
||
## Installation | ||
|
||
Install the module: | ||
|
||
```bash | ||
npm install @wordpress/block-serialization-default-parser --save | ||
``` | ||
|
||
_This package assumes that your code will run in an **ES2015+** environment. If you're using an environment that has limited or no support for ES2015+ such as lower versions of IE then using [core-js](https://github.com/zloirock/core-js) or [@babel/polyfill](https://babeljs.io/docs/en/next/babel-polyfill) will add support for these methods. Learn more about it in [Babel docs](https://babeljs.io/docs/en/next/caveats)._ | ||
|
||
## Usage | ||
|
||
Input post: | ||
```html | ||
<!-- wp:columns {"columns":3} --> | ||
<div class="wp-block-columns has-3-columns"><!-- wp:column --> | ||
<div class="wp-block-column"><!-- wp:paragraph --> | ||
<p>Left</p> | ||
<!-- /wp:paragraph --></div> | ||
<!-- /wp:column --> | ||
|
||
<!-- wp:column --> | ||
<div class="wp-block-column"><!-- wp:paragraph --> | ||
<p><strong>Middle</strong></p> | ||
<!-- /wp:paragraph --></div> | ||
<!-- /wp:column --> | ||
|
||
<!-- wp:column --> | ||
<div class="wp-block-column"></div> | ||
<!-- /wp:column --></div> | ||
<!-- /wp:columns --> | ||
``` | ||
|
||
Parsing code: | ||
```js | ||
import { parse } from '@wordpress/block-serialization-default-parser'; | ||
|
||
parse( post ) === [ | ||
{ | ||
blockName: "core/columns", | ||
attrs: { | ||
columns: 3 | ||
}, | ||
innerBlocks: [ | ||
{ | ||
blockName: "core/column", | ||
attrs: null, | ||
innerBlocks: [ | ||
{ | ||
blockName: "core/paragraph", | ||
attrs: null, | ||
innerBlocks: [], | ||
innerHTML: "\n<p>Left</p>\n" | ||
} | ||
], | ||
innerHTML: '\n<div class="wp-block-column"></div>\n' | ||
}, | ||
{ | ||
blockName: "core/column", | ||
attrs: null, | ||
innerBlocks: [ | ||
{ | ||
blockName: "core/paragraph", | ||
attrs: null, | ||
innerBlocks: [], | ||
innerHTML: "\n<p><strong>Middle</strong></p>\n" | ||
} | ||
], | ||
innerHTML: '\n<div class="wp-block-column"></div>\n' | ||
}, | ||
{ | ||
blockName: "core/column", | ||
attrs: null, | ||
innerBlocks: [], | ||
innerHTML: '\n<div class="wp-block-column"></div>\n' | ||
} | ||
], | ||
innerHTML: '\n<div class="wp-block-columns has-3-columns">\n\n\n\n</div>\n' | ||
} | ||
]; | ||
``` | ||
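The result is a plain tree of objects, so it can be walked with ordinary recursion. The following is a minimal sketch, not part of the package: `collectBlockNames` is a hypothetical helper, and `parsed` is a hand-written, abbreviated tree in the shape shown above rather than real parser output.

```js
// Hypothetical helper (not provided by the package): list every block
// name in a parsed document, depth first.
function collectBlockNames( blocks ) {
	const names = [];
	for ( const block of blocks ) {
		names.push( block.blockName );
		names.push( ...collectBlockNames( block.innerBlocks ) );
	}
	return names;
}

// Hand-written tree in the shape of the parse result, abbreviated.
const parsed = [
	{
		blockName: 'core/columns',
		attrs: { columns: 3 },
		innerBlocks: [
			{
				blockName: 'core/column',
				attrs: null,
				innerBlocks: [
					{
						blockName: 'core/paragraph',
						attrs: null,
						innerBlocks: [],
						innerHTML: '\n<p>Left</p>\n',
					},
				],
				innerHTML: '\n<div class="wp-block-column"></div>\n',
			},
		],
		innerHTML: '\n<div class="wp-block-columns has-3-columns"></div>\n',
	},
];

console.log( collectBlockNames( parsed ) );
// [ 'core/columns', 'core/column', 'core/paragraph' ]
```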
|
||
## Theory | ||
|
||
### What is different about this one from the spec-parser? | ||
|
||
This is a recursive-descent parser that scans linearly once through the input document. Instead of directly recursing it utilizes a trampoline mechanism to prevent stack overflow. It minimizes data copying and passing through the use of globals for tracking state through the parse. Between every token (a block comment delimiter) we can instrument the parser and intervene should we want to; for example we might put a hard limit on how long we can be parsing a document or provide additional debugging diagnostics for a document. | ||
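To illustrate the trampoline idea in isolation, here is a small sketch that is not taken from the package. A step function does one unit of work and reports whether more remains, and a plain loop drives it so the call stack never grows with nesting depth. The names `makeDepthCounter`, `step`, and `run` are hypothetical, and the task (measuring parenthesis depth) only stands in for block nesting.

```js
// Illustrative trampoline: state lives outside the step function, and the
// driver loop (not recursion) moves the machine forward.
function makeDepthCounter( input ) {
	let offset = 0;
	let depth = 0;
	let maxDepth = 0;

	function step() {
		if ( offset >= input.length ) {
			return false; // nothing left to do
		}
		const ch = input[ offset++ ];
		if ( ch === '(' ) {
			depth++;
			maxDepth = Math.max( maxDepth, depth );
		} else if ( ch === ')' ) {
			depth--;
		}
		return true; // more work remains
	}

	return { step, result: () => maxDepth };
}

function run( machine, maxSteps = Infinity ) {
	let steps = 0;
	// Between steps we are free to instrument or abort the run.
	while ( machine.step() ) {
		if ( ++steps > maxSteps ) {
			throw new Error( 'Step limit reached' );
		}
	}
	return machine.result();
}

// Very deep nesting is handled without growing the call stack.
const deep = '('.repeat( 100000 ) + ')'.repeat( 100000 );
console.log( run( makeDepthCounter( deep ) ) ); // 100000
```

The optional `maxSteps` argument shows one way to put a hard limit on the work done between tokens, as the paragraph above suggests.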
|
||
The spec parser is defined via a _Parsing Expression Grammar_ (PEG) which answers many questions inherently that we must answer explicitly in this parser. The goal for this implementation is to match the characteristics of the PEG so that it can be directly swapped out and so that the only changes are better runtime performance and memory usage. | ||
|
||
### How does it work? | ||
|
||
Every serialized Gutenberg document is nominally an HTML document which, in addition to normal HTML, may also contain specially designed HTML comments -- the block comment delimiters -- which separate and isolate the blocks serialized in the document. | ||
|
||
This parser attempts to create a state machine around the transitions triggered by those delimiters -- the "tokens" of the grammar. Every time we find one we should do exactly one of two things: | ||
|
||
- enter a new block; | ||
- exit out of a block. | ||
|
||
Those actions have different effects depending on the context; for instance, when we exit a block we either need to add it to the output block list _or_ we need to append it as the next `innerBlock` on the parent block below it in the block stack (the place where we track open blocks). The details are documented below. | ||
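The enter/exit behavior can be sketched with a block stack as follows. This is an assumption-laden illustration, not the package's code: the tokens are supplied pre-split as hypothetical `{ type, name }` objects, `buildTree` is a made-up name, and `innerHTML` is left out entirely.

```js
// Illustrative only: the input is a list of already-recognized tokens.
function buildTree( tokens ) {
	const output = []; // finished top-level blocks
	const stack = []; // currently open blocks, innermost last

	for ( const token of tokens ) {
		if ( token.type === 'open' ) {
			// Enter a new block.
			stack.push( { blockName: token.name, innerBlocks: [] } );
			continue;
		}

		// Exit the innermost open block.
		const block = stack.pop();
		if ( ! block ) {
			continue; // stray closer: ignored in this sketch
		}
		if ( stack.length === 0 ) {
			// No parent is open, so the block goes to the output list.
			output.push( block );
		} else {
			// Otherwise it becomes the next inner block of its parent.
			stack[ stack.length - 1 ].innerBlocks.push( block );
		}
	}

	// Blocks still open at the end are dropped in this sketch.
	return output;
}

const tree = buildTree( [
	{ type: 'open', name: 'core/columns' },
	{ type: 'open', name: 'core/column' },
	{ type: 'open', name: 'core/paragraph' },
	{ type: 'close' },
	{ type: 'close' },
	{ type: 'open', name: 'core/column' },
	{ type: 'close' },
	{ type: 'close' },
] );

console.log( tree[ 0 ].innerBlocks.length ); // 2
```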
|
||
The biggest challenge in this parser is making the right accounting of indices required to construct the `innerHTML` values for each block at every level of nesting depth. We take a simple approach: | ||
|
||
- Start each newly opened block with an empty `innerHTML`. | ||
- Whenever we push a first block into the `innerBlocks` list, add the content from where the content of the parent block started to where this inner block starts. | ||
- Whenever we push another block into the `innerBlocks` list, add the content from where the previous inner block ended to where this inner block starts. | ||
- When we close out an open block, add the content from where the last inner block ended to where the closing block delimiter starts. | ||
- If there are no inner blocks then we take the entire content between the opening and closing block comment delimiters as the `innerHTML`. | ||
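The index accounting in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's implementation: `collectInnerHTML` and the stack-frame fields are hypothetical names, and the delimiter pattern is deliberately reduced (no attributes, no namespaces, no self-closing blocks).

```js
// Simplified delimiter pattern, for illustration only.
const delimiter = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*)\s+-->/g;

function collectInnerHTML( document ) {
	const output = [];
	const stack = [];

	for ( const match of document.matchAll( delimiter ) ) {
		const start = match.index;
		const end = start + match[ 0 ].length;

		if ( ! match[ 1 ] ) {
			// Opener: start with empty innerHTML and remember where this
			// block's own content begins.
			stack.push( {
				block: { blockName: match[ 2 ], innerBlocks: [], innerHTML: '' },
				tokenStart: start,
				prevOffset: end,
			} );
			continue;
		}

		const frame = stack.pop();
		if ( ! frame ) {
			continue; // stray closer: ignored in this sketch
		}
		// Closer: add content from the last inner block (or from the opener,
		// if there were no inner blocks) up to the closing delimiter.
		frame.block.innerHTML += document.slice( frame.prevOffset, start );

		const parent = stack[ stack.length - 1 ];
		if ( ! parent ) {
			output.push( frame.block );
			continue;
		}
		// Pushing an inner block: add the parent's content that precedes it,
		// then resume collecting after the inner block's closer.
		parent.block.innerHTML += document.slice( parent.prevOffset, frame.tokenStart );
		parent.block.innerBlocks.push( frame.block );
		parent.prevOffset = end;
	}

	return output;
}

const sample =
	'<!-- wp:column --><div><!-- wp:paragraph --><p>Hi</p>' +
	'<!-- /wp:paragraph --></div><!-- /wp:column -->';
const [ column ] = collectInnerHTML( sample );

console.log( column.innerHTML ); // <div></div>
console.log( column.innerBlocks[ 0 ].innerHTML ); // <p>Hi</p>
```

Only offsets are stored in the stack frames; the document string itself is sliced just when a span of HTML is known to belong to a block.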
|
||
### I meant, how does it perform? | ||
|
||
This parser operates much faster than the generated parser from the specification. Because we know more about the parsing than the PEG does we can take advantage of several tricks to improve our speed and memory usage: | ||
|
||
- We only have one or two distinct tokens, depending on how you look at it, and they are all readily matched via a regular expression. Instead of parsing on a character-by-character basis we can allow the PCRE RegExp engine to skip over large swaths of the document for us in order to find those tokens. | ||
- Since `preg_match()` takes an `offset` parameter we can crawl through the input without passing copies of the input text on every step. We can track our position in the string and only pass a number instead. | ||
- Not copying all those strings means that we'll also skip many memory allocations. | ||
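The bullets above describe the PHP side (`preg_match()` with an offset). A comparable approach in JavaScript is to set `lastIndex` on a global RegExp before calling `exec()`, so the search resumes at a numeric offset without slicing the input. The sketch below assumes a deliberately simplified delimiter pattern, and `nextToken` and `tokenize` are hypothetical names rather than the package's API.

```js
// Simplified delimiter pattern, for illustration only (no attributes,
// namespaces, or self-closing blocks).
const tokenizer = /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*)\s+-->/g;

// Find the next delimiter at or after `offset`, passing only a number
// along instead of a copy of the remaining text.
function nextToken( document, offset ) {
	tokenizer.lastIndex = offset;
	const match = tokenizer.exec( document );
	if ( match === null ) {
		return null;
	}
	return {
		type: match[ 1 ] ? 'close' : 'open',
		name: match[ 2 ],
		start: match.index,
		length: match[ 0 ].length,
	};
}

function tokenize( document ) {
	const tokens = [];
	let offset = 0;
	let token;
	while ( ( token = nextToken( document, offset ) ) !== null ) {
		tokens.push( token );
		offset = token.start + token.length; // continue after this token
	}
	return tokens;
}

const found = tokenize( 'a<!-- wp:paragraph -->b<!-- /wp:paragraph -->c' );
console.log( found.map( ( t ) => `${ t.type }@${ t.start }` ) );
// [ 'open@1', 'close@23' ]
```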
|
||
Further, tokenizing with a RegExp brings an additional advantage. The parser generated by the PEG provides predictable performance characteristics in exchange for control over tokenization rules -- it doesn't allow us to define RegExp patterns in the rules so as to guard against _e.g._ catastrophic backtracking that would break the PEG guarantees. | ||
|
||
However, since our "token language" of the block comment delimiters is _regular_ and _can_ be trivially matched with RegExp patterns, we can do that here and then something magical happens: we jump out of PHP or JavaScript and into a highly-optimized RegExp engine written in C or C++ on the host system. We thereby leave the virtual machine and its overhead. | ||
|
||
<br/><br/><p align="center"><img src="https://s.w.org/style/images/codeispoetry.png?1" alt="Code is Poetry." /></p> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
{ | ||
"name": "@wordpress/block-serialization-default-parser", | ||
"version": "1.0.0-rc.0", | ||
"description": "Block serialization specification parser for WordPress posts.", | ||
"author": "The WordPress Contributors", | ||
"license": "GPL-2.0-or-later", | ||
"keywords": [ | ||
"wordpress", | ||
"block", | ||
"parser" | ||
], | ||
"homepage": "https://github.com/WordPress/gutenberg/tree/master/packages/block-serialization-default-parser/README.md", | ||
"repository": { | ||
"type": "git", | ||
"url": "https://github.com/WordPress/gutenberg.git" | ||
}, | ||
"bugs": { | ||
"url": "https://github.com/WordPress/gutenberg/issues" | ||
}, | ||
"main": "build/index.js", | ||
"module": "build-module/index.js", | ||
"react-native": "src/index", | ||
"dependencies": { | ||
"@babel/runtime": "^7.0.0" | ||
}, | ||
"publishConfig": { | ||
"access": "public" | ||
} | ||
} |
I added this conditional load as suggested by @mcsf. I did some tests and this approach seems to work well.
I don't like that we are requiring an external file within the scope of a function, but in this case it seems the best alternative. The included file defines a class, and classes in PHP are global, so the definition works even with the include inside the function.
If someone has a better idea for this I'm totally open to changing the approach. If for some reason the commit feels wrong, or we decide to have this in its own PR after landing the main functionality, I'm fine with this commit being discarded.
thanks! I have no problem with that