Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parsing <!DOCTYPE html> #71

Merged
merged 2 commits into from
Feb 3, 2021
Merged

Conversation

rwjblue
Copy link
Collaborator

@rwjblue rwjblue commented Sep 12, 2019

The spec says this about <!DOCTYPE:

DOCTYPEs are required for legacy reasons. When omitted, browsers tend to
use a different rendering mode that is incompatible with some
specifications. Including the DOCTYPE in a document ensures that the
browser makes a best-effort attempt at following the relevant
specifications.

This fixes an issue where we would end up in an incorrect state when the <!DOCTYPE declaration was found (e.g. ember-template-lint/ember-template-lint#719).

Addresses ember-template-lint/ember-template-lint#719
Addresses stefanpenner/find-scripts-srcs-in-document#1


The specific breaking changes here are that the delegate now must have the following new methods:

  beginDoctype(): void;
  appendToDoctypeName(char: string): void;
  appendToDoctypePublicIdentifier(char: string): void;
  appendToDoctypeSystemIdentifier(char: string): void;
  endDoctype(): void;

Closes #28.

@rwjblue rwjblue changed the title Add support for parsing <!DOCTYPE html> WIP Add support for parsing <!DOCTYPE html> Sep 12, 2019
@rwjblue rwjblue changed the title WIP Add support for parsing <!DOCTYPE html> Add support for parsing <!DOCTYPE html> Sep 16, 2019
@Turbo87
Copy link
Contributor

Turbo87 commented Dec 12, 2019

The specific breaking changes here are that the delegate now must have the following new methods

can we make this non-breaking by calling those methods only if they exist?

@wycats
Copy link
Contributor

wycats commented May 11, 2020

for the benefit of historical info: the original theory of this library is that it was basically for body templates, and therefore I didn't implement the states for doctype/script, etc. This was in the interest of keeping the library reasonably small: the states I left out are something like half of all tokenizer states!

I have no problem with working on adding in those states now, especially since the main use-case for this library ends up being preprocessing, which happens in contexts where size doesn't matter so much.

@krisselden
Copy link
Collaborator

@rwjblue parse5 likely is a better fit for embroiders use case /cc @ef4

@ef4
Copy link
Collaborator

ef4 commented Jun 5, 2020

Agreed. We need a complete parser and serializer.

The
[spec](https://html.spec.whatwg.org/multipage/syntax.html#the-doctype)
says this about `<!DOCTYPE`:

> DOCTYPEs are required for legacy reasons. When omitted, browsers tend to
> use a different rendering mode that is incompatible with some
> specifications. Including the DOCTYPE in a document ensures that the
> browser makes a best-effort attempt at following the relevant
> specifications.
This allows the change to be non-breaking.
@rwjblue rwjblue merged commit 074f3c1 into tildeio:master Feb 3, 2021
@rwjblue rwjblue deleted the doctype branch February 3, 2021 20:43
@rwjblue
Copy link
Collaborator Author

rwjblue commented Feb 4, 2021

Apologies for not leaving a comment above when I reopened / merged this. I would like to move this forward (and begin expanding the scope of this library slowly) because I believe that the path forward in SSR is to have the template own the full HTML (instead of having the template rendered output spliced into an HTML content string). Doing this fixes some things that are quite annoying today (e.g. rendering custom <head> content from an Ember / Glimmer.js app).

I will try to investigate migrating @glimmer/syntax to leveraging parse5 instead of simple-html-tokenizer though, I'll open another issue on glimmerjs/glimmer-vm for that.

@wycats
Copy link
Contributor

wycats commented Feb 5, 2021

@rwjblue We definitely need to talk about this before you make any further steps in that direction, but I'm not intrinsically opposed to the approach you have in mind.

@rwjblue
Copy link
Collaborator Author

rwjblue commented Feb 5, 2021

@wycats yep, I was mostly just going to see if it were possible (seems like it should be)

@wycats
Copy link
Contributor

wycats commented Feb 5, 2021

@rwjblue My main concerns would be:

  • our extensions to valid HTML (tag names that start with @ or :)
  • the separation of the lexer and parser, as well as "partial lex mode", which allow us to "splice in" {{...}} tokens in places where they would be illegal (or lex incorrectly)
    • this allows us to support <a href={{some helper "inner string"}}>, which is very difficult in traditional HTML parsers
  • our desire to flag some amount of invalid HTML (most "real-world" parsers fully embrace the error-correcting mode) that is consistent with our extensions (@ and : tags, @ attributes, and curlies in many positions that would be invalid in HTML, especially when they contain nested strings)
  • our ability to directly control the lexer codebase to give correct source locations in error cases (it's not perfect right now, but our control over the codebase has already been useful and would allow us to continue to fix bugs over time).
  • the size of the codebase for hypothetical future in-browser parsing scenarios (HTML5 parsers tend to be big)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

can't parse <!doctype>
6 participants