Add an XML serializer. #21

Closed
wants to merge 15 commits into from

@dmsnell dmsnell commented Sep 11, 2024

Built from WordPress#7331

Provides a mechanism to serialize an HTML fragment to the XML syntax. YOU PROBABLY SHOULDN'T USE THIS!!!!

REMEMBER that so-called "XHTML" served without a path ending in .xml or without the Content-Type: application/xhtml+xml HTTP header will render as HTML, and ONE SHOULD NOT SERVE XML/XHTML as HTML!!!

php > var_dump( ( WP_HTML_Processor::create_fragment( '<p>an <img> is worth &AElig thousand words' ) )->serialize_to_xml() );
string(43) "<p>an <img /> is worth Æ thousand words</p>"
php > var_dump( ( WP_HTML_Processor::create_fragment( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(200) "<svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p>"
php > var_dump( ( WP_HTML_Processor::create_full_parser( '<svg><foreignObject><p>Test<svg><text>Smile</text></p></foreignObject><p>test' ) )->serialize_to_xml() );
string(315) "<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><svg xmlns="http://www.w3.org/2000/svg"><foreignObject xmlns="http://www.w3.org/1999/xhtml"><p>Test<svg xmlns="http://www.w3.org/2000/svg"><text>Smile</text></svg></p></foreignObject></svg><p>test</p></body></html>"
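
The difference between the HTML input and the XML output above can be checked with a rough sketch like the following. It does not use the HTML API at all: `is_well_formed_xml()` is a hypothetical helper written for this illustration, and it assumes PHP's DOM extension (libxml) is available and that the file is saved as UTF-8. The two strings are copied from the first example.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): reports whether a string
 * parses as a well-formed XML document, using PHP's DOM extension.
 */
function is_well_formed_xml( string $markup ): bool {
	// Collect parse errors internally instead of emitting PHP warnings.
	$previous = libxml_use_internal_errors( true );
	$document = new DOMDocument();
	$ok       = $document->loadXML( $markup );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );
	return $ok;
}

// The HTML input from the first example: unclosed tags, entity without semicolon.
$html_input = '<p>an <img> is worth &AElig thousand words';
// The serialized output from the first example.
$xml_output = '<p>an <img /> is worth Æ thousand words</p>';

var_dump( is_well_formed_xml( $html_input ) );
var_dump( is_well_formed_xml( $xml_output ) );
```

The HTML input is rejected by an XML parser while the serialized output is accepted, which is the only property this serializer is meant to provide.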

Extremely rare cases when it's appropriate to use this

  • Exporting HTML content into an Atom feed without escaping it. HTML may/ought to be escaped like <content type="html">&lt;p&gt;yay&lt;/p&gt;</content>, but the document can instead be serialized into <content type="xhtml" xmlns="http://www.w3.org/1999/xhtml"><p>yay</p></content>.
  • When attempting to directly embed HTML content into any other XML document without escaping it.

HTML generally cannot be expressed in XML and, according to the HTML specification, "using the XML syntax is not recommended"! Prefer escaping the HTML to avoid corruption and data loss.
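
As a minimal sketch of the escaping approach, assuming plain PHP and no WordPress functions: `atom_html_content()` is a hypothetical helper named for this illustration, and it relies only on the built-in `htmlspecialchars()` with the `ENT_XML1` flag.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): wraps an HTML string in an
 * Atom <content type="html"> element by escaping it rather than converting it.
 */
function atom_html_content( string $html ): string {
	// ENT_XML1 escapes &, <, > and (with ENT_QUOTES) both quote characters
	// using entities that every XML parser understands.
	$escaped = htmlspecialchars( $html, ENT_XML1 | ENT_QUOTES, 'UTF-8' );
	return '<content type="html">' . $escaped . '</content>';
}

echo atom_html_content( '<p>yay</p>' ), "\n";
// <content type="html">&lt;p&gt;yay&lt;/p&gt;</content>
```

Because the HTML travels as text, the feed reader parses it with its own HTML parser and nothing has to be translated into XML.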

dmsnell and others added 12 commits September 11, 2024 09:37
The HTML Processor understands HTML regardless of how it's written, but
many other functions are unable to do so. There are all sorts of syntax
peculiarities and semantics that would be helpful to eliminate using the
knowledge contained in the HTML Processor.

This patch introduces `WP_HTML_Processor::normalize( $html )` as a
method which takes a fragment of HTML as input and then returns a
serialized version of the input, "cleaning it up" by balancing all
tags, providing all missing optional tags, re-encoding all text,
removing all duplicate attributes, and double-quote-escaping all
attribute values.

Core-62036
Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
If code later in the processing pipeline adds unquoted attributes
and doesn't add the requisite space following that, then another
parser might find that the solidus is part of the attribute value
instead of serving as a self-closing flag.

Co-authored-by: Weston Ruter <westonruter@git.wordpress.org>
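
The solidus problem described in the commit message above can be illustrated with a small sketch. `read_unquoted_value()` is a hypothetical helper written for this note, not code from the patch; it applies only the HTML tokenizer rule that an unquoted attribute value ends at whitespace or `>`.

```php
<?php
/**
 * Hypothetical helper (not from the patch): reads an unquoted attribute
 * value starting at byte offset $at. Only whitespace or ">" ends the value,
 * so a "/" directly after the value is treated as part of it.
 */
function read_unquoted_value( string $html, int $at ): string {
	$length = strcspn( $html, " \t\f\r\n>", $at );
	return substr( $html, $at, $length );
}

$without_space = '<img src=photo.png/>';
$with_space    = '<img src=photo.png />';
$at            = strlen( '<img src=' );

var_dump( read_unquoted_value( $without_space, $at ) ); // value includes the "/"
var_dump( read_unquoted_value( $with_space, $at ) );    // value is "photo.png"
```

With the space in place, the `/` is no longer adjacent to the value and can be read as the self-closing flag.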

github-actions bot commented Sep 11, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props flixos90, dmsnell, peterwilsoncc, sergeybiryukov, gziolo, swissspidy, desrosj, johnbillion, timothyblynjacobs, davidbaumwald, antpb, kadamwhite, jorbin, joedolson, adamsilverstein, helen, drewapicture, jeremyfelt, ramonopoly, hellofromtonya, poena, noisysocks.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.


dmsnell commented Sep 11, 2024

Did you make it past here and think it's still appropriate or somehow better to serve XHTML?

Please just send HTML - it's a different language than XHTML and it's recommended to avoid XHTML. XML cannot adequately express HTML documents.

@dmsnell dmsnell changed the base branch from html-api/normalize-html to trunk September 21, 2024 00:10

dmsnell commented Sep 21, 2024

Replaced by WordPress#7408

@dmsnell dmsnell closed this Sep 21, 2024
@dmsnell dmsnell deleted the html-api/normalize-to-xml branch September 21, 2024 00:41