Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KSES: Allow Custom Data Attributes having dashes in their dataset name. #6429

Open
wants to merge 15 commits into
base: trunk
Choose a base branch
from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Apr 22, 2024

Trac ticket: Core-61501. (was Core-61052)
Alternative in #6598

Allow for additional custom data attributes in wp_kses_attr_check().

In this patch the set of allowable custom data attribute names is expanded to permit those including dashes in the dataset name. The term dataset name here means the name appearing to JavaScript in a browser. This is found by stripping away the data- prefix, lowercasing, and then turning every dash-letter combination into a capital letter.

The Interactivity API relies on data attributes of the form data-wp-bind--enabled. The corresponding dataset name here is wpBind-Enabled, and is currently blocked by wp_kses_attr_check(). This patch allows that and any other data attribute through which would include those dashes in the translated name.

This patch is structured in two parts:

  1. Ensure that Core can recognize whether a given attribute represents a custom data attribute.
  2. Once recognizing the custom data attributes, apply a WordPress-specific set of constraints to determine whether to allow it.

The effect of this change is allowing a double-dash when previously only a single dash after the initial data- prefix was allowed. It's more complicated than this, because, as evidenced in the test suite, double dashes translate into different names based on where they appear and what follows them in the attribute name.

Code used to produce this table
<style>
td, th {
	margin-right: 1em;
}
</style>
<?php

require_once __PATH_TO_REPO__ . '/wordpress-develop/src/wp-load.php';

$variations = [
	'data-',
	'post-id',
	'data-post-id',
	'data-Post-ID',
	'data-uPPer-cAsE',
	"data-\u{2003}",
	'data-🐄-pasture',
	'data-wp-bind--enabled',
	'data-over_under',
	'data--before',
	'data-after-',
	'data-after-after--',
	'data--one--two--',
	'data---one---two---',
	'data-[wish:granted]',
];

echo '<table><tr><th>Attribute Name<th>Dataset Name<th>Before<th>After</tr>';
foreach ( $variations as $name ) {
	$_name    = $name;
	$_value   = 'some-value';
	$_whole   = "{$_name}=\"{$_value}\"";
	$_vless   = 'n';
	$_element = 'div';
	$_allowed_html = [ 'div' => [ 'data-*' => true ] ];

	$was_allowed = wp_old_kses_check( $_name, $_value, $_whole, $_vless, $_element, $_allowed_html );
	$was_allowed = $was_allowed ? '' : '🚫';

	$_name    = $name;
	$_value   = 'some-value';
	$_whole   = "{$_name}=\"{$_value}\"";
	$_vless   = 'n';
	$_element = 'div';
	$_allowed_html = [ 'div' => [ 'data-*' => true ] ];

	$is_allowed = wp_kses_attr_check( $_name, $_value, $_whole, $_vless, $_element, $_allowed_html );
	$is_allowed = $is_allowed ? '' : '🚫';

	echo <<<HTML
	<tr>
	<th><code>{$name}</code>
	<td style="font-family: monospace;" is-data {$name}>{$name}
	<td>{$was_allowed}
	<td>{$is_allowed}
	HTML;
}
echo '</table>';

?>
<script>
document.querySelectorAll( '[is-data]' ).forEach( node => {
	node.innerText = Object.keys( node.dataset ).join( ', ' );
} );
</script>
<?php

function wp_old_kses_check( &$name, &$value, &$whole, $vless, $element, $allowed_html ) {
	$name_low    = strtolower( $name );
	$element_low = strtolower( $element );

	if ( ! isset( $allowed_html[ $element_low ] ) ) {
		$name  = '';
		$value = '';
		$whole = '';
		return false;
	}

	$allowed_attr = $allowed_html[ $element_low ];

	if ( ! isset( $allowed_attr[ $name_low ] ) || '' === $allowed_attr[ $name_low ] ) {
		/*
		 * Allow `data-*` attributes.
		 *
		 * When specifying `$allowed_html`, the attribute name should be set as
		 * `data-*` (not to be mixed with the HTML 4.0 `data` attribute, see
		 * https://www.w3.org/TR/html40/struct/objects.html#adef-data).
		 *
		 * Note: the attribute name should only contain `A-Za-z0-9_-` chars,
		 * double hyphens `--` are not accepted by WordPress.
		 */
		if ( str_starts_with( $name_low, 'data-' ) && ! empty( $allowed_attr['data-*'] )
			&& preg_match( '/^data(?:-[a-z0-9_]+)+$/', $name_low, $match )
		) {
			/*
			 * Add the whole attribute name to the allowed attributes and set any restrictions
			 * for the `data-*` attribute values for the current element.
			 */
			$allowed_attr[ $match[0] ] = $allowed_attr['data-*'];
		} else {
			$name  = '';
			$value = '';
			$whole = '';
			return false;
		}
	}

	if ( 'style' === $name_low ) {
		$new_value = safecss_filter_attr( $value );

		if ( empty( $new_value ) ) {
			$name  = '';
			$value = '';
			$whole = '';
			return false;
		}

		$whole = str_replace( $value, $new_value, $whole );
		$value = $new_value;
	}

	if ( is_array( $allowed_attr[ $name_low ] ) ) {
		// There are some checks.
		foreach ( $allowed_attr[ $name_low ] as $currkey => $currval ) {
			if ( ! wp_kses_check_attr_val( $value, $vless, $currkey, $currval ) ) {
				$name  = '';
				$value = '';
				$whole = '';
				return false;
			}
		}
	}

	return true;
}
Screenshot 2024-05-22 at 10 58 02 AM

These are recommendations. If these naming recommendations are not followed, no errors will occur. The attributes will still be matched using CSS attribute selectors, with the attribute being case insensitive and any attribute value being case-sensitive. Attributes not conforming to these three recommendations will also still be recognized by the JavaScript HTMLElement.dataset property and user-agents will include the attribute in the DOMStringMap containing all the custom data attributes for an HTMLElement.

https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/data-*

Allow spec-compliant data-attributes in `wp_kses_attr_check()`
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell marked this pull request as ready for review April 24, 2024 09:20
Copy link

github-actions bot commented Apr 24, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, peterwilsoncc.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@sirreal
Copy link
Member

sirreal commented May 21, 2024

I dug into the relevant section of the spec, but I think the interesting point here is that data attributes in the browser are handled much more loosely (as already mentioned in the description). The spec seems to explicitly disallow ASCII uppercase and : in data attributes, but Chrome doesn't seem to care at all.

<div data-XYZ:ABC="works">ok</div>
document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeType === document.ATTRIBUTE_NODE
// true
document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeName
// "data-xyz:abc"
document.querySelector('div').attributes.getNamedItem('data-xyz:abc').nodeValue
// "works"
document.querySelector('div').dataset['xyz:abc'];
// "works"
specs and references

A custom data attribute is an attribute in no namespace whose name starts with the string "data-", has at least one character after the hyphen, is XML-compatible, and contains no ASCII upper alphas.

(then)

Attribute names are said to be XML-compatible if they match the Name production defined in XML and they contain no U+003A COLON characters (:). [XML]

(finally):

Names and Tokens
[4]  NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]  Name          ::= NameStartChar (NameChar)*
[6]  Names         ::= Name (#x20 Name)*
[7]  Nmtoken       ::= (NameChar)+
[8]  Nmtokens      ::= Nmtoken (#x20 Nmtoken)*

*/
if ( str_starts_with( $name_low, 'data-' ) && ! empty( $allowed_attr['data-*'] )
&& preg_match( '/^data(?:-[a-z0-9_]+)+$/', $name_low, $match )
&& preg_match( '~^data-[^=/> \\t\\f\\r\\n]+$~', $name_low, $match )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reasoning for matching the inverse of this set of characters specifically? Is this based on the specification for attributes names in HTML?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly, because data attributes themselves are just attributes. there's a large gap of attribute names based on matching ASCII a-z. emoji can form attributes names, for example.

now it might be reasonable to ask, why add this if we don't want to support all those? but the purpose of this detector is to validate data attributes. if we add additional rules above the spec for these, we will miss validating certain data attributes we didn't anticipate.

part of me also didn't want to do this, so maybe what I wrote is wrong, because it ultimately matters most which of these will be turned into the dataset in the browser so I'm happy to change this if we think we need to, but it seemed like they all were collected in my testing, and I didn't want to invite escape hatches based on ad-hoc pattern-matching.

Comment on lines 1365 to 1366
$test = '<div data-foo="foo" data-bar="bar" datainvalid="gone" data--double-dash="retained" data-trailing-dash-="allowable" data-two-hyphens="remains">Pens and pencils</div>';
$expected = '<div data-foo="foo" data-bar="bar" data--double-dash="retained" data-trailing-dash-="allowable" data-two-hyphens="remains">Pens and pencils</div>';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see cases covering data--leading data-trailing- and data-middle--double cases. The -- together in the middle is missing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change it would be good to convert this to use a data provider and add a bunch more valid and invalid use cases.

It probably should have been a data provider all along but please do not look in to the history of this feature to figure out who didn't do that in the first place. ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted in 5294a86

@dmsnell
Copy link
Member Author

dmsnell commented May 21, 2024

I dug into the relevant section of the spec, but I think the interesting point here is that data attributes in the browser are handled much more loosely (as already mentioned in the description). The spec seems to explicitly disallow ASCII uppercase and : in data attributes, but Chrome doesn't seem to care at all.

this is going to be another case where it's evident that the HTML5 specification is a family of interrelated specifications for different purposes. there are different rules for what we allow producing than there are for what we demand for reading.

for example, attributes aren't allowed to be created with brackets in their name, like [if]="var.hasAccent" and yet every parser must recognize this as an attribute. so it's more or less a blur and everything that's allowed is spec, but then we "shouldn't" do things that are against the rules, but they are benign.

on data attributes what I was most concerned about is matching browser behavior. we should have a direct correspondence between the attributes on the server that appear in a dataset for an Element in the browser. if we find cases that all three browsers reject, let's add those in the test suite and note the omission.

@dmsnell
Copy link
Member Author

dmsnell commented May 22, 2024

I've rearranged the code somewhat, doing what I originally wanted to avoid, in creating a new function. Still, the desire to add a test suite overcame my hesitation.

Now, I've added wp_kses_transform_custom_data_attribute_name, which is a mouthful, but does exactly what I believe Core needs to do. That is, it needs to start by recognizing what is and what isn't a data attribute.

Beyond this, we can talk about arbitrarily rejecting some of them, but we can add that on top of this work.

That is, if we only want to allow a subset of data attributes, we can start inside the if where we've detected one, and then reject it, knowing for sure we didn't miss any.

By the way, I ran a test in Safari, Chrome, and Firefox to confirm the test cases I added. They all operate uniformly. Surprisingly, data- (with nothing after the one -) is a valid data attribute whose name is the empty string. This test confirms the relationship between attribute name and appearing in an element's dataset property.

Test Code
<?php

$variations = [
	'data-',
	'post-id',
	'data-post-id',
	'data-Post-ID',
	"data-\u{2003}",
	'data-🐄-pasture',
	'data-wp-bind--enabled',
	'data--before',
	'data-after-',
	'data---one---two---',
	'data-[wish:granted]',
];

foreach ( $variations as $name ) {
	echo "<div><code>{$name}</code>: <span {$name}='{$name}'></span></div>\n";
}

?>
<script>
document.querySelectorAll( 'span' ).forEach( node => {
	node.innerText = JSON.stringify( node.dataset );
} );
</script>
Screenshot 2024-05-21 at 5 11 19 PM

Copy link
Contributor

@peterwilsoncc peterwilsoncc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some early notes, will try to get to a deeper dive later this week.

Comment on lines 1422 to 1428

// Unexpected but recognized custom data attributes.
'Only comprising a prefix' => array( 'data-', '' ),
'With upper case ASCII' => array( 'data-Post-ID', 'postId' ),
'With Unicode whitespace' => array( "data-\u{2003}", "\u{2003}" ),
'With Emoji' => array( 'data-🐄-pasture', '🐄Pasture' ),
'Brackets and colon' => array( 'data-[wish:granted]', '[wish:granted]' ),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordPress can be stricter than the specs and browser implementations (especially that latter if they differ from the former), to an extent that's the entire purpose of KSES.

I don't think accounting for special characters is overly wise as it's too easy to miss situations in which a contributor or author role can break out of the intended behaviour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @peterwilsoncc! in the code for this PR there are two stages: the first is determining if there's a data attribute, and then the second is locking down a smaller allowable subset. these tests, and this function, is to ensure that we can all agree on what is a data attribute before we then apply further constraints.

it doesn't have to be this way, of course, but I find that it's really easy to let things slide through when we conflate the two ideas. in this particular case it looks like everything else should be eliminated in wp_kses_attr_check() other than the specific data attributes, but I also had to double and triple check the logic to make sure we weren't letting through some data attributes only because they didn't fit the pattern we expected.

so this was done for the reason of making it less likely to let something slip through. it doesn't have to be done this way, but it was the purpose for the structuring I made.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this was done for the reason of making it less likely to let something slip through. it doesn't have to be done this way, but it was the purpose for the structuring I made.

Thanks @dmsnell that makes a lot of sense. 👯

Comment on lines 1365 to 1366
$test = '<div data-foo="foo" data-bar="bar" datainvalid="gone" data--double-dash="retained" data-trailing-dash-="allowable" data-two-hyphens="remains">Pens and pencils</div>';
$expected = '<div data-foo="foo" data-bar="bar" data--double-dash="retained" data-trailing-dash-="allowable" data-two-hyphens="remains">Pens and pencils</div>';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change it would be good to convert this to use a data provider and add a bunch more valid and invalid use cases.

It probably should have been a data provider all along but please do not look in to the history of this feature to figure out who didn't do that in the first place. ;)

@dmsnell dmsnell changed the title KSES: Allow all Custom Data Attributes KSES: Allow Custom Data Attributes having dashes in their dataset name. May 22, 2024
@dmsnell
Copy link
Member Author

dmsnell commented May 22, 2024

Interestingly, before this change data--before would have been rejected while afterwards it's allowed. The name of the dataset value doesn't have any dashes though, it's Before

@dmsnell
Copy link
Member Author

dmsnell commented May 22, 2024

@peterwilsoncc I've refactored the test in 5294a86

Also I've updated the PR description to reflect how this PR is no longer attempting to allow all legitimate custom data attributes, but still only those WordPress special-cases, plus the ones with a dash in their dataset name.

Of note, this is slightly different than the approach in #6598. Instead of deciding to accept or reject based on the attribute name, in this patch WordPress is basing the decision off of the translated name. This subtlety means that the test is more directly related to the name of the custom data attribute.

For example, if we only test for the presence of the double dash --, we are comparing that against three different outputs: one with no dashes, one with one dash, and one with two dashes. This is evidenced in the screenshot of the table at the start of this post with the custom data attribute data--one--two-- whose name in JavaScript becomes One-Two--. Each double-dash in that name corresponds to a different translation.

Now, the rule to allow a - in the translated name implicitly allows double-dashes in the attribute name because that's the only way to produce one in the translated name. For example, wpBind-Enabled.

Copy link
Contributor

@peterwilsoncc peterwilsoncc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few more notes after going through this more thoroughly.

I'm a little unclear as to the need for the new function as it's the attribute name that KSES is concerned about, the validation regex within the existing function should be for convertible names.

A function converting an attribute name to it's JavaScript equivalent may be useful as an escaping function for inline scripts but isn't directly related to KSES.

tests/phpunit/tests/kses.php Show resolved Hide resolved
* @param bool $is_allowed Whether the given attribute should be allowed.
*/
public function test_wp_kses_attr_data_attribute_is_allowed( $attribute_name, $is_allowed ) {
$element = "<div {$attribute_name}>Pens and pencils.</div>";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also test "<div {$attribute_name}="populated">Pens and pencils.</div>" -- either in this test or another one using the same data provider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do! Out of curiosity, why do this, since kses doesn't make any distinction for whether to allow a data attribute or not based on its value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why do this, since kses doesn't make any distinction for whether to allow a data attribute or not based on its value?

Mainly as a defensive strategy in case a future maintainer (quite possibly us) makes a mistake. The execution path of KSES is a little very confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a second test for value-containing attributes in 0498bad

@@ -1311,6 +1319,54 @@ function wp_kses_attr_check( &$name, &$value, &$whole, $vless, $element, $allowe
return true;
}

/**
* If an attribute name represents a custom data attribute, return the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adds a summary so the details are all included in the description section of the developer docs (see wp_kses() as an example).

Suggested change
* If an attribute name represents a custom data attribute, return the
* Convert attribute to JavaScript `dataset` form if allowed.
*
* If an attribute name represents a custom data attribute, return the

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterwilsoncc are you recommending a different wording of the summary, or worried that the existing summary won't appear in the docs? The existing summary is up to two lines at the start of the docblock, so this will appear as-written in the patch.

For example, class_name_updates_to_attributes_updates() also has a two-line summary.

Screenshot 2024-05-22 at 4 40 01 PM Screenshot 2024-05-22 at 4 39 33 PM Screenshot 2024-05-22 at 4 39 38 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mainly suggesting different, shorter, phrasing.

I thought the summaries were intended to be a one-liner but just re-read the doc standards and it turns out I was mistaken.

I'll leave this up to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded in dc859f1. It may take some more rewording 🤷‍♂️

src/wp-includes/kses.php Outdated Show resolved Hide resolved
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a very good fix without sacrifices. Thank you!

Comment on lines +1398 to +1418
public static function data_possible_custom_data_attributes_and_transformed_names() {
return array(
// Non-custom-data attributes.
'Normal attribute' => array( 'post-id', null ),
'Single word' => array( 'id', null ),

// Normative custom data attributes.
'Normal custom data attribute' => array( 'data-post-id', 'postId' ),
'Leading dash' => array( 'data--before', 'Before' ),
'Trailing dash' => array( 'data-after-', 'after-' ),
'Double-dashes' => array( 'data-wp-bind--enabled', 'wpBind-Enabled' ),
'Double-dashes everywhere' => array( 'data--one--two--', 'One-Two--' ),
'Triple-dashes' => array( 'data---one---two---', '-One--Two---' ),

// Unexpected but recognized custom data attributes.
'Only comprising a prefix' => array( 'data-', '' ),
'With upper case ASCII' => array( 'data-Post-ID', 'postId' ),
'With medial upper casing' => array( 'data-uPPer-cAsE', 'upperCase' ),
'With Unicode whitespace' => array( "data-\u{2003}", "\u{2003}" ),
'With Emoji' => array( 'data-🐄-pasture', '🐄Pasture' ),
'Brackets and colon' => array( 'data-[wish:granted]', '[wish:granted]' ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that (surprisingly) all these cases do seem to work with way in the browser.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I didn't know how to automate them in WordPress' tests, but I don't know if you saw that I included a test script in the PR description which I used to confirm each case. In PHP I use the old detector in WordPress, the one in this patch, and then in JavaScript I set the innerText to be the property name(s) of the available properties in an element's own dataset, which shows how the browser transforms them.

The included screenshot is from the output of this script, and I compared all three big browsers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed the test script! Thanks for providing it 👍

I did something very similar myself. On my first attempt, I tried using DOM APIs to create an element and set the attribute. That (setAttribute) rejects some attribute names that worked fine when setting the data attributes in HTML.

@dmsnell
Copy link
Member Author

dmsnell commented May 23, 2024

@sirreal what are your thoughts on this vs. #6598? and also about @peterwilsoncc's thoughts.

I'm a little unclear as to the need for the new function as it's the attribute name that KSES is concerned about, the validation regex within the existing function should be for convertible names.

Personally I feel like I could be convinced both ways, but there's this part of me, especially after working with so many issues in WordPress' HTML handling and kses issues, that starting with the specification and then adding constraints is going to pay off over starting with a mix of special constraints and ad-hoc parsing.

That is, for example, maybe today kses rejects anything that looks like a custom data attribute, but tomorrow some other code tries to reuse the logic and misses some valid ones, or lets through one that it shouldn't, because its test wasn't "is this a custom data attribute" but rather "does this attribute follow this regex?" (By the way, @peterwilsoncc, this is another phrasing of my run-around reason for including the function - to define a new semantic which can answer the question "is this a custom data attribute?")

@sirreal
Copy link
Member

sirreal commented May 23, 2024

what are your thoughts on this vs. #6598?

I'm happy to see this PR move forward if we're happy with it, it seems more complete. My only intention with #6598 was to have a PR that was small and easy to land. I'm happy to close that PR.

@peterwilsoncc
Copy link
Contributor

@dmsnell I can see it leading to confusing if the new function doesn't return null for a value that will end up being stripped:

wp> wp_kses_transform_custom_data_attribute_name( "data-love-a-little-🐮-love-a-little-tippin'" )
=> string(34) "loveALittle-🐮LoveALittleTippin'"
wp> wp_kses_post( "<div data-love-a-little-🐮-love-a-little-tippin'>Yes, a musical theatre reference</div>" )
=> string(43) "<div>Yes, a musical theatre reference</div>"

As it's a kses function it suggests support that isn't available. Generally KSES sub-functions aren't intended to be called directly but that can change over time, as we've seen with safecss_filter_attr()

@dmsnell
Copy link
Member Author

dmsnell commented May 24, 2024

@peterwilsoncc 🤔 many of the kses sub-functions are there just for parsing, like wp_kses_hair and wp_kses_hair_parse and wp_kses_attr_parse

would it make more sense to add this into the HTML API and then have kses call it? it would have a clear home there as being a semantic of the HTML layer.

as a reminder, my purpose is to fully vet and adapt this solution for the feedback, not to push to adopt it or choose it in the end.

@peterwilsoncc
Copy link
Contributor

as a reminder, my purpose is to fully vet and adapt this solution for the feedback, not to push to adopt it or choose it in the end.

Yeah, I've got ya. We're working to the same end & it's a complex question :)

would it make more sense to add this into the HTML API and then have kses call it? it would have a clear home there as being a semantic of the HTML layer.

That works as the HTML API seems more about parsing without regard to kses. As the maintainer of the HTML API is that something you are happy with? If it's not something you had planned and don't think is a valid use of the feature then don't feel the need to wedge it in there on my account.

tomorrow some other code tries to reuse the logic and misses some valid ones, or lets through one that it shouldn't, because its test wasn't "is this a custom data attribute" but rather "does this attribute follow this regex?

Are you able to clarify this a little more?

I'm still trying to get clear on the purpose of the new function in my head and how it differs from the regex validation.

@dmsnell
Copy link
Member Author

dmsnell commented May 30, 2024

thanks again @peterwilsoncc, and sorry for my late response, which is due to some transit that threw me off.

That works as the HTML API seems more about parsing without regard to kses.

I'm happy to put this in there since it relates to the spec issues. Originally we had a WP_HTML_Spec class in the HTML API proposal but we removed it. It was basically an all-static class with knowledge like this from the HTML spec, and no other stateful or complicated logic.

The only challenge I see in moving into the HTML API right now is that it's not clear where to put it, and that would probably take some time to see how everyone feels.

As an alternative, we could inline the code for now and move it over when we have a stronger idea where it belongs. Or I guess we could still leave it as its own function so that we can leave in the tests for it, but communicate that it's not to be used or called from the outside, and that it will probably move. As in, maybe we can keep it here for now for practical reasons and defer the bigger design question until we can answer it.

Are you able to clarify this a little more?

Thank you for asking. I don't mean to be unclear, but I don't always get things out very well.

I could change the question here a bit. This entire patch and need to open up kses is complicated by the fact that it's unclear what Core wants or expects. Here's was the existing code claims…

/**
 * Allow `data-*` attributes.
 *
 * …
 * 
 * Note: the attribute name should only contain `A-Za-z0-9_-` chars,
 * double hyphens `--` are not accepted by WordPress.
 */

The existing code makes a claim for the purpose of the regex but then immediately contradicts that claim by specifying new rules. It's there to allow data-* attributes, but then also it's there to only allow some of them. It doesn't explain why these new restrictions are in place or who decided. It's evident that there are additional rules being imposed, but there's no explanation for a process to determine how and if it's safe to change those rules, or know if there's a bug in the original implementation.

For example, should Core allow data-________? It's allowed by the existing regex, but it looks wrong. What makes it wrong? Looks?

Getting back to your question, by separating the question into two steps I find it easier to disambiguate situations like these: is this a custom data attribute? is this one Core allows?

We have a clear goalpost for proper parsing, so we don't have to ask ourselves if the regex was intentionally overlooking other custom data attributes vs. whether it was accidentally overlooking them. Since this entire patch is about changing what WordPress allows, we can focus on WordPress additional rules and not conflate that with how to properly detect legitimate ones. We can change our own sanitization without ever effecting the underlying parsing code.

Anyway I hope I haven't made this more confusing. This is all about making it easier to talk about and verify that code we write is correct and about separating the act of detecting things from allowing some of those things we detect.

It's similar to using the HTML API to find IMG tags with a given class first by finding all of the legitimate IMG tags and then applying further rules, vs. attempting to find the subset directly with some pattern like ~<img src="https://[^"]+"[^>]>~ - it invites conflation of spec. vs. custom behaviors which open up opportunity to miss things.

Edit: I thought of a more concrete example I came upon that might help. In the test suite there's a case of data--before. It won't pass the old regex, but this doesn't have any dashes in it when read from element.dataset.Before. Now we can ask ourselves, was it intentional that Core rejected it, or did nobody think of this particular case? Another test case uses data--one--two--three in part to demonstrate how the dashes on the HTML attribute don't have a 1:1 correspondence with the translated name on the other side (it translate into One-Two--, where each double hyphen corresponds to zero, one, and two hyphens in JavaScript).

@peterwilsoncc
Copy link
Contributor

Thanks for clarifying.

I've been considering this over the weekend and my strong inclination is to go with the source code changes in PR #6598 to allow for the lightest touch changes to land in core.

Doing some quick testing, I can't see it leading to problems but it's worth coming up with additional edge cases. (Also doing quick tests, I can see that some of the browser interpretations are interesting https://jsbin.com/devefix/edit?html,js,console )

That said, if you can think of problematic cases it would lead to, I am happy to listen. It's best to know about these things pre- rather than post-commit.

The only challenge I see in moving into the HTML API right now is that it's not clear where to put it, and that would probably take some time to see how everyone feels.

As an alternative, we could inline the code for now and move it over when we have a stronger idea where it belongs. Or I guess we could still leave it as its own function so that we can leave in the tests for it, but communicate that it's not to be used or called from the outside, and that it will probably move. As in, maybe we can keep it here for now for practical reasons and defer the bigger design question until we can answer it.

If we're to validate the dataset, I'd be inclined to inline it for the time being if the intent is to deprecate in the shortish term.

In the past Core has removed private delegations because it didn't reflect reality (List Tables being the poster-child for this) so we'd need to keep it around. Inlining also allows a more considered approach to inclusion in the HTML API while you figure out where best to place it.

@dmsnell
Copy link
Member Author

dmsnell commented Jun 3, 2024

If you feel happy with #6598 then let's merge that in and leave this open for further consideration. The two aren't contradictory, but this one encompasses more choice. I'll approve that one, and if nobody has merged it by later today I'll try and do that. The most important thing to me is that we allow these where WordPress currently removes them for the Interactivity API's purposes.

In aa63e2b I added your additional test cases. Cases like these are exactly why I approached the problem the way I did, because intuition of the mapping from HTML to JavaScript can be fuzzy, meaning the rules we apply in PHP are in high risk for being different than we think they are unless we write the rules to the output of that transformation.

Thanks @peterwilsoncc!

@sirreal
Copy link
Member

sirreal commented Jun 25, 2024

We should update the trac ticket for this to https://core.trac.wordpress.org/ticket/61501. This would be an enhancement targeting the 6.7 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants