Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ImageCDN parsing to rely on HTML API instead of RegExps #32700

Closed
wants to merge 1 commit into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Aug 26, 2023

Status

This is a work in progress and isn't tested or verified. This has been reviewed, but the filters aren't tested because it's unclear what code might rely on them.

Due to the change in indentation the diff view is more associated with the actual changes if ignoring whitespace.

Proposed changes:

The introduction of the HTML API into WordPress 6.2 offers a new method of matching and modifying HTML. In this patch we're replacing code that attempts to parse the input HTML and extract images that are direct children of an anchor ("A" tag), then read and modify them based on the values of their attributes and computed Photon properties.

In the previous code the Image_CDN class scanned the entire HTML document to generate a list of PREG image match objects, then iterated over those matches and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along the way, and making the appropriate modifications. Extra care is taken to ensure that only images that are the single child of a link are matched.

In this change the values of the tag key in some of the filters has changed from the initial matched HTML snippet to the name of the image tag, which could be IMG or AMP-IMG or AMP-ANIM. An update to the Tag Processor or a custom sub-class thereof could provide the original HTML snippet and match the existing behavior, but that hasn't been done in this patch yet given the author's uncertainty about the use and value of those snippets.

Other information:

  • Have you written new tests for your changes, if applicable?
  • Have you checked the E2E test CI results, and verified that your changes do not break them?
  • Have you tested your changes on WordPress.com, if applicable (if so, you'll see a generated comment below with a script to run)?

Jetpack product discussion

This mandates running on WordPress 6.2. If we prefer being able to run on older versions, we could also include a copy of the Tag Processor in the plugin.

Does this pull request change what data or activity we track or use?

No.

Testing instructions:

Test suite should pass. Manual review is necessary though.
I don't even know what all the function I modified is supposed to do fully, so code auditing and review of the modifications is critical.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 26, 2023

Are you an Automattician? Please test your changes on all WordPress.com environments to help mitigate accidental explosions.

  • To test on WoA, go to the Plugins menu on a WordPress.com Simple site. Click on the "Upload" button and follow the upgrade flow to be able to upload, install, and activate the Jetpack Beta plugin. Once the plugin is active, go to Jetpack > Jetpack Beta, select your plugin, and enable the image-cdn/rely-on-html-api branch.

  • To test on Simple, run the following command on your sandbox:

    bin/jetpack-downloader test jetpack image-cdn/rely-on-html-api
    

Interested in more tips and information?

  • In your local development environment, use the jetpack rsync command to sync your changes to a WoA dev blog.
  • Read more about our development workflow here: PCYsg-eg0-p2
  • Figure out when your changes will be shipped to customers here: PCYsg-eg5-p2

@github-actions github-actions bot added [Feature] Photon aka "Image CDN". Feature developed in the Image CDN package and shipped in multiple plugins [Package] Image CDN [Status] In Progress labels Aug 26, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Aug 26, 2023

Thank you for your PR!

When contributing to Jetpack, we have a few suggestions that can help us test and review your patch:

  • ✅ Include a description of your PR changes.
  • ✅ Add a "[Status]" label (In Progress, Needs Team Review, ...).
  • ✅ Add testing instructions.
  • ✅ Specify whether this PR includes any changes to data or privacy.
  • ✅ Add changelog entries to affected projects

This comment will be updated as you work on your PR and make changes. If you think that some of those checks are not needed for your PR, please explain why you think so. Thanks for cooperation 🤖


The e2e test report can be found here. Please note that it can take a few minutes after the e2e tests checks are complete for the report to be available.


Once your PR is ready for review, check one last time that all required checks appearing at the bottom of this PR are passing or skipped.
Then, add the "[Status] Needs Team Review" label and ask someone from your team review the code. Once reviewed, it can then be merged.
If you need an extra review from someone familiar with the codebase, you can update the labels from "[Status] Needs Team Review" to "[Status] Needs Review", and in that case Jetpack Approvers will do a final review of your PR.

@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch 6 times, most recently from ba4718b to 564599c Compare August 26, 2023 01:27
@github-actions github-actions bot added the Docs label Aug 29, 2023
@dmsnell dmsnell force-pushed the image-cdn/rely-on-html-api branch 2 times, most recently from 6b724e7 to 583e9ca Compare August 29, 2023 02:28
@jeherve jeherve mentioned this pull request Aug 29, 2023
16 tasks
Copy link
Member

@jeherve jeherve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for proposing this! This is a nice way to use WP_HTML_Tag_Processor, and something we can do now that we'll be requiring WordPress 6.2 (#31638).

I have not tested your PR, but left a couple of comments below about how things work in the monorepo, if that can help you discover how we usually do things :)

projects/packages/image-cdn/src/class-image-cdn.php Outdated Show resolved Hide resolved
projects/packages/image-cdn/CHANGELOG.md Outdated Show resolved Hide resolved
@oskosk
Copy link
Contributor

oskosk commented Aug 29, 2023

@haqadn @Automattic/heart-of-gold can you make sure you keep an eye on this one?

@dmsnell dmsnell changed the title WIP: Refactor ImageCDN parsing to rely on HTML API instead of RegExps Refactor ImageCDN parsing to rely on HTML API instead of RegExps Aug 29, 2023
@dmsnell dmsnell marked this pull request as ready for review September 4, 2023 20:54
@dmsnell
Copy link
Member Author

dmsnell commented Sep 5, 2023

@oskosk @jeherve should I try and find someone to review this? are either of you happy with the proposed change?

@dmsnell
Copy link
Member Author

dmsnell commented Sep 5, 2023

One note I should leave here is to ask about pulling in Core's assertEqualMarkup() function. I wanted to add it to the BaseTestCase but then I saw that's an external class and wasn't sure where to add it so that it would be available all around Jetpack.

The issue here is that the test output changes but in a semantically neutral way. I could rewrite the test so that it hardcodes the newer ordering of output, but that would leave this fragility in place for the future.

Copy link
Member

@jeherve jeherve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I try and find someone to review this? are either of you happy with the proposed change?

Maybe @Automattic/heart-of-gold could take a look?

/**
* Allow specific images to be skipped by Photon.
*
* @TODO: Does this need to pass the full HTML of the image tag?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at usage of this filter in other plugins in the WordPress.org plugin directory. While most plugins do not use this third parameter, there are a few that do, sometimes to look for a specific class name in the HTML. So I think we should keep this available if possible.

*
* @see https://developer.wordpress.com/docs/photon/api/
* @TODO: What is the point of passing $images and $index. Are they required?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked around but couldn't figure out where that filter is being used today. It was used in the past, but may not be used anymore today.

haqadn
haqadn previously approved these changes Sep 12, 2023
Copy link
Contributor

@haqadn haqadn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes look good to me and no problems found after testing. 👍🏼

P.S.: I only worked on migrating from a Jetpack module to image-cdn package. I don't have prior knowledge of the implementation details on the core functionality.

@dmsnell
Copy link
Member Author

dmsnell commented Sep 13, 2023

I will reconstruct the existing filter value before merge.
Would still love to have some thoughts on the assertEquivalentMarkup() method from Core, if there's a better place to put that.

@@ -178,6 +178,51 @@ protected function helper_remove_image_sizes() {
remove_image_size( 'jetpack_soft_oversized_after_upload' );
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these functions are okay here.

Personally, I would put them in the parent class at class-image-cdn-attachment-test-case.php but, it looks like this is the only (current) test that extends this particular class.

@kraftbj
Copy link
Contributor

kraftbj commented Sep 14, 2023

Since we're late in the beta testing week, I'd say let's merge this next week after the branch cut so we get the longest amount of time to test it in the wild before the next dotorg release

Copy link
Contributor

This PR has been marked as stale. This happened because:

  • It has been inactive for the past 3 months.
  • It hasn’t been labeled `[Pri] BLOCKER`, `[Pri] High`, etc.

If this PR is still useful, please do a [trunk merge or rebase](https://github.com/Automattic/jetpack/blob/trunk/docs/git-workflow.md#keeping-your-branch-up-to-date) and otherwise make sure it's up to date and has clear testing instructions. You may also want to ping possible reviewers in case they've forgotten about it. Please close this PR if you think it's not valid anymore — if you do, please add a brief explanation.

If the PR is not updated (or at least commented on) in another month, it will be automatically closed.

@dmsnell
Copy link
Member Author

dmsnell commented Dec 14, 2023

whoops, was I supposed to merge this? I'll try and rebase the branch and get it ready again without changing the code, then maybe we can merge next week or so.

thingalon
thingalon previously approved these changes Dec 15, 2023
Copy link
Member

@thingalon thingalon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just went through filter_the_content line-by-line, comparing its functionality to the original. Well done, I think you have nailed it!

I do have a couple of minor questions and suggestions - but absolutely none of them are blockers in my opinion so I'm giving this a ✅ on behalf of Heart of Gold.

Should we also remove parse_images_from_html which now only appears in its test suite?

return null;
}

$search_pattern = sprintf( '#(?:^|[ \t\f\r\n])%s(?P<value>[^ \t\f\r\n]+)#', preg_quote( $prefix, null ) );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style question: Is there a difference between [\s] and [ \t\f\r\n]? If not, would it be more readable using \s?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question! I'm so used to the HTML API where we've been very explicit about these that I didn't think about it, but the PHP docs suggest that \s may be affected by locale options. I think it would be just as good to leave this as-is, though \s is unlikely to be a big problem in practice so if you prefer it I'm happy to make that change.

apparently it may include additional unicode characters representing whitespace if those locale rules are active, which would diverge from how the CSS class names are parsed in the browser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unicode can't let anything be simple, not even whitespace.
What you've written would be closer to the HTML spec, so let's stick with how you've done it.

* @param string $tag Image Tag (Image HTML output).
*/
if ( apply_filters( 'jetpack_photon_skip_image', false, $src, $tag ) ) {
// @TODO: Do "PICTURE" and "SOURCE" belong here as well?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe we should target these. However, I would do it in a separate PR to keep the reviewing/auditing easier, and less likely to produce surprises.

Comment on lines 496 to 512
$size = self::parse_class_name_prefixed_by( $class, 'size-' );
if ( null !== $size && false === $width && false === $height && 'full' !== $size && array_key_exists( $size, $image_sizes ) ) {
$width = (int) $image_sizes[ $size ]['width'];
$height = (int) $image_sizes[ $size ]['height'];
$transform = $image_sizes[ $size ]['crop'] ? 'resize' : 'fit';
} else {
unset( $size );
}
Copy link
Member

@thingalon thingalon Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor point, but can we skip calling parse_class_name_prefixed_by entirely if width or height are not false, skipping some regexes per-image where sizes are explicitly set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, why do more work than we need to?

dmsnell added a commit that referenced this pull request Dec 23, 2023
)

The introduction of the HTML API into WordPress 6.2 offers a new method of
matching and modifying HTML. In this patch we're replacing code that attempts
to parse the input HTML and visit all images, then read and modify them based
on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to
generate a list of PREG image match objects, then iterated over those matches
and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along
the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
@dmsnell
Copy link
Member Author

dmsnell commented Dec 23, 2023

All, thanks for your feedback; I've finally updated this PR and I think I have addressed everything. It now respects the old value for the $tag parameter in the jetpack_photon_skip_image filter.

dmsnell added a commit that referenced this pull request Dec 23, 2023
)

The introduction of the HTML API into WordPress 6.2 offers a new method of
matching and modifying HTML. In this patch we're replacing code that attempts
to parse the input HTML and visit all images, then read and modify them based
on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to
generate a list of PREG image match objects, then iterated over those matches
and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along
the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
haqadn
haqadn previously approved these changes Jan 8, 2024
Copy link
Contributor

@haqadn haqadn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed call to WP_HTML_Tag_Processor by using the correct case. Looks like the PR is ready to get merged.

)

The introduction of the HTML API into WordPress 6.2 offers a new method of
matching and modifying HTML. In this patch we're replacing code that attempts
to parse the input HTML and visit all images, then read and modify them based
on the values of their attributes and computed Photon properties.

In the previous code the `Image_CDN` class scanned the entire HTML document to
generate a list of PREG image match objects, then iterated over those matches
and performed string-replace operations on them.

Now the class does a pass from start to finish, visting each image tag along
the way, and making the appropriate modifications.

Co-authored-by: Adnan Haque <3737780+haqadn@users.noreply.github.com>
Co-authored-by: Brandon Kraft <public@brandonkraft.com>
Co-authored-by: Jeremy Herve <jeremy@jeremy.hu>
Co-authored-by: Mark George <thingalon@gmail.com>
Co-authored-by: Osk <oskosk@users.noreply.github.com>
@dmsnell
Copy link
Member Author

dmsnell commented Feb 2, 2024

thanks @haqadn. sorry all I've been holding out on this so I can try and test it before merging. too much to do!

Copy link
Contributor

github-actions bot commented May 5, 2024

This PR has been marked as stale. This happened because:

  • It has been inactive for the past 3 months.
  • It hasn’t been labeled `[Pri] BLOCKER`, `[Pri] High`, etc.

If this PR is still useful, please do a [trunk merge or rebase](https://github.com/Automattic/jetpack/blob/trunk/docs/git-workflow.md#keeping-your-branch-up-to-date) and otherwise make sure it's up to date and has clear testing instructions. You may also want to ping possible reviewers in case they've forgotten about it. Please close this PR if you think it's not valid anymore — if you do, please add a brief explanation.

If the PR is not updated (or at least commented on) in another month, it will be automatically closed.

Copy link
Contributor

This PR has been automatically closed as it has not been updated in some time. If you want to resume work on the PR, feel free to restore the branch and reopen the PR.

@github-actions github-actions bot closed this Jun 17, 2024
@github-actions github-actions bot deleted the image-cdn/rely-on-html-api branch June 17, 2024 00:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs [Feature] Photon aka "Image CDN". Feature developed in the Image CDN package and shipped in multiple plugins [Package] Image CDN [Status] In Progress [Status] Stale [Tests] Includes Tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants