Allow skipping rel=canonical links (via config or built-in) #175

nkuehn · 2020-07-01T07:38:46Z

We're stuck with a workflow problem that our previous link checker (ruby based HTMLProofer) solves with the ability to annotate specific DOM elements to be ignored by the link checker by adding a data-proofer-ignore attribute.

Here's the workflow issue:

The page has a <link rel="canonical" ...> head information that points to the live production domain's absolute URL. That link is checked by linkinator, too - which is generally the right approach and a nice feature that all sorts of URLs are checked.
I'm adding a new page in a branch. The branch is automatically built and tested on CI and deployed to a preview URL (vercel / netlify etc)
Test fails because it runs linkinator locally and the canonical tag points to the live domain which is not live because it's a preview. That's a logical problem and not a technical, so it's not a bug but rather something I would love to be able to configure.

Since the canonical link generation is not a potential source of broken links we would like to exclude it.

Any objections against contributing either an option to generally exclude the canonical tag if it's a local filesystem check and the canonical target is remote or to support an HTML attribute skip-link-checker that does not require configuration?

The text was updated successfully, but these errors were encountered:

nkuehn · 2020-09-18T13:50:16Z

@JustinBeckwith we would start working on a contribution right now but hesitate because it would be better to hear your opinion on the right approach.

We prefer a fixed convention to not check absolute <link rel="canonical" ...> links if the check is run on the local filesystem. They will by nature not be correct if you test a changed site locally.

Adding a config flag is good, too but if it's not needed It's a maintenance burden.

JustinBeckwith · 2020-09-18T15:16:35Z

Hey @nkuehn, thank you for really digging into explaining your workflow, and what's going on here - and I really appreciate the offer to cut a PR! Thinking out loud - I would imagine that in most cases, wouldn't folks want to make sure the url used in <link rel="canonical" ...> is valid? That seems like the kind of thing we'd want to protect against in the general case. I want to make sure I fully understand it though.

On the solution space - right now I'm leaning in 1 of 2 directions. Naively - would it be possible for you to use the skip property, specifically targeting your canonical url here?
https://github.com/JustinBeckwith/linkinator/#command-usage

On the other hand - I do think there's value in providing a way to opt specific HTML elements out of link checking (as opposed to the link skip). Where I'm not sure on the right approach is the "how". The data-* is elegant, but kind of feels gross because I'm asking people to modify their sources directly so they can use the tool. The alternative idea bouncing around my head is an approach where we use the linkinator.config.json, and include a list of jquery style selectors in an array that can be opted out there.

Thoughts?

nkuehn · 2020-09-21T19:48:50Z

wouldn't folks want to make sure the url used in <link rel="canonical" ...> is valid?

Sure! .... if the build that they are just testing on the filesystem is not changing any addresses. If the build that's being tested on the filesystem is changing a page's address, it will fail because canononical links should be absolute and naturally it won't exist yet on the remote server (at least Google asks for that - "Use absolute paths rather than relative paths with the rel="canonical" link element.")

Naively - would it be possible for you to use the skip property, specifically targeting your canonical url here?

I would have to exclude my own production hostname for that. We actually do this right now to be able to use a canonical tag at all with linkinator. But it's coming with the unwanted side effect that every time some author uses an absolute URL it's not checked any more. Plus, some parts of the content under that domain belong to another system, so they have to be written as absolute links and should be link-checked.

Ideas

If you lean to a powerful generic method, allowing the "skip" entries to be a cheerio selector over the current DOM node instead of only the URL that would be pretty nice. A quick look over the codebase gave me the impression that it could turn into a refactor because the cheerio "DOM object" is not passed around in the implementation.
Alternatively, a global expression that defines what part of the document is to be crawled and checked at all could work, too. So if if want to exclude certain parts of the document something like (pseudocode) cheerio(..).not("link[rel='canonical']"). I would then have to pass link[rel='canonical'] to linkinator somehow. Could get messy with multiple such exclusions though.
Lastly, more straightforward from the existing codebase: Make relValuesToIgnore (https://github.com/JustinBeckwith/linkinator/blob/master/src/links.ts#L52) overrideable in linkinator.config.json. The list of potential rel values ( https://developer.mozilla.org/en-US/docs/Web/HTML/Link_types ) is full of situations that are candidates for excluding from link checking anyways, so it's a middle ground between overly specific and very generic.

I like the last most TBH.

nkuehn · 2020-10-19T16:09:41Z

Did you have the opportunity to take a look at the idea I proposed in the PR?

nkuehn · 2021-11-30T08:10:18Z

FYI I closed the PR now, we're covering the problem with the new URL rewriting feature now. Rewriting the production domain to the current preview domain. It has side effects but for now does the job.

nkuehn mentioned this issue Jul 1, 2020

feat(theme): allow to optionally enable canonical urls commercetools/commercetools-docs-kit#531

Closed

nkuehn changed the title ~~Allow annotating specific links to be skipped (here: canonical link in the head)~~ Allow skipping rel=canonical links (via config or built-in) Sep 18, 2020

JustinBeckwith added the enhancement New feature or request label Sep 18, 2020

davifantasia mentioned this issue Sep 21, 2020

Contribute skipping rel=canonical to linkinator commercetools/commercetools-docs-kit#659

Open

nkuehn mentioned this issue Sep 24, 2020

feat: option to customize the skipped link relation types #180

Closed

Trott mentioned this issue Jan 4, 2022

doc: jan-2022 sec release pre-announcement nodejs/nodejs.org#4344

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow skipping rel=canonical links (via config or built-in) #175

Allow skipping rel=canonical links (via config or built-in) #175

nkuehn commented Jul 1, 2020 •

edited

Loading

nkuehn commented Sep 18, 2020

JustinBeckwith commented Sep 18, 2020

nkuehn commented Sep 21, 2020

nkuehn commented Oct 19, 2020

nkuehn commented Nov 30, 2021

Allow skipping rel=canonical links (via config or built-in) #175

Allow skipping rel=canonical links (via config or built-in) #175

Comments

nkuehn commented Jul 1, 2020 • edited Loading

nkuehn commented Sep 18, 2020

JustinBeckwith commented Sep 18, 2020

nkuehn commented Sep 21, 2020

nkuehn commented Oct 19, 2020

nkuehn commented Nov 30, 2021

nkuehn commented Jul 1, 2020 •

edited

Loading