Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow skipping rel=canonical links (via config or built-in) #175

Open
nkuehn opened this issue Jul 1, 2020 · 5 comments
Open

Allow skipping rel=canonical links (via config or built-in) #175

nkuehn opened this issue Jul 1, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@nkuehn
Copy link

nkuehn commented Jul 1, 2020

We're stuck with a workflow problem that our previous link checker (ruby based HTMLProofer) solves with the ability to annotate specific DOM elements to be ignored by the link checker by adding a data-proofer-ignore attribute.

Here's the workflow issue:

  • The page has a <link rel="canonical" ...> head information that points to the live production domain's absolute URL. That link is checked by linkinator, too - which is generally the right approach and a nice feature that all sorts of URLs are checked.
  • I'm adding a new page in a branch. The branch is automatically built and tested on CI and deployed to a preview URL (vercel / netlify etc)
  • Test fails because it runs linkinator locally and the canonical tag points to the live domain which is not live because it's a preview. That's a logical problem and not a technical, so it's not a bug but rather something I would love to be able to configure.

Since the canonical link generation is not a potential source of broken links we would like to exclude it.

Any objections against contributing either an option to generally exclude the canonical tag if it's a local filesystem check and the canonical target is remote or to support an HTML attribute skip-link-checker that does not require configuration?

@nkuehn nkuehn changed the title Allow annotating specific links to be skipped (here: canonical link in the head) Allow skipping rel=canonical links (via config or built-in) Sep 18, 2020
@nkuehn
Copy link
Author

nkuehn commented Sep 18, 2020

@JustinBeckwith we would start working on a contribution right now but hesitate because it would be better to hear your opinion on the right approach.

We prefer a fixed convention to not check absolute <link rel="canonical" ...> links if the check is run on the local filesystem. They will by nature not be correct if you test a changed site locally.

Adding a config flag is good, too but if it's not needed It's a maintenance burden.

@JustinBeckwith JustinBeckwith added the enhancement New feature or request label Sep 18, 2020
@JustinBeckwith
Copy link
Owner

Hey @nkuehn, thank you for really digging into explaining your workflow, and what's going on here - and I really appreciate the offer to cut a PR! Thinking out loud - I would imagine that in most cases, wouldn't folks want to make sure the url used in <link rel="canonical" ...> is valid? That seems like the kind of thing we'd want to protect against in the general case. I want to make sure I fully understand it though.

On the solution space - right now I'm leaning in 1 of 2 directions. Naively - would it be possible for you to use the skip property, specifically targeting your canonical url here?
https://github.com/JustinBeckwith/linkinator/#command-usage

On the other hand - I do think there's value in providing a way to opt specific HTML elements out of link checking (as opposed to the link skip). Where I'm not sure on the right approach is the "how". The data-* is elegant, but kind of feels gross because I'm asking people to modify their sources directly so they can use the tool. The alternative idea bouncing around my head is an approach where we use the linkinator.config.json, and include a list of jquery style selectors in an array that can be opted out there.

Thoughts?

@nkuehn
Copy link
Author

nkuehn commented Sep 21, 2020

wouldn't folks want to make sure the url used in <link rel="canonical" ...> is valid?

Sure! .... if the build that they are just testing on the filesystem is not changing any addresses. If the build that's being tested on the filesystem is changing a page's address, it will fail because canononical links should be absolute and naturally it won't exist yet on the remote server (at least Google asks for that - "Use absolute paths rather than relative paths with the rel="canonical" link element.")


Naively - would it be possible for you to use the skip property, specifically targeting your canonical url here?

I would have to exclude my own production hostname for that. We actually do this right now to be able to use a canonical tag at all with linkinator. But it's coming with the unwanted side effect that every time some author uses an absolute URL it's not checked any more. Plus, some parts of the content under that domain belong to another system, so they have to be written as absolute links and should be link-checked.

Ideas

  1. If you lean to a powerful generic method, allowing the "skip" entries to be a cheerio selector over the current DOM node instead of only the URL that would be pretty nice. A quick look over the codebase gave me the impression that it could turn into a refactor because the cheerio "DOM object" is not passed around in the implementation.
  2. Alternatively, a global expression that defines what part of the document is to be crawled and checked at all could work, too. So if if want to exclude certain parts of the document something like (pseudocode) cheerio(..).not("link[rel='canonical']"). I would then have to pass link[rel='canonical'] to linkinator somehow. Could get messy with multiple such exclusions though.
  3. Lastly, more straightforward from the existing codebase: Make relValuesToIgnore (https://github.com/JustinBeckwith/linkinator/blob/master/src/links.ts#L52) overrideable in linkinator.config.json. The list of potential rel values ( https://developer.mozilla.org/en-US/docs/Web/HTML/Link_types ) is full of situations that are candidates for excluding from link checking anyways, so it's a middle ground between overly specific and very generic.

I like the last most TBH.

@nkuehn
Copy link
Author

nkuehn commented Oct 19, 2020

Did you have the opportunity to take a look at the idea I proposed in the PR?

@nkuehn
Copy link
Author

nkuehn commented Nov 30, 2021

FYI I closed the PR now, we're covering the problem with the new URL rewriting feature now. Rewriting the production domain to the current preview domain. It has side effects but for now does the job.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants