Allow specifying HTTP request parameters #52

Open
diegorondini opened this issue Jul 15, 2022 · 10 comments

@diegorondini
Contributor

Is your feature request related to a problem? Please describe.
Some URLs require specific HTTP request parameters.
One example is the GitHub Docs pages. For instance, this .md file will fail:

$ cat mdtest.md 
= Test =

[Github docs link](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository)

$ mlc

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                                                          +
+            markup link checker - mlc v0.15.2             +
+                                                          +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[Err ] ./mdtest.md (3, 1) => https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository - 403 - Forbidden

Result (1 links):

OK       0
Skipped  0
Warnings 0
Errors   1


The following links could not be resolved:

./mdtest.md (3, 1) => https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository.

The reason is that the page requires specific HTTP headers:
community/community#14773

Describe the solution you'd like
It would be nice to have a way to specify HTTP request parameters, possibly per-URL.

@becheran
Owner

I like this idea. I just don't know how exactly one would pass all the possible header fields to mlc. Via a command-line argument?

@diegorondini
Contributor Author

diegorondini commented Jul 18, 2022

Probably the best option would be a config file, otherwise it would be impractical to specify different headers for different URLs.

See for example:
https://github.com/orgs/github-community/discussions/14773#discussioncomment-2679987
https://github.com/tcort/markdown-link-check#config-file-format
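
To make the config-file idea concrete, here is a minimal sketch of how per-URL headers could be looked up. It is not mlc's implementation: the JSON layout is only loosely modeled on the `httpHeaders` section of the markdown-link-check config linked above, the names `EXAMPLE_CONFIG`, `load_header_rules`, and `headers_for` are hypothetical, and the header value is the one reported later in this thread.

```python
import json

# Hypothetical config text; the layout (urls + headers per rule) is an assumption.
EXAMPLE_CONFIG = """
{
  "httpHeaders": [
    {
      "urls": ["https://docs.github.com/"],
      "headers": {"Accept-Encoding": "zstd, br, gzip, deflate"}
    }
  ]
}
"""


def load_header_rules(text):
    """Parse the config text into a list of (url_prefixes, headers) pairs."""
    config = json.loads(text)
    return [(rule["urls"], rule["headers"]) for rule in config.get("httpHeaders", [])]


def headers_for(url, rules):
    """Merge the headers of every rule that has a prefix matching `url`."""
    merged = {}
    for prefixes, headers in rules:
        if any(url.startswith(prefix) for prefix in prefixes):
            merged.update(headers)
    return merged
```

With this layout, URLs that match no rule get an empty header set, so the default behaviour of the checker would stay unchanged for them.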

@diegorondini
Contributor Author

I think your pipeline has been hit by this bug:
https://github.com/becheran/mlc/actions/runs/3559864946/jobs/5979511630

[Err ] ./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions - 403 - Forbidden
Error: https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions. 403 - Forbidden

@becheran
Owner

@diegorondini fun fact: it does not fail when I run it locally. Does GitHub somehow prevent requests to GitHub.com from their own runners? You mentioned missing request parameters. What would those be in this case?

@diegorondini
Contributor Author

diegorondini commented Nov 28, 2022

@becheran I think the first question is why the pipeline checks that link even though there's no such link in the README.md:

$ grep 'docs\.github' README.md

Returning to this bug, docs.github.com requires the Accept-Encoding: zstd, br, gzip, deflate header:

$ curl -i -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 403 
x-azure-ref: 0wn2EYwAAAACr4P2HgpUzTatC1/nj5XnyTU5aMjIxMDYwNjEzMDIxADU5NmQ3OGEyLWNhNWYtNDc5ZC1iY2RjLTA4MzU4MzMxNzRiMg==
accept-ranges: bytes
via: 1.1 varnish, 1.1 varnish
date: Mon, 28 Nov 2022 09:22:10 GMT
x-served-by: cache-iad-kiad7000135-IAD, cache-mrs10563-MRS
x-cache: MISS, MISS
x-cache-hits: 0, 0
x-timer: S1669627330.213655,VS0,VE92
strict-transport-security: max-age=31557600

$ curl -i -H "Accept-Encoding: zstd, br, gzip, deflate" -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 200 
cache-control: public, max-age=60
content-type: text/html; charset=utf-8
access-control-allow-origin: *
content-security-policy: default-src 'none';prefetch-src 'self';connect-src 'self';font-src 'self' data: githubdocs.azureedge.net;img-src 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com data: githubdocs.azureedge.net placehold.it;object-src 'self';script-src 'self' data: githubdocs.azureedge.net;frame-src 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com https://www.youtube-nocookie.com;frame-ancestors 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com;style-src 'self' 'unsafe-inline' data: githubdocs.azureedge.net;child-src 'self';upgrade-insecure-requests;base-uri 'self';form-action 'self';script-src-attr 'none'
cross-origin-opener-policy: same-origin
cross-origin-resource-policy: same-origin
x-dns-prefetch-control: off
x-frame-options: SAMEORIGIN
x-download-options: noopen
x-content-type-options: nosniff
origin-agent-cluster: ?1
x-permitted-cross-domain-policies: none
referrer-policy: strict-origin-when-cross-origin
x-xss-protection: 0
x-powered-by: Next.js
x-azure-ref: 0hXyEYwAAAADMF8jkAx/XToTRxIg5u1m/UEhMMzBFREdFMDMxOQA1OTZkNzhhMi1jYTVmLTQ3OWQtYmNkYy0wODM1ODMzMTc0YjI=
content-encoding: br
via: 1.1 varnish, 1.1 varnish
accept-ranges: bytes
date: Mon, 28 Nov 2022 09:22:29 GMT
age: 335
x-served-by: cache-iad-kiad7000135-IAD, cache-mrs10583-MRS
x-cache: CONFIG_NOCACHE, HIT, HIT
x-cache-hits: 3, 1
x-timer: S1669627349.305248,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
content-length: 38324

Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.
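
The same comparison can be sketched in Python with only the standard library. This is an illustration rather than how mlc issues requests: `build_request` and `status_of` are hypothetical helpers, and because Python's HTTP client adds default headers of its own (such as a User-Agent), the status codes it gets may differ from the curl output above.

```python
import urllib.error
import urllib.request

URL = "https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions"


def build_request(url, headers=None):
    """Create a GET request carrying the given extra headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=headers or {}, method="GET")


def status_of(request, timeout=10.0):
    """Send the request and return the HTTP status code, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# The two variants compared with curl above: without and with Accept-Encoding.
plain = build_request(URL)
with_encoding = build_request(URL, {"Accept-Encoding": "zstd, br, gzip, deflate"})
```

Calling `status_of(plain)` and `status_of(with_encoding)` performs the actual network requests. The body is never decoded here, which is enough for a link check that only looks at the status code.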

@diegorondini
Contributor Author

Sorry, I just realized I should have checked out the github-action-output branch.
Now it fails for me as well with 0.15.4:

$ mlc ./README.md

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                                                          +
+            markup link checker - mlc v0.15.4             +
+                                                          +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

09:31:29 [WARN] Broken reference link: Borrowed("possible values: md, html")
09:31:29 [WARN] Strip everything after #. The chapter part '#ci-pipeline-integration' is not checked.
[ OK ] ./README.md (19, 8) => #ci-pipeline-integration - 
[ OK ] ./README.md (64, 1) => ./docs/FailingAnnotation.PNG - 
[ OK ] ./README.md (32, 28) => https://doc.rust-lang.org/cargo/ - 
[ OK ] ./README.md (4, 2) => https://badgen.net/crates/d/mlc?color=blue - 
[ OK ] ./README.md (46, 56) => https://github.com/marketplace/actions/markup-link-checker-mlc - 
[ OK ] ./README.md (20, 29) => https://rust-lang.github.io/async-book/ - 
[ OK ] ./README.md (3, 2) => https://img.shields.io/crates/v/mlc.svg?color=orange - 
[ OK ] ./README.md (9, 1) => https://asciinema.org/a/299100 - 
[ OK ] ./README.md (9, 2) => https://asciinema.org/a/299100.svg - 
[ OK ] ./README.md (6, 2) => https://img.shields.io/badge/License-MIT-yellow.svg - 
[ OK ] ./README.md (5, 2) => https://github.com/becheran/mlc/actions/workflows/rust.yml/badge.svg - 
[ OK ] ./README.md (7, 2) => https://img.shields.io/badge/PRs-welcome-brightgreen.svg - 
[ OK ] ./README.md (3, 1) => https://crates.io/crates/mlc - 
[ OK ] ./README.md (4, 1) => https://crates.io/crates/mlc - 
[ OK ] ./README.md (32, 92) => https://crates.io/crates/mlc - 
[Err ] ./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions - 403 - Forbidden
[ OK ] ./README.md (144, 60) => https://github.com/becheran/mlc/blob/master/LICENSE - 
[ OK ] ./README.md (75, 32) => https://github.com/becheran/ntest/blob/master/.github/workflows/ci.yml - 
[ OK ] ./README.md (79, 37) => https://hub.docker.com/repository/docker/becheran/mlc - 
[ OK ] ./README.md (140, 14) => https://github.com/becheran/mlc/blob/master/CHANGELOG.md - 
[ OK ] ./README.md (6, 1) => https://opensource.org/licenses/MIT - 
[ OK ] ./README.md (112, 221) => https://github.com/becheran/wildmatch - 
[ OK ] ./README.md (40, 54) => https://github.com/becheran/mlc/releases - 
[ OK ] ./README.md (5, 1) => https://github.com/becheran/mlc/actions/workflows/rust.yml - 
[ OK ] ./README.md (7, 1) => https://github.com/becheran/mlc/blob/master/CONTRIBUTING.md - 

Result (25 links):

OK       24
Skipped  0
Warnings 0
Errors   1


The following links could not be resolved:

./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions.

@becheran
Owner

Ah, right. I made the same mistake and ran it on the wrong branch locally 🤦‍♂️

@becheran
Owner

@diegorondini would 'Accept-Encoding: *' help in this case? It might be a sane default, since we don't care about the content anyway right now.

To make it configurable, I think a map of links with wildcards and associated headers would make sense as a config parameter. Will think about it.
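
One possible shape for such a map, as a rough sketch only: the rule list, the example patterns, and the `headers_for` helper are all hypothetical, and the `<token>` value is a placeholder. Shell-style matching from Python's `fnmatch` stands in for whatever wildcard matcher mlc would actually use.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (wildcard pattern, headers to send for matching links).
HEADER_RULES = [
    ("https://docs.github.com/*", {"Accept-Encoding": "zstd, br, gzip, deflate"}),
    ("https://*.example.com/private/*", {"Authorization": "Bearer <token>"}),
]


def headers_for(url, rules=HEADER_RULES):
    """Merge the headers of every rule matching `url`; later rules win on conflicts."""
    merged = {}
    for pattern, headers in rules:
        # fnmatchcase is case-sensitive, and its `*` also matches `/`.
        if fnmatchcase(url, pattern):
            merged.update(headers)
    return merged
```

Note that `fnmatch` also treats `?` and `[` as wildcard syntax, so patterns containing a query string would need escaping or a different matcher.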

@diegorondini
Contributor Author

@becheran well, not literally:

$ curl -i -H "Accept-Encoding: *" -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 403
[...]

The official way to mean any encoding should be Accept-Encoding: */*, but I don't know how well it works in practice.
https://stackoverflow.com/questions/25182888/does-in-an-http-accepts-encoding-header-mean-gzip-is-supported

The library you're using (reqwest?) may support accepting all encodings. Libcurl does that:
https://curl.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html

I'm not sure, though, whether servers that don't support compression or encoding gracefully ignore the "Accept-Encoding" header.

@becheran
Owner

Yes, I am using reqwest. I turned on all supported encodings (brotli, gzip, deflate), and that did the trick for now. But I guess there are other cases where a custom request is still required, for example if an authentication token is required for a specific link.
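
As an illustration of how the two cases could combine, the sketch below builds a default Accept-Encoding header from the three encodings named above and adds a token only for selected hosts. Everything here is an assumption rather than mlc or reqwest code: the host `private.example.com`, the environment variable `EXAMPLE_TOKEN`, and the helper names are made up for the example.

```python
import os
from urllib.parse import urlsplit

# Encodings named in the comment above.
SUPPORTED_ENCODINGS = ("br", "gzip", "deflate")

# Hypothetical mapping: host -> environment variable holding a bearer token.
TOKEN_ENV_BY_HOST = {"private.example.com": "EXAMPLE_TOKEN"}


def default_headers():
    """Headers sent with every request."""
    return {"Accept-Encoding": ", ".join(SUPPORTED_ENCODINGS)}


def headers_for(url, environ=os.environ):
    """Return the default headers, plus Authorization if a token is configured for the host."""
    headers = default_headers()
    env_name = TOKEN_ENV_BY_HOST.get(urlsplit(url).hostname)
    token = environ.get(env_name) if env_name else None
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Reading the token from the environment keeps the secret out of the config file, which matters when the checker runs in a CI pipeline.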
