Support for AsciiDoc (.adoc) files #291

lurch · 2021-08-02T09:04:37Z

Would it be possible to add support for checking links in AsciiDoc (*.adoc) files? It seems that lychee is currently unable to correctly parse asciidoc-style URLs and reports e.g.

✗ https://www.kernel.org/doc/Documentation/fb/framebuffer.txt[framebuffer] (HTTP status client error (404 Not Found) for url (https://www.kernel.org/doc/Documentation/fb/framebuffer.txt[framebuffer]))

when obviously the URL it should be checking is https://www.kernel.org/doc/Documentation/fb/framebuffer.txt

The reason I ask is that we're switching our documentation repo from markdown-source to asciidoc-source, and it would be nice to be able to run https://github.com/marketplace/actions/lychee-broken-link-checker directly against the asciidoc files.

The text was updated successfully, but these errors were encountered:

lebensterben · 2021-08-04T03:33:45Z

there are a few differences between asciidoc and asciidoctor, even though the later claimed to be another implementation of the former.

We should support both.

mre · 2021-08-20T09:19:01Z

Are there any good crates for AsciiDoc yet? Couldn't find any.

lebensterben · 2021-08-20T11:05:36Z

we can pandoc it to html

mre · 2021-08-20T12:22:42Z

Smart. Would we have to wrap the pandoc commandline tool for that?

lebensterben · 2021-08-20T16:20:26Z

@mre

First make it a optional feature, say "pandoc: Adds support for more file formats with help of 'pandoc'".

If the feature is enabled, then also package a binary release, which would not violate pandoc's GPL v2 license. And use this library in our code https://lib.rs/crates/pandoc

If the feature isn't enabled, but the user provided some known file formats that are only supported by pandoc, it should recommend them to use the optional 'pandoc' feature.

mre · 2021-08-20T17:05:19Z

That's cool. Yeah let's do this.

mre · 2022-01-13T00:33:22Z

How would we package the pandoc binary? In the Docker image it's pretty easy, but what about binary? Is there any build.rs file we could use as an example?
I'd argue that most people would use the Docker image for that or install pandoc themselves. We could add a check if it's in the PATH on startup.

lebensterben · 2022-01-13T00:45:44Z

@mre
You can for example write a shell script to grab the latest release from their repository.

mre · 2022-11-10T22:31:45Z

Just so we don't forget, there's also asciidoctor, which could be another binary to search for when trying to convert AsciiDoc. Here's an example for how to convert AsciiDoc to HTML: https://github.com/jeremyandrews/cio/blob/5c98cda6e31219e5063d591062086bbc370adbea/cio/src/rfds.rs#L95

lurch · 2022-11-11T00:34:13Z

I guess the downside of converting the AsciiDoc to HTML first, and then link-checking the HTML, is that any broken-link errors will be reported against the HTML output filenames, and not the original AsciiDoc input filenames? 🤔 🤷‍♂️
This will especially be a problem if the AsciiDoc sourcefiles make extensive use of the include directive (which we use a lot in https://github.com/raspberrypi/documentation/ ) .

mre · 2022-11-11T01:00:30Z

I see. That's a problem indeed. However we could track files that got converted in the process and print the original filenames instead. (We don't support printing the location of broken links within files so it should be fine.)
If include directives get handled by Asciidoctor when it converts to HTML then we should be good no?

lurch · 2022-11-11T01:08:10Z

If include directives get handled by Asciidoctor when it converts to HTML then we should be good no?

Consider if topfile.adoc includes file1.adoc and file2.adoc - when you convert to HTML this will get flattened into a single topfile.html. So if there was a broken link in file1.adoc, I guess your proposed solution would actually report it as being inside topfile.adoc, which wouldn't be right. (and obviously the more includes you use, see e.g. https://github.com/raw/raspberrypi/documentation/develop/documentation/asciidoc/computers/configuration.adoc , the worse this problem would become)

So perhaps in hindsight, the original request was outside the scope of what lychee could/should be able to do? 🤷

mre · 2022-11-11T10:51:57Z

You're right. That won't work.
I would really love to support AsciiDoc, but there are no Rust parsers that I'm aware of.
Link checking would be very rudimentary at best without that. 😞

lurch · 2022-11-11T11:17:53Z

You could just use some regexes? Probably no need to parse the whole thing "properly"?

mre · 2022-11-11T13:35:55Z

Honestly I'm scared of the edge-cases.

A regex-based solution might produce a lot of false-positives and these are worse than false-negatives in my experience: It's tedious for a user to filter out unwanted matches instead of missing a valid link.

Let me show you what I mean with an example.

To match links, the easiest case would be to start with a regular expression like

(.+)\[.*?\]

This would work fine in many situations like

https://chat.asciidoc.org[Discuss AsciiDoc,role=external,window=_blank]
https://chat.asciidoc.org[Discuss AsciiDoc^]
https://example.org/?q=%5Ba%20b%5D[URL with special characters]

but it would also match

[[bookmark-a]]Inline anchors make arbitrary content referenceable.

So one might think: "let's just ignore links with spaces"!

One could try

([^\s]+)\[.*?\]

This would work, but it would break in the following case:

link:++https://example.org/?q=[a b]++[URL with special characters]

Even if we could match on the last occurrence of square brackets before a string (which wouldn't be pretty),
it would still incorrectly match includes:

include::filename.txt[tag=definition]

To handle this case we'd need negative lookaround, which is not supported by the regex crate
and even if we did that with some post-processing ( s.trim_start_matches("include::"); it would not be very maintainable.

And these examples are just taken from the AsciiDoc quick reference. It's likely that there might be more edge-cases in real world examples.

Unless I'm missing something the best option might be to wait for a crate with proper parsing support or use an external tool and explicitly drop support for includes.

lurch · 2022-11-11T14:36:20Z

Honestly I'm scared of the edge-cases.
A regex-based solution might produce a lot of false-positives and these are worse than false-negatives in my experience

Yeah, that's fair enough. Edge-cases always turn out to be a lot hairier than you expect 😆

If it makes any difference, I was only expecting lychee to check external links (e.g. those starting with http:// or https://) rather than also checking internal cross-references; but even that would probably fall foul of the edge-cases you've identified.

Feel free to close this issue if you think that it's out of scope for this project.

mre · 2022-11-11T15:10:11Z

Let's keep it open for the time being because I'd still love to support it. Maybe there will be an AsciiDoc crate in the future.

mre added the enhancement New feature or request label Aug 20, 2021

mre added good first issue Good for newcomers help wanted Extra attention is needed labels Feb 4, 2022

mre mentioned this issue Mar 28, 2022

Support epub format #202

Open

mre mentioned this issue Jul 18, 2022

Support for more formats (XML, JSON) #696

Closed

mre mentioned this issue Nov 10, 2022

mkdir error lycheeverse/lychee-action#174

Closed

mre added waiting-for-upstream and removed good first issue Good for newcomers help wanted Extra attention is needed labels Nov 11, 2022

nathany mentioned this issue Dec 3, 2023

Improve reporting of redirects and excluded links #1312

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for AsciiDoc (.adoc) files #291

Support for AsciiDoc (.adoc) files #291

lurch commented Aug 2, 2021

lebensterben commented Aug 4, 2021

mre commented Aug 20, 2021

lebensterben commented Aug 20, 2021

mre commented Aug 20, 2021

lebensterben commented Aug 20, 2021

mre commented Aug 20, 2021

mre commented Jan 13, 2022

lebensterben commented Jan 13, 2022

mre commented Nov 10, 2022

lurch commented Nov 11, 2022

mre commented Nov 11, 2022

lurch commented Nov 11, 2022

mre commented Nov 11, 2022

lurch commented Nov 11, 2022

mre commented Nov 11, 2022

lurch commented Nov 11, 2022

mre commented Nov 11, 2022

Support for AsciiDoc (.adoc) files #291

Support for AsciiDoc (.adoc) files #291

Comments

lurch commented Aug 2, 2021

lebensterben commented Aug 4, 2021

mre commented Aug 20, 2021

lebensterben commented Aug 20, 2021

mre commented Aug 20, 2021

lebensterben commented Aug 20, 2021

mre commented Aug 20, 2021

mre commented Jan 13, 2022

lebensterben commented Jan 13, 2022

mre commented Nov 10, 2022

lurch commented Nov 11, 2022

mre commented Nov 11, 2022

lurch commented Nov 11, 2022

mre commented Nov 11, 2022

lurch commented Nov 11, 2022

mre commented Nov 11, 2022

lurch commented Nov 11, 2022

mre commented Nov 11, 2022