Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for AsciiDoc (.adoc) files #291

Open
lurch opened this issue Aug 2, 2021 · 17 comments
Open

Support for AsciiDoc (.adoc) files #291

lurch opened this issue Aug 2, 2021 · 17 comments
Labels

Comments

@lurch
Copy link

lurch commented Aug 2, 2021

Would it be possible to add support for checking links in AsciiDoc (*.adoc) files? It seems that lychee is currently unable to correctly parse asciidoc-style URLs and reports e.g.

https://www.kernel.org/doc/Documentation/fb/framebuffer.txt[framebuffer] (HTTP status client error (404 Not Found) for url (https://www.kernel.org/doc/Documentation/fb/framebuffer.txt[framebuffer]))

when obviously the URL it should be checking is https://www.kernel.org/doc/Documentation/fb/framebuffer.txt

The reason I ask is that we're switching our documentation repo from markdown-source to asciidoc-source, and it would be nice to be able to run https://github.com/marketplace/actions/lychee-broken-link-checker directly against the asciidoc files.

@lebensterben
Copy link
Member

there are a few differences between asciidoc and asciidoctor, even though the later claimed to be another implementation of the former.

We should support both.

@mre
Copy link
Member

mre commented Aug 20, 2021

Are there any good crates for AsciiDoc yet? Couldn't find any.

@mre mre added the enhancement New feature or request label Aug 20, 2021
@lebensterben
Copy link
Member

we can pandoc it to html

@mre
Copy link
Member

mre commented Aug 20, 2021

Smart. Would we have to wrap the pandoc commandline tool for that?

@lebensterben
Copy link
Member

@mre

First make it a optional feature, say "pandoc: Adds support for more file formats with help of 'pandoc'".

If the feature is enabled, then also package a binary release, which would not violate pandoc's GPL v2 license. And use this library in our code https://lib.rs/crates/pandoc

If the feature isn't enabled, but the user provided some known file formats that are only supported by pandoc, it should recommend them to use the optional 'pandoc' feature.

@mre
Copy link
Member

mre commented Aug 20, 2021

That's cool. Yeah let's do this.

@mre
Copy link
Member

mre commented Jan 13, 2022

How would we package the pandoc binary? In the Docker image it's pretty easy, but what about binary? Is there any build.rs file we could use as an example?
I'd argue that most people would use the Docker image for that or install pandoc themselves. We could add a check if it's in the PATH on startup.

@lebensterben
Copy link
Member

@mre
You can for example write a shell script to grab the latest release from their repository.

@mre
Copy link
Member

mre commented Nov 10, 2022

Just so we don't forget, there's also asciidoctor, which could be another binary to search for when trying to convert AsciiDoc. Here's an example for how to convert AsciiDoc to HTML: https://github.com/jeremyandrews/cio/blob/5c98cda6e31219e5063d591062086bbc370adbea/cio/src/rfds.rs#L95

@lurch
Copy link
Author

lurch commented Nov 11, 2022

I guess the downside of converting the AsciiDoc to HTML first, and then link-checking the HTML, is that any broken-link errors will be reported against the HTML output filenames, and not the original AsciiDoc input filenames? 🤔 🤷‍♂️
This will especially be a problem if the AsciiDoc sourcefiles make extensive use of the include directive (which we use a lot in https://github.com/raspberrypi/documentation/ ) .

@mre
Copy link
Member

mre commented Nov 11, 2022

I see. That's a problem indeed. However we could track files that got converted in the process and print the original filenames instead. (We don't support printing the location of broken links within files so it should be fine.)
If include directives get handled by Asciidoctor when it converts to HTML then we should be good no?

@lurch
Copy link
Author

lurch commented Nov 11, 2022

If include directives get handled by Asciidoctor when it converts to HTML then we should be good no?

Consider if topfile.adoc includes file1.adoc and file2.adoc - when you convert to HTML this will get flattened into a single topfile.html. So if there was a broken link in file1.adoc, I guess your proposed solution would actually report it as being inside topfile.adoc, which wouldn't be right. (and obviously the more includes you use, see e.g. https://github.com/raw/raspberrypi/documentation/develop/documentation/asciidoc/computers/configuration.adoc , the worse this problem would become)

So perhaps in hindsight, the original request was outside the scope of what lychee could/should be able to do? 🤷

@mre
Copy link
Member

mre commented Nov 11, 2022

You're right. That won't work.
I would really love to support AsciiDoc, but there are no Rust parsers that I'm aware of.
Link checking would be very rudimentary at best without that. 😞

@lurch
Copy link
Author

lurch commented Nov 11, 2022

You could just use some regexes? Probably no need to parse the whole thing "properly"?

@mre
Copy link
Member

mre commented Nov 11, 2022

Honestly I'm scared of the edge-cases.

A regex-based solution might produce a lot of false-positives and these are worse than false-negatives in my experience: It's tedious for a user to filter out unwanted matches instead of missing a valid link.

Let me show you what I mean with an example.

To match links, the easiest case would be to start with a regular expression like

(.+)\[.*?\]

This would work fine in many situations like

https://chat.asciidoc.org[Discuss AsciiDoc,role=external,window=_blank]
https://chat.asciidoc.org[Discuss AsciiDoc^]
https://example.org/?q=%5Ba%20b%5D[URL with special characters]

but it would also match

[[bookmark-a]]Inline anchors make arbitrary content referenceable.

So one might think: "let's just ignore links with spaces"!

One could try

([^\s]+)\[.*?\]

This would work, but it would break in the following case:

link:++https://example.org/?q=[a b]++[URL with special characters]

Even if we could match on the last occurrence of square brackets before a string (which wouldn't be pretty),
it would still incorrectly match includes:

include::filename.txt[tag=definition]

To handle this case we'd need negative lookaround, which is not supported by the regex crate
and even if we did that with some post-processing ( s.trim_start_matches("include::"); it would not be very maintainable.

And these examples are just taken from the AsciiDoc quick reference. It's likely that there might be more edge-cases in real world examples.

Unless I'm missing something the best option might be to wait for a crate with proper parsing support or use an external tool and explicitly drop support for includes.

@lurch
Copy link
Author

lurch commented Nov 11, 2022

Honestly I'm scared of the edge-cases.
A regex-based solution might produce a lot of false-positives and these are worse than false-negatives in my experience

Yeah, that's fair enough. Edge-cases always turn out to be a lot hairier than you expect 😆

If it makes any difference, I was only expecting lychee to check external links (e.g. those starting with http:// or https://) rather than also checking internal cross-references; but even that would probably fall foul of the edge-cases you've identified.

Feel free to close this issue if you think that it's out of scope for this project.

@mre
Copy link
Member

mre commented Nov 11, 2022

Let's keep it open for the time being because I'd still love to support it. Maybe there will be an AsciiDoc crate in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants