Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Lean 4 #6616

Merged
merged 19 commits into from
Dec 11, 2023
Merged

Add support for Lean 4 #6616

merged 19 commits into from
Dec 11, 2023

Conversation

eric-wieser
Copy link
Contributor

@eric-wieser eric-wieser commented Nov 21, 2023

Description

Checklist:

@eric-wieser eric-wieser marked this pull request as ready for review November 21, 2023 17:11
@eric-wieser eric-wieser requested a review from a team as a code owner November 21, 2023 17:11
Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Lean 4 really so syntactically different from the Lean we already support that it warrants its own entry, samples and grammar?

If this is just a case of getting improved syntax support for Lean 4, does the newer Lean 4 grammar also support earlier Lean versions? If so, it would be better to switch out just the grammar for the newer grammar.

If not, we can add support for Lean 4 as its own entry, but I recommend grouping it with the current Lean (add group: Lean), but note, this PR doesn't meet usage requirements so won't be merged until it does.

You also need to finish off the section you're adding to the languages.yml file as per the CONTRIBUTING.md file.

You will also need to ensure there are at least two samples for both Leans to train the classifier and ensure it can correctly identify the languages of the 4+ samples.

@eric-wieser
Copy link
Contributor Author

Is Lean 4 really so syntactically different from the Lean we already support that it warrants its own entry, samples and grammar?

I'd argue yes. The syntax changes are much greater than the change from Python 2 to Python 3 (to give a hopefully more familiar example).

does the newer Lean 4 grammar also support earlier Lean versions?

No, nor is there any intent that this will be supported in future. Lean 3 -> Lean 4 is an almost complete rewrite of the language that was years in the making, and it is not expected for this to happen again any time soon.

but note, this PR doesn't meet usage requirements so won't be merged until it does.

Can you explain what the requirement is that we do not meet? https://github.com/topics/lean4 suggests that we have >200 repos, though perhaps that counts forks.

You will also need to ensure there are at least two samples for both Leans to train the classifier and ensure it can correctly identify the languages of the 4+ samples.

I've added one more of each; could you point me to any tests I should add for this? Also note that CI will not run on this PR without you hitting the button that might as well read "this user is not trying to hack our CI".

@lildude
Copy link
Member

lildude commented Nov 22, 2023

Can you explain what the requirement is that we do not meet? https://github.com/topics/lean4 suggests that we have >200 repos, though perhaps that counts forks.

#5756 For the latest, as referenced in the CONTRIBUTING.md file.

I've added one more of each; could you point me to any tests I should add for this?

The only additional tests would be for the heuristics which you've already added. The testing of the classifier is this one which can be run locally (or in Codespaces/devcontainer) too.

Also note that CI will not run on this PR without you hitting the button that might as well read "this user is not trying to hack our CI".

This is only because you're a new contributor but you should still be able to run the tests locally with bundle exec rake test too. I've clicked the magic button anyway.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the template to list the source and licenses of all the samples you've added.

lib/linguist/languages.yml Show resolved Hide resolved
grammars.yml Outdated Show resolved Hide resolved
@lildude
Copy link
Member

lildude commented Nov 22, 2023

The classifier test failure is because your Lean and Lean 4 samples are not sufficiently different for the classifier to differentiate between the two languages for samples/Lean/binders.lean suggesting the Lean 4 samples do not contain enough of the unique features of the language or this file (and the other Lean samples) includes features nothing unique to Lean. Replacing them or adding more representative and illustrative, but real world samples with unique features (we don't want contrived samples) may help.

This could take a bit of time but you can speed up the local testing using bundle exec script/cross-validation --extensions=.lean.

The other test failure is clear.

@eric-wieser
Copy link
Contributor Author

eric-wieser commented Nov 22, 2023

Thanks! Regarding the classification failure; is this using either the heuristics or the textmate grammars to tell the two apart? Or will adjusting those have no effect on the success rate?

(and if it's not using the heuristics, what is the point of these?)

@lildude
Copy link
Member

lildude commented Nov 22, 2023

Thanks! Regarding the classification failure; is this using either the heuristics or the textmate grammars to tell the two apart? Or will adjusting those have no effect on the success rate?

The classifier is using neither. It is tokenising the contents of the samples and then using that to perform a bayesian classification against the samples.

(and if it's not using the heuristics, what is the point of these?)

It's to catch those situations where the heuristics don't find any matches. If you're 💯 certain your heuristics will correctly identify all instances of .lean, then we can drop it down to one sample per language and rely solely on the heuristics.

I see you've updated the search string in the OP which shows much wider usage than your original. Thanks.

@eric-wieser
Copy link
Contributor Author

eric-wieser commented Nov 22, 2023

It is tokenising the contents of the samples

Is this tokenization done using the language grammars, or is it a single tokenizer for all languages?

If you're 💯 certain your heuristics will correctly identify all instances of .lean, then we can drop it down to one sample per language and rely solely on the heuristics.

I am certain that the import [A-Z] vs import [a-z] heuristic will correctly classify more than 99% of complete lean files, which is better than I would hope for than a bayesian classifier.

Do the heuristics/classifier also run on ```lean code blocks (containing Lean fragments)? If so then the problem is a bit harder, and I should sort out the classifier.

I see you've updated the search string in the OP which shows much wider usage than your original. Thanks.

Yes, I was going to bring this up once I fixed the other issues. It took me a while to work out how to perform a case-sensitive search!

@lildude
Copy link
Member

lildude commented Nov 22, 2023

Is this tokenization done using the language grammars, or is it a single tokenizer for all languages?

Single tokenizer for all. The only thing that uses the grammars is the syntax highlighting enginer which is completely independent of Linguist.

I am certain that the import [A-Z] vs import [a-z] heuristic will correctly classify more than 99% of complete lean files, which is better than I would hope for than a bayesian classifier.

99% is good enough for me 😁

Do the heuristics/classifier also run on ```lean code blocks (containing Lean fragments)? If so then the problem is a bit harder, and I should sort out the classifier.

No. No analysis of content in codeblocks is ever performed. The author is expected to specify the language they want if they want syntax highlighting.

@eric-wieser
Copy link
Contributor Author

Do you want me to squash the commits?

@lildude
Copy link
Member

lildude commented Nov 28, 2023

Do you want me to squash the commits?

No need; we squash on merge.

This reduces the numbers of samples down to one, avoiding the statistical classifier.
@eric-wieser
Copy link
Contributor Author

@lildude: this should hopefully pass CI now.

@lildude
Copy link
Member

lildude commented Nov 29, 2023

Why are you deleting the samples/Lean/set.hlean sample? We try not to delete samples unless they're blatantly rubbish and this doesn't appear to be the case here.

@lildude
Copy link
Member

lildude commented Nov 29, 2023

Running bundle exec licensed cache -c vendor/licenses/config.yml and pushing the updated file should fix the failing tests.

@eric-wieser
Copy link
Contributor Author

Why are you deleting the samples/Lean/set.hlean sample? We try not to delete samples unless they're blatantly rubbish and this doesn't appear to be the case here.

If I don't delete it, then the classifier test still runs. Should I add lean so some ignore list instead? Of course, arguably the classifier test should only run on samples which aren't picked up by heuristics, but that sounds like a more involved change.

@lildude
Copy link
Member

lildude commented Nov 29, 2023

If I don't delete it, then the classifier test still runs.

Oooof. This is what I anticipated would happen initially when asking about how different things are between the two languages. So whilst the classifier isn't being used to assess the Lean 4 files (as there is only one sample), your Lean 4 sample is still being used to train the classifier. As there are now two .lean samples, the classifier will use the training model to assess the Lean sample and it's getting it wrong (Lean/binary.lean BAD (Lean 4)) because the tokens from your Lean 4 sample better match than those from the Lean samples.

Should I add lean so some ignore list instead?

No. There isn't one. The only option is to add more samples. Adding another Lean sample that definitely has no Lean 4 tokens should hopefully swing things back in Lean's favour for its sample.

Of course, arguably the classifier test should only run on samples which aren't picked up by heuristics, but that sounds like a more involved change.

This is deliberately not the case as the classifier is designed to be the last guess attempt for things that don't match the heuristics, hence they're not considered by the classifier at all. It is of course far from perfect and is really just a last ditch attempt.

@eric-wieser
Copy link
Contributor Author

and it's getting it wrong (Lean/binary.lean BAD (Lean 4)) because the tokens from your Lean 4 sample better match than those from the Lean samples.

To be clear, this doesn't happy any more now that the hlean sample has been deleted, as there is only one sample for each version

This is deliberately not the case as the classifier is designed to be the last guess attempt for things that don't match the heuristics, hence they're not considered by the classifier at all. It is of course far from perfect and is really just a last ditch attempt.

If the classifier is only considered as a last-ditch attempt, why is the test checking that it works on samples that always are solved by the first ditch and never make it to the last ditch?

@eric-wieser
Copy link
Contributor Author

eric-wieser commented Nov 29, 2023

No. There isn't one.

How about this one?

# Failures are reasonable in some cases, such as when a file is fully valid in more than one language.
allowed_failures = {
"#{samples_path}/C++/rpc.h" => ["C", "C++"],
}

@lildude
Copy link
Member

lildude commented Nov 29, 2023

We're trying to avoid adding to that as any new additions should not result in exclusions as it defeats the point of the classifier. That case is one where the sample existed before the most recent improvements so we added the exception rather than removing the sample.

@eric-wieser
Copy link
Contributor Author

Would you mind restarting CI one more time to check that everything is ok, except for the fact that you aren't happy with my deletion of the hlean file?

@lildude
Copy link
Member

lildude commented Dec 6, 2023

I've worked out what the issue is... the comments in the Lean 4 samples are being considered because the tokenizer doesn't recognise the /- ... -/ comment format used by Lean.

If you remove the comments from all the samples things start behaving correctly.

I'll create a PR to add support for this format of comment.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be good to revert bc21901 now

@eric-wieser
Copy link
Contributor Author

Done.

@lildude lildude added this pull request to the merge queue Dec 11, 2023
Merged via the queue into github-linguist:master with commit 6a9a3e4 Dec 11, 2023
5 checks passed
@eric-wieser
Copy link
Contributor Author

Thanks for all your help!

Is it too late for this to make it into #6627?

@lildude
Copy link
Member

lildude commented Dec 11, 2023

Is it too late for this to make it into #6627?

Nope. I'll be merging master into that PR and performing one last update of the grammars before making the release tomorrow morning.

@Alhadis
Copy link
Collaborator

Alhadis commented Dec 12, 2023

Missed opportunity to name the language LE4N, just saying. 😉

Thatchapanjaihan-github

This comment was marked as spam.

@eric-wieser
Copy link
Contributor Author

eric-wieser commented Jan 30, 2024

No. No analysis of content in codeblocks is ever performed. The author is expected to specify the language they want if they want syntax highlighting.

Just to check how exactly do I do that here?

-- with `Lean 4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true

seems to incorrectly highlight as Lean 3 because everything after the is ignored (Lean 37 does the same thing)

-- with `Lean4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true

gives no highighting

@lildude
Copy link
Member

lildude commented Jan 31, 2024

seems to incorrectly highlight as Lean 3 because everything after the is ignored (Lean 37 does the same thing)

Correct. You can see the Lean 3 grammar is used in the HTML source too:

Using Lean 3 grammar

It'll be highlight-source-lean4 if using the Lean 4 grammar.

Just to check how exactly do I do that here?

You need to replace the spaces in the language name with hyphens:

-- with `Lean-4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true

Case also doesn't matter either:

-- with `lean-4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true

I've just checked the Markup docs and this isn't mentioned, however we do mention it in one of the comments in the first example in our overrides doc.

@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants