Add support for Lean 4 #6616

eric-wieser · 2023-11-21T16:44:35Z

Description

Checklist:

I am adding a new language.
- The extension of the new language is used in hundreds of repositories on GitHub.com.
  - Search results for each extension:
    - with the community-owned packages: https://github.com/search?type=code&q=NOT+is%3Afork+path%3A*.lean+%2F%28%3F-i%29import+Mathlib%2F+-user%3AAG161
    - without the mathlib monorepo: https://github.com/search?type=code&q=NOT+is%3Afork+path%3A*.lean+%2F%28%3F-i%29import+Mathlib%2F+-user%3Aleanprover-community+-user%3AAG161
- I have included a real-world usage sample for all extensions added in this PR:
  - Sample source(s):
    - https://github.com/leanprover-community/mathlib4/blob/c6979569edc545f999b82d8a833b190c918aec2e/Archive/Wiedijk100Theorems/BirthdayProblem.lean
  - Sample license(s): Apache 2.0
- I have included a syntax highlighting grammar: https://github.com/leanprover/vscode-lean4
- I have added a color
  - Hex value: #RRGGBB
  - Rationale: the existing Lean language also did not choose a color.
- I have updated the heuristics to distinguish my language from others using the same extension.

lildude

Is Lean 4 really so syntactically different from the Lean we already support that it warrants its own entry, samples and grammar?

If this is just a case of getting improved syntax support for Lean 4, does the newer Lean 4 grammar also support earlier Lean versions? If so, it would be better to switch out just the grammar for the newer grammar.

If not, we can add support for Lean 4 as its own entry, but I recommend grouping it with the current Lean (add group: Lean), but note, this PR doesn't meet usage requirements so won't be merged until it does.

You also need to finish off the section you're adding to the languages.yml file as per the CONTRIBUTING.md file.

You will also need to ensure there are at least two samples for both Leans to train the classifier and ensure it can correctly identify the languages of the 4+ samples.

eric-wieser · 2023-11-22T00:47:03Z

> Is Lean 4 really so syntactically different from the Lean we already support that it warrants its own entry, samples and grammar?

I'd argue yes. The syntax changes are much greater than the change from Python 2 to Python 3 (to give a hopefully more familiar example).

> does the newer Lean 4 grammar also support earlier Lean versions?

No, nor is there any intent that this will be supported in future. Lean 3 -> Lean 4 is an almost complete rewrite of the language that was years in the making, and it is not expected for this to happen again any time soon.

> but note, this PR doesn't meet usage requirements so won't be merged until it does.

Can you explain what the requirement is that we do not meet? https://github.com/topics/lean4 suggests that we have >200 repos, though perhaps that counts forks.

> You will also need to ensure there are at least two samples for both Leans to train the classifier and ensure it can correctly identify the languages of the 4+ samples.

I've added one more of each; could you point me to any tests I should add for this? Also note that CI will not run on this PR without you hitting the button that might as well read "this user is not trying to hack our CI".

lildude · 2023-11-22T07:33:41Z

> Can you explain what the requirement is that we do not meet? https://github.com/topics/lean4 suggests that we have >200 repos, though perhaps that counts forks.

See #5756 for the latest, as referenced in the CONTRIBUTING.md file.

> I've added one more of each; could you point me to any tests I should add for this?

The only additional tests would be for the heuristics, which you've already added. The classifier test is this one, which can also be run locally (or in Codespaces/devcontainer).

> Also note that CI will not run on this PR without you hitting the button that might as well read "this user is not trying to hack our CI".

This is only because you're a new contributor, but you should still be able to run the tests locally with `bundle exec rake test` too. I've clicked the magic button anyway.

lildude

Please update the template to list the source and licenses of all the samples you've added.

lib/linguist/languages.yml

grammars.yml

lildude · 2023-11-22T07:57:50Z

The classifier test failure is because your Lean and Lean 4 samples are not sufficiently different for the classifier to tell the two languages apart for samples/Lean/binders.lean. This suggests that either the Lean 4 samples do not contain enough of the unique features of the language, or this file (and the other Lean samples) contains nothing unique to Lean. Replacing them, or adding more representative and illustrative real-world samples with unique features (we don't want contrived samples), may help.

This could take a bit of time, but you can speed up local testing using `bundle exec script/cross-validation --extensions=.lean`.

The other test failure is clear.

eric-wieser · 2023-11-22T11:19:16Z

Thanks! Regarding the classification failure; is this using either the heuristics or the textmate grammars to tell the two apart? Or will adjusting those have no effect on the success rate?

(and if it's not using the heuristics, what is the point of these?)

lildude · 2023-11-22T11:47:27Z

> Thanks! Regarding the classification failure; is this using either the heuristics or the textmate grammars to tell the two apart? Or will adjusting those have no effect on the success rate?

The classifier is using neither. It is tokenising the contents of the samples and then using that to perform a Bayesian classification against the samples.
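
To make that description concrete, here is a minimal sketch of a token-based naive Bayes classifier. It is not Linguist's implementation: the whitespace tokenizer, the `TinyBayes` class, and the two toy training strings are all assumptions made purely for illustration.

```ruby
# Toy token-based naive Bayes classifier (illustrative only; not Linguist's code).
class TinyBayes
  def initialize
    @token_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # language => token => count
    @totals = Hash.new(0)                                   # language => total tokens seen
    @vocab = {}                                             # every distinct token seen
  end

  # Hypothetical tokenizer: a single tokenizer for all languages, splitting on whitespace.
  def tokenize(text)
    text.split
  end

  def train(language, text)
    tokenize(text).each do |tok|
      @token_counts[language][tok] += 1
      @totals[language] += 1
      @vocab[tok] = true
    end
  end

  # Returns the language with the highest summed log-probability for the text,
  # using add-one (Laplace) smoothing and a uniform prior over languages.
  def classify(text)
    tokens = tokenize(text)
    @totals.keys.max_by do |language|
      tokens.sum do |tok|
        count = @token_counts[language][tok]
        Math.log((count + 1.0) / (@totals[language] + @vocab.size))
      end
    end
  end
end

bayes = TinyBayes.new
bayes.train("Lean", "import data.nat.basic theorem foo : tt := begin simp end")
bayes.train("Lean 4", "import Mathlib.Data.Nat.Basic theorem foo : True := by simp")
bayes.classify("theorem bar : True := by simp") # => "Lean 4"
```

With so little training data, a handful of shared tokens can tip the result either way, which is the kind of fragility discussed in this thread.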

> (and if it's not using the heuristics, what is the point of these?)

It's to catch those situations where the heuristics don't find any matches. If you're 💯 certain your heuristics will correctly identify all instances of .lean, then we can drop it down to one sample per language and rely solely on the heuristics.

I see you've updated the search string in the OP which shows much wider usage than your original. Thanks.

eric-wieser · 2023-11-22T12:06:19Z

> It is tokenising the contents of the samples

Is this tokenization done using the language grammars, or is it a single tokenizer for all languages?

> If you're 💯 certain your heuristics will correctly identify all instances of .lean, then we can drop it down to one sample per language and rely solely on the heuristics.

I am certain that the `import [A-Z]` vs `import [a-z]` heuristic will correctly classify more than 99% of complete Lean files, which is better than I would hope for from a Bayesian classifier.
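
The import-case rule could be sketched as follows. This is an illustration only: the method name `guess_lean_version` and the exact regular expressions are assumptions, not necessarily the patterns committed to heuristics.yml in this PR.

```ruby
# Illustrative heuristic: guess the Lean version from the case of the first
# letter after `import`. These regexes are assumptions, not Linguist's own.
LEAN4_IMPORT = /^import [A-Z]/
LEAN3_IMPORT = /^import [a-z]/

# Returns "Lean 4", "Lean", or nil when no import line gives a hint
# (in which case some fallback, such as a classifier, would have to decide).
def guess_lean_version(source)
  return "Lean 4" if source.match?(LEAN4_IMPORT)
  return "Lean" if source.match?(LEAN3_IMPORT)
  nil
end

guess_lean_version("import Mathlib.Data.Nat.Basic\n") # => "Lean 4"
guess_lean_version("import data.nat.basic\n")         # => "Lean"
```

A file with no `import` line at all returns `nil` here, which is the case where the heuristic alone cannot decide.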

Do the heuristics/classifier also run on ```lean code blocks (containing Lean fragments)? If so then the problem is a bit harder, and I should sort out the classifier.

> I see you've updated the search string in the OP which shows much wider usage than your original. Thanks.

Yes, I was going to bring this up once I fixed the other issues. It took me a while to work out how to perform a case-sensitive search!

lildude · 2023-11-22T12:28:50Z

> Is this tokenization done using the language grammars, or is it a single tokenizer for all languages?

Single tokenizer for all. The only thing that uses the grammars is the syntax highlighting engine, which is completely independent of Linguist.

> I am certain that the `import [A-Z]` vs `import [a-z]` heuristic will correctly classify more than 99% of complete Lean files, which is better than I would hope for from a Bayesian classifier.

99% is good enough for me 😁

> Do the heuristics/classifier also run on ```lean code blocks (containing Lean fragments)? If so then the problem is a bit harder, and I should sort out the classifier.

No. No analysis of content in codeblocks is ever performed. The author is expected to specify the language they want if they want syntax highlighting.

eric-wieser · 2023-11-27T22:13:14Z

Do you want me to squash the commits?

lildude · 2023-11-28T08:10:03Z

> Do you want me to squash the commits?

No need; we squash on merge.

eric-wieser · 2023-11-29T11:39:53Z

@lildude: this should hopefully pass CI now.

lildude · 2023-11-29T11:44:39Z

Why are you deleting the samples/Lean/set.hlean sample? We try not to delete samples unless they're blatantly rubbish and this doesn't appear to be the case here.

lildude · 2023-11-29T11:46:55Z

Running `bundle exec licensed cache -c vendor/licenses/config.yml` and pushing the updated file should fix the failing tests.

eric-wieser · 2023-11-29T11:54:13Z

> Why are you deleting the samples/Lean/set.hlean sample? We try not to delete samples unless they're blatantly rubbish and this doesn't appear to be the case here.

If I don't delete it, then the classifier test still runs. Should I add lean to some ignore list instead? Of course, arguably the classifier test should only run on samples which aren't picked up by heuristics, but that sounds like a more involved change.

lildude · 2023-11-29T12:16:17Z

> If I don't delete it, then the classifier test still runs.

Oooof. This is what I anticipated would happen initially when asking about how different things are between the two languages. So whilst the classifier isn't being used to assess the Lean 4 files (as there is only one sample), your Lean 4 sample is still being used to train the classifier. As there are now two .lean samples, the classifier will use the training model to assess the Lean sample and it's getting it wrong (Lean/binary.lean BAD (Lean 4)) because the tokens from your Lean 4 sample better match than those from the Lean samples.

> Should I add lean to some ignore list instead?

No. There isn't one. The only option is to add more samples. Adding another Lean sample that definitely has no Lean 4 tokens should hopefully swing things back in Lean's favour for its sample.

> Of course, arguably the classifier test should only run on samples which aren't picked up by heuristics, but that sounds like a more involved change.

This is deliberately not the case as the classifier is designed to be the last guess attempt for things that don't match the heuristics, hence they're not considered by the classifier at all. It is of course far from perfect and is really just a last ditch attempt.

eric-wieser · 2023-11-29T12:31:34Z

> and it's getting it wrong (Lean/binary.lean BAD (Lean 4)) because the tokens from your Lean 4 sample better match than those from the Lean samples.

To be clear, this doesn't happen any more now that the hlean sample has been deleted, as there is only one sample for each version.

> This is deliberately not the case as the classifier is designed to be the last guess attempt for things that don't match the heuristics, hence they're not considered by the classifier at all. It is of course far from perfect and is really just a last ditch attempt.

If the classifier is only considered as a last-ditch attempt, why is the test checking that it works on samples that always are solved by the first ditch and never make it to the last ditch?

eric-wieser · 2023-11-29T12:33:10Z

> No. There isn't one.

How about this one?

linguist/test/test_classifier.rb

Lines 47 to 50 in e211cdc

    
    # Failures are reasonable in some cases, such as when a file is fully valid in more than one language.
    allowed_failures = {
      "#{samples_path}/C++/rpc.h" => ["C", "C++"],
    }

lildude · 2023-11-29T12:39:52Z

We're trying to avoid adding to that as any new additions should not result in exclusions as it defeats the point of the classifier. That case is one where the sample existed before the most recent improvements so we added the exception rather than removing the sample.

eric-wieser · 2023-11-30T14:51:33Z

Would you mind restarting CI one more time to check that everything is ok, except for the fact that you aren't happy with my deletion of the hlean file?

lildude · 2023-12-06T10:38:37Z

I've worked out what the issue is... the comments in the Lean 4 samples are being considered because the tokenizer doesn't recognise the /- ... -/ comment format used by Lean.

If you remove the comments from all the samples things start behaving correctly.

I'll create a PR to add support for this format of comment.
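
As an illustration of why unrecognised comments matter, the sketch below strips Lean's `/- ... -/` block comments before any tokens would be counted. It is a hypothetical helper, not the change made to Linguist's tokenizer: the name `strip_lean_block_comments` is an assumption, nesting is handled with a simple depth counter, and `--` line comments and string literals are deliberately ignored.

```ruby
# Hypothetical helper: remove Lean block comments (`/- ... -/`, which may nest)
# so that prose inside comments does not contribute tokens.
def strip_lean_block_comments(source)
  out = +""   # unfrozen string buffer for the kept text
  depth = 0   # current comment nesting depth
  i = 0
  while i < source.length
    pair = source[i, 2]
    if pair == "/-"
      depth += 1
      i += 2
    elsif pair == "-/" && depth > 0
      depth -= 1
      i += 2
    else
      out << source[i] if depth.zero?
      i += 1
    end
  end
  out
end

strip_lean_block_comments("/- prose about theorems -/\ntheorem foo : True := trivial")
# => "\ntheorem foo : True := trivial"
```

If the prose inside such comments is left in, its words are counted as tokens just like code, which can skew a token-frequency classifier trained on only a few samples.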

lildude

We should be good to revert bc21901 now

eric-wieser · 2023-12-11T12:28:41Z

Done.

eric-wieser · 2023-12-11T15:03:19Z

Thanks for all your help!

Is it too late for this to make it into #6627?

lildude · 2023-12-11T15:50:29Z

> Is it too late for this to make it into #6627?

Nope. I'll be merging master into that PR and performing one last update of the grammars before making the release tomorrow morning.

Alhadis · 2023-12-12T14:43:01Z

Missed opportunity to name the language LE4N, just saying. 😉

eric-wieser · 2024-01-30T23:00:29Z

> No. No analysis of content in codeblocks is ever performed. The author is expected to specify the language they want if they want syntax highlighting.

Just to check, how exactly do I do that here?

-- with `Lean 4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true

seems to incorrectly highlight as Lean 3 because everything after the space is ignored (`Lean 37` does the same thing)

-- with `Lean4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true

gives no highlighting

lildude · 2024-01-31T09:21:52Z

> seems to incorrectly highlight as Lean 3 because everything after the space is ignored (`Lean 37` does the same thing)

Correct. You can see the Lean 3 grammar is used in the HTML source too:

It'll be highlight-source-lean4 if using the Lean 4 grammar.

> Just to check, how exactly do I do that here?

You need to replace the spaces in the language name with hyphens:

-- with `Lean-4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true

Case doesn't matter either:

-- with `lean-4` as the infostring
abbreviation a_lean_3_line := tt
abbrev aLean4Line := true

I've just checked the Markup docs and this isn't mentioned, however we do mention it in one of the comments in the first example in our overrides doc.
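
The rule described above (spaces become hyphens, case is ignored) can be sketched as a small helper. The method name `infostring_for` is hypothetical, and this only mirrors the behaviour described in this thread; it is not GitHub's actual lookup code.

```ruby
# Hypothetical helper: turn a language name into an infostring usable after
# the opening code fence, per the rule above (spaces -> hyphens; lowercase,
# since case does not matter).
def infostring_for(language_name)
  language_name.strip.downcase.gsub(/\s+/, "-")
end

infostring_for("Lean 4") # => "lean-4"
infostring_for("Lean")   # => "lean"
```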

eric-wieser added 7 commits November 21, 2023 16:44

Add support for Lean 4

47fea05

Update heuristics.yml

60b7aaa

Update languages.yml

4ab14ca

add the submodule

8e1f70a

add readme item

7c625f6

Merge remote-tracking branch 'refs/remotes/origin/patch-1' into patch-1

885bcf8

Create BirthdayProblem.lean

8866c79

eric-wieser marked this pull request as ready for review November 21, 2023 17:11

eric-wieser requested a review from a team as a code owner November 21, 2023 17:11

lildude requested changes Nov 21, 2023

more samples

d870a65

eric-wieser added 2 commits November 22, 2023 00:54

actually run the scripts

9fa2c11

whoops, what a mess

6594d71

eric-wieser force-pushed the patch-1 branch from 86bb7f2 to 6594d71 Compare November 22, 2023 00:56

add heuristic

8cb1bd1

eric-wieser mentioned this pull request Nov 22, 2023

Update GitHub syntax high-light for Lean 4 leanprover/lean4#377

Closed

lildude requested changes Nov 22, 2023

lib/linguist/languages.yml (resolved)

lildude reviewed Nov 22, 2023

grammars.yml (outdated, resolved)

Alphabetize

bae8af0

eric-wieser and others added 3 commits November 22, 2023 22:30

delete the extra samples again, as the heuristics alone should be suf…

7a86a21

…ficient

add group

330f99f

Co-authored-by: Colin Seymour <colin@github.com>

update the vscode package

faff89e

Delete samples/Lean/set.hlean

bc21901

This reduces the numbers of samples down to one, avoiding the statistical classifier.

fix config file

e211cdc

eric-wieser mentioned this pull request Dec 6, 2023

Add Luau language #6612

Merged

6 tasks

lildude mentioned this pull request Dec 6, 2023

Add support for the lean comment format to the tokenizer #6625

Merged

2 tasks

Merge branch 'master' into patch-1

97a6179

lildude requested changes Dec 11, 2023

Revert "Delete samples/Lean/set.hlean"

ba07896

This reverts commit bc21901.

lildude approved these changes Dec 11, 2023

lildude added this pull request to the merge queue Dec 11, 2023

Merged via the queue into github-linguist:master with commit 6a9a3e4 Dec 11, 2023
5 checks passed

This comment was marked as spam.

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
