Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make regexps in heuristics.yml more portable #6417

Merged

Conversation

jorendorff
Copy link
Contributor

@jorendorff jorendorff commented May 18, 2023

Description

Hi. My team is porting Linguist to Rust, and we'd like to reuse the regular expressions in heuristics.yml.

They contain a few Rubyisms—a fairly small number relative to the whole file, but enough that we have to deal with them somehow.

We'd like to upstream the changes, in the hope that other projects (like go-enry) can benefit. These changes make the regexps more compatible with languages like JavaScript, Go, Python, and Rust, without affecting the behavior.

In this PR:

  • Change { to \{ in many rules. Curly braces are special characters in regexps. The Ruby docs actually say that they have to be escaped, but the implementation silently allows it if you don't. Other languages are stricter, though.
  • Change \< to < in two rules.
  • Change \R to \r?\n in two rules, and reword a third to avoid it. The meaning of \R in Ruby includes a few more characters than \r?\n, so this is a slight change in behavior, but it's very unlikely that any files of these particular types are using \v or \f, or Unicode paragraph separators, etc. as line breaks in practice.
  • Remove uses of (?m) because it's not portable — it means one thing in Ruby and a totally different thing in Python, Rust, and JavaScript. The behavior of the regexps has not been changed, though; when removing (?m) I also replaced . with (?:.|\n) or some other equivalent formula. (This didn't make any regexps much longer, happily.)

This doesn't make all of the regexes portable to all languages, but it's an easy big improvement.

Checklist:

N/A

This replaces `\R` with `\r?\n` in two places, which actually slightly changes the meaning, but I believe the regexps were using `\R` because that particular file type is common on Windows.

In a third place, `\R` is replaced with `^`. This works because it follows `(?m:.*?)`, which matches newline characters.
This flag means something different in Ruby vs. other languages, including JavaScript.
@lildude lildude requested review from lildude and Alhadis May 19, 2023 07:19
Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. I'll merge this when I make the next release, hopefully in the next few weeks, depending on "you know what" 😉 @jorendorff.

Heads up @Alhadis as you deal with a lot of the regex queries and reviews.

Copy link
Collaborator

@Alhadis Alhadis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces subtle incompatibilities regarding newlines:

Change \R to \r?\n

This should actually be (?:\r?\n|\r), so that CR line-endings are handled consistently between programming environments.

I also replaced . with (?:.|\n) or some other equivalent formula.

(?:.|\n) won't match carriage returns in CRLF endings, so I recommend using (?:.|[\r\n]) or (?:\s|\S) instead.

@Alhadis Alhadis changed the title Make regexps in heuristics.yml more portable Make regexps in heuristics.yml more portable May 23, 2023
@DecimalTurn
Copy link
Contributor

This should actually be (?:\r?\n|\r), so that CR line-endings are handled consistently between programming environments.

Note that for INI and VB6, since they are Windows specific, we don't really have to worry about the lone CR line-ending case.

@Alhadis
Copy link
Collaborator

Alhadis commented May 23, 2023

Note that for INI and VB6, since they are Windows specific, we don't really have to worry about the lone CR line-ending case.

One could make the same argument about LF endings on Windows (which uses CRLF). However, it's better that we enforce a consistent strategy for matching newlines in our heuristics, as it eliminates this sort of headache.

@DecimalTurn
Copy link
Contributor

One could make the same argument about LF endings on Windows (which uses CRLF). However, it's better that we enforce a consistent strategy for matching newlines in our heuristics, as it eliminates this sort of headache.

It's different for LF endings since git replaces CRLF with LF when a file is recognized as text.
However, I can get behind using (?:\r?\n|\r) everywhere if our goal is to have a constistent strategy for line endings, but we should then also replace existing instances of \r?\n such as :

  - language: Gerber Image
    pattern: '^[DGMT][0-9]{2}\*\r?\n'

And what do we do for the cases with only \n like the following?

  - language: G-code
    pattern: '^[MG][0-9]+\n'

@Alhadis
Copy link
Collaborator

Alhadis commented May 23, 2023

but we should then also replace existing instances of \r?\n

Yes, you're absolutely right we should. This goes for heuristics that naïvely match \n instead of something more platform-agnostic (I confess I'm guilty of making this mistake all the time…).

UPDATE: Never mind, this doesn't actually work consistently between engines after all (see @jorendorff's correction below). Leaving the below rambling as-is for transparency's sake…

(?<ignore_lol>) However, if the heuristic matches a newline as the final part of a pattern, then it can be replaced with `$` instead:
   - language: Gerber Image
-    pattern: '^[DGMT][0-9]{2}\*\r?\n'
+    pattern: '^[DGMT][0-9]{2}\*$'

   - language: G-code
-    pattern: '^[MG][0-9]+\n'
+    pattern: '^[MG][0-9]+$'

It's a different story if something else needs to be matched after the newline, though; i.e., if you're trying to match something like a YAML version header:

^%YAML 1\.2(?:\r?\n|\r)---$

@jorendorff
Copy link
Contributor Author

jorendorff commented May 23, 2023

However, if the heuristic matches a newline as the final part of a pattern, then it can be replaced with $ instead:

Unfortunately, $ does not match before \r in Python, Perl, or Rust. It works in JS, but it's not portable. :-(

EDIT: If I'm reading this right, it doesn't work in Ruby either:

irb(main):001:0> $x = "xyz\r\n"
=> "xyz\r\n"
irb(main):002:0> /xyz$/.match($x)
=> nil

@Alhadis
Copy link
Collaborator

Alhadis commented May 23, 2023

Unfortunately, $ does not match before \r in Python, Perl, or Rust. It works in JS, but it's not portable. :-(

You're right. 😓 (Note to self: Stop using Firefox's console to test regular expressions…)

Alright, scratch everything after the second paragraph in my last reply. Just stick with (?:\r?\n|\r) like God intended.

@jorendorff
Copy link
Contributor Author

jorendorff commented May 24, 2023

OK. Each place where I changed \N \R to \r?\n (only 2 patterns), I've now changed it to (?:\r?\n|\r).

Changing all other places where \n appears is a bigger change. Let me know if you want that. I think it should be a separate PR.

@DecimalTurn
Copy link
Contributor

I guess you meant \R and not \N.

@jorendorff
Copy link
Contributor Author

Yes, that's right. Sorry for the error.

@DecimalTurn
Copy link
Contributor

DecimalTurn commented May 25, 2023

No worries. Everything looks good in terms of the replacement of \R instances and I agree that we can wait for another PR to apply our new line-endings strategy for the rest of the file.

However, I think the second half of @Alhadis' request for changes wasn't addressed yet:

I also replaced . with (?:.|\n) or some other equivalent formula.

(?:.|\n) won't match carriage returns in CRLF endings, so I recommend using (?:.|[\r\n]) or (?:\s|\S) instead.

This affects three regexps I previously touched to remove `(?m)`. `(?:.|\r)`
matches any character in most languages, but not in JS, so this commit switches
to `(?:.|[\r\n])`.
@jorendorff
Copy link
Contributor Author

Sorry, I missed that bit. Fixed now.

@Alhadis
Copy link
Collaborator

Alhadis commented May 28, 2023

Changing all other places where \n appears is a bigger change. […] I think it should be a separate PR.

Agreed. For now, let's keep this one atomic.

@lildude lildude requested a review from a team as a code owner May 30, 2023 09:08
@lildude lildude added this pull request to the merge queue May 30, 2023
Merged via the queue into github-linguist:master with commit ae78fc7 May 30, 2023
5 checks passed
@@ -675,7 +675,7 @@ disambiguations:
- extensions: ['.stl']
rules:
- language: STL
pattern: '\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)'
pattern: '\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooof!! This new regex is causing timeouts on GitHub.com. Why? Because it suffers from catastrophic backtracking with larger files... precisely where this change is as can be seen at https://regex101.com/r/2JKkxf/1 (contrived example using the sample we have)

The old regex doesn't hit this problem as it doesn't seem to be matching the same thing and determines this much sooner 😁 - https://regex101.com/r/gBFDDP/1

/cc @Alhadis @jorendorff

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something obvious, but ^solid[\s\S]*endsolid$ may do the trick. If we want to keep things closer to as they currently are, \A\s*solid[\s\S]*endsolid(?:$|\s) does the trick too with my sample and several of the customer files found tripping things up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\A\s*solid[\s\S]*^endsolid(?:$|\s) keeps things closer to the original by requiring endsolid to be on a new line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason the old regex doesn't match anything in regex101 is that regex101 is using PCRE, not Ruby, and the old regex uses Ruby-specific features that PCRE doesn't understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried both the old regex and the new one on a 5000-line file.

I used this Ruby script.

require 'benchmark'

LAST_GOOD_RE = /\A\s*solid(?=$|\s)(?m:.*?)\Rendsolid(?:$|\s)/
FIRST_BAD_RE = /\A\s*solid(?=$|\s)(?:.|[\r\n])*?^endsolid(?:$|\s)/

text = File.read("SV05_bed.stl")

t = Benchmark.measure {
  LAST_GOOD_RE.match?(text)
}
puts t.real

t = Benchmark.measure {
  FIRST_BAD_RE.match?(text)
}
puts t.real

If I'm reading this right, the old regex runs in 2ms and the new one runs in 4ms, on my laptop. So it really is slower.

@bzz
Copy link
Contributor

bzz commented Sep 6, 2023

Mad props for simplifying regexps syntax and making it portable across the libraries!
@jorendorff you probably know all about it already, but internally we have teams using https://github.com/go-enry/rs-enry to their satisfaction.

@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants