Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add .bas extension to VB6 and heuristics for VB6/VBA #6355

Merged
merged 30 commits into from
May 30, 2023

Conversation

DecimalTurn
Copy link
Contributor

@DecimalTurn DecimalTurn commented Apr 2, 2023

This PR:

  • Adds the extension .bas as an extension for VB6
  • Adds samples for VB6 and VBA to help with disambiguation
  • Implements heuristics for VBA and VB6 with respect to the .bas extension

Description

This is the 3rd instalment of a 3 part series related to VB6 (and VBA).
For more context:
Part 1: Add Visual Basic 6.0 as its own language
Part 2: Add .frm and .cls extensions to VB6

The reason why I've waited a bit to add the .bas extension was because it is more tricky than for the extensions in part 2. Indeed, .bas is used for a module file and the attributes at the top of those files are identical between VBA and VB6.
ie.:
Attribute VB_Name = "ClassOrModuleName"
So, we can't use it to separate the two, but we can at least use it to distinguish VBA/VB6 from other languages using the .bas file extension (see new heuristic with named pattern vb-module).

Hence, the general approach that I'm proposing is the same idea as what was used for the .h extension as suggested by @lildude:

This whole VB-family of languages classification issue sounds like it's going to be a lot like the .h extension used by the C-family of languages which was a long battle until we finally resolved on the strategy in use and detailed in the troublshooting doc. If a similar approach can be found for these VB languages [and] their shared extensions, then that would be great. If not, the only option is manual overrides.

From #5824 (comment)

This comment refers to part of the troubleshooting.md file reproduced here with the appropriate substitution for the case of .bas :

Correctly detecting the language for the [Visual Basic .bas module] files is tough because Linguist detects the languages of files in isolation when analysing repositories and these [module] files, especially the smaller ones, can be used across [the 2] languages without using any language-specific content. To try and reduce the number of false positives and keep some degree of predicatability (sic) for users, Linguist will assume all [.bas module files with a VB_NAME Attribute] are [VB6] by default and will only identify a file as [VBA] if the content matches the specific heuristic for that language. This will mean you will need to implement an override for some of your [module] files if you wish for them to be classified as [VBA] if they do no contain language-specific content.

Difference between VBA and VB6
Alright, now let's get to the meat of it: what are the identifiable differences between VBA and VB6?

  1. VBA7 has new 64-bit features that were not implemented in VB6
  2. VBA has some unique top module declarations
    a. Option Private Module
    b. Option Compare Database
  3. VBA has preloaded libraries with objects that are common in VBA code

These 3 sections appear in the vba named pattern at the bottom of the heuristic file. I don't think there is a need for debate on (1) and (2) as those differences between VBA and VB6 are directly based on syntax and are well documented. However, I could see some disagreement regarding section (3) as there is a bit more of a grey area there.

Regarding the choices made for (3), I've included the invocation of a VBA library. This means that just having "Word" in the code is not enough to trigger the regex, you need to invoke a method/property/constant on it like (e.g. Word.Application). I've also included multiple common objects that are part of the DLLs VBA has natively access to.

Potential issues with section 3
The issue with this is that some VB6 code can reference the DLLs in question and then be "wrongly" labelled as VBA. I'm using quotes around "wrongly" because a case could be made that if you are using VB to automate an office application, you are basically writing VBA code at this point 😅.

Anyhow, I'm open to tweaking the list of objects included in case there is too many false positives, but since this approach gives VB6 as the default outcome for vb-like modules, it seems fair to give some extra chances to VBA.

@fafalone, @XusinboyBekchanov, @Gagniuc and @zspitz, I would be interested to have your opinion on this since you've been involved in discussions regarding this topic in the past. Cheers!

Checklist:

@DecimalTurn DecimalTurn requested a review from a team as a code owner April 2, 2023 23:12
@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 2, 2023

Looks like with the new samples, the FreeBasic samples are now detected as VB6:

FreeBasic/Plasma Generation.bas BAD (Visual Basic 6.0)
FreeBasic/try_catch_throw.bas BAD (Visual Basic 6.0)

@XusinboyBekchanov, are there any good samples you'd suggest to add for FreeBasic?

@zspitz
Copy link
Contributor

zspitz commented Apr 3, 2023

I would argue that the situation around VBA/VB6 is more akin to JavaScript. The language is the same language regardless of runtime environment (browser, Node.js) or loaded libraries and global objects.

My original intention was to use VBA/VB6 as the language name, but Linguist doesn't like slashes in language names so I went with VBA.

AFAICT the only argument for VB6 as a separate language is that there are some file extensions and file formats unique to VB6.

@Alhadis Alhadis changed the title Add .bas extension to VB6 and heuristics for VB6/VBA Add .bas extension to VB6 and heuristics for VB6/VBA Apr 3, 2023
@Alhadis
Copy link
Collaborator

Alhadis commented Apr 3, 2023

My original intention was to use VBA/VB6 as the language name

That strikes me as a combination of two equally-valid names for the same technology. In which case, either VBA or VB6 should be added as an alias instead. Otherwise, if you really must include a slash, then this is something that the fs_name field should (ideally) accommodate.

I would argue that the situation around VBA/VB6 is more akin to JavaScript.

Interesting analogy. 😉 Early versions of Internet Explorer actually did support VBA (or something with "Visual Basic" in its name), but it was discontinued due to lack of adoption and the growing popularity of JavaScript.

@zspitz
Copy link
Contributor

zspitz commented Apr 3, 2023

Early versions of Internet Explorer actually did support VBA (or something with "Visual Basic" in its name)

That would be VBScript, which (I would argue) is indeed a separate language. Most notably, because -- like JavaScript -- it has no notion of compile-time types.

@Alhadis
Copy link
Collaborator

Alhadis commented Apr 3, 2023

like JavaScript -- it has no notion of compile-time types.

Static typing is overrated anyway.

@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 3, 2023

Thanks for the feedback @zspitz!
However, VBA and VB6 do have a different syntax when it comes to handling 64-bit stuff and there is also module declarations that you won't find in VB6.

This means that unlike JavaScript, you can't just copy-paste code across IDEs, add some librairies and expect the code to work. It won't compile in VB6.

For this reason, I would argue that we should keep them as seperate languages.

From the PR description:

Difference between VBA and VB6
Alright, now let's get to the meat of it: what are the identifiable differences between VBA and VB6?

VBA7 has new 64-bit features that were not implemented in VB6
VBA has some unique top module declarations
a. Option Private Module
b. Option Compare Database

@XusinboyBekchanov
Copy link
Contributor

Looks like with the new samples, the FreeBasic samples are now detected as VB6:

FreeBasic/Plasma Generation.bas BAD (Visual Basic 6.0)
FreeBasic/try_catch_throw.bas BAD (Visual Basic 6.0)

@XusinboyBekchanov, are there any good samples you'd suggest to add for FreeBasic?

Plasma Generation.bas uses a unique keyword for FreeBasic Shared: https://freebasic.net/wiki/KeyPgShared

The try_catch_throw.bas uses the #include keyword as specified in the heuristic: https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml#L116

Copy link
Collaborator

@Alhadis Alhadis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a brief flashback to high school accounting classes (my only experience with Microsoft Excel) and recall formulae being case-insensitive. A quick Google search suggests Visual Basic—or one of its confuseables1is case-insensitive. If so, these regexes should begin with (?i) to mark them as case-insensitive matches. I'll leave this up to you to decide; I'm out-of-my-depth when it comes to Microsoft technologies named /Visual\s*(Studio|Basic)/.

Footnotes

  1. These technologies wouldn't be Microsoft if they weren't given confusing, forgettable and/or easily-rebranded names. Or in the case of Excel's Turing-complete formula functions, no programming language name whatsoever.

lib/linguist/heuristics.yml Outdated Show resolved Hide resolved
lib/linguist/heuristics.yml Outdated Show resolved Hide resolved
lib/linguist/heuristics.yml Outdated Show resolved Hide resolved
lib/linguist/heuristics.yml Outdated Show resolved Hide resolved
lib/linguist/heuristics.yml Outdated Show resolved Hide resolved
lib/linguist/heuristics.yml Outdated Show resolved Hide resolved
lib/linguist/heuristics.yml Outdated Show resolved Hide resolved
@XusinboyBekchanov
Copy link
Contributor

XusinboyBekchanov commented Apr 3, 2023

I had a brief flashback to high school accounting classes (my only experience with Microsoft Excel) and recall formulae being case-insensitive. A quick Google search suggests Visual Basic—or one of its confuseables—is case-insensitive. If so, these regexes should begin with (?i) to mark them as case-insensitive matches. I'll leave this up to you to decide; I'm out-of-my-depth when it comes to Microsoft technologies named /Visual\s*(Studio|Basic)/.

This needs to be added for FreeBasic.

@XusinboyBekchanov
Copy link
Contributor

For VBA, you also need to add: Attribute VB_Name =

@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 3, 2023

Plasma Generation.bas uses a unique keyword for FreeBasic Shared: freebasic.net/wiki/KeyPgShared

The try_catch_throw.bas uses the #include keyword as specified in the heuristic: master/lib/linguist/heuristics.yml#L116

Thanks for the clarification @XusinboyBekchanov, so my undestanding is that the supplied sample Plasma Generation.bas is not really code that can be found "in the wild" but more of an example to illustrate a feature of the language.

What if we added this sample below instead which is longer and more in line with a good sample to illustrate the use of the "shared" keyword?
https://github.com/freebasic/fbc/blob/e43074e5aa97a904a31c00d9917bc952308bddc0/src/compiler/ir.bas

@XusinboyBekchanov
Copy link
Contributor

XusinboyBekchanov commented Apr 3, 2023

Thanks for the clarification @XusinboyBekchanov, so my understanding is that the supplied sample Plasma Generation.bas is not really code that can be found "in the wild" but more of an example to illustrate a feature of the language.

What if we added this sample below instead which is longer and more in line with a good sample to illustrate the use of the "shared" keyword? https://github.com/freebasic/fbc/blob/e43074e5aa97a904a31c00d9917bc952308bddc0/src/compiler/ir.bas

The one provided in your example has the same #include. In this case, it is determined by this feature.

@Alhadis
Copy link
Collaborator

Alhadis commented Apr 3, 2023

Folks, let's try to avoid nested quote-blocks (preferably by quoting only the relevant parts of a comment). They make this discussion much harder to follow, not to mention encumbering CTRL+F searches to find a specific part of a user's original reply.

@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 3, 2023

@XusinboyBekchanov, from what I understand, the main goal of the samples are to help the classifer make a guess when the heuristics aren't conclusive. So, it's important that the samples represent FreeBasic code "in the wild" to help classify real code and not contrived examples.

I'm going to change the sample Plasma Generation.bas for this sample to see if it helps with the classifier: freebasic/fbc@e43074e/src/compiler/ir.bas (License GNU GPLv2).

lib/linguist/heuristics.yml Outdated Show resolved Hide resolved
@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 3, 2023

For VBA, you also need to add: Attribute VB_Name =

@XusinboyBekchanov I've added this heuristic in the form of the named pattern "vb-module" as it is both used by VBA and VB6

@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 3, 2023

[...] If so, these regexes should begin with (?i) to mark them as case-insensitive matches.

@Alhadis, thanks for letting me know this is possible. While it is true that Excel formulas are case-insensitive, it is not always true for VBA. For instance, the reserved keywords and compiler constants will have a specific capitalization that the VBA IDE will enforce as soon as you exit the line in the editor.

2023-04-02_23-45-16

For this reason, I don't think we should worry about the case insensitivity of VBA. The only place it can play a role is with variable names (sometimes), but that shouldn't apply to the heuristics used here.

@Alhadis
Copy link
Collaborator

Alhadis commented Apr 3, 2023

specific capitalization that the VBA IDE will enforce as soon as you exit the line in the editor.

Seriously? (An editor that does that would shit me right up the wall…)

For this reason, I don't think we should worry about the case insensitivity of VBA

Agreed, provided there's no substantial risk of these files being authored in other editing environments. However, @XusinboyBekchanov mentioned earlier that FreeBASIC's heuristics should be marked case-insensitive (where it makes sense to do so), although that might warrant a separate PR.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please find smaller samples. Hint: if the PR is not displaying the full file content in the diff, it's probably unnecessarily large.

I think we're also going to need another FreeBasic sample or two (depends on how things behave after you replace the large samples) as the classifier is now detecting one of the current FreeBasic samples as Visual Basic 6.0:

FreeBasic/try_catch_throw.bas BAD (Visual Basic 6.0)

lib/linguist/languages.yml Show resolved Hide resolved
@DecimalTurn
Copy link
Contributor Author

Please find smaller samples. Hint: if the PR is not displaying the full file content in the diff, it's probably unnecessarily large.

@lildude, thanks for your patience on this one. At least I didn't go with a 8000+ lines sample this time, so there's progress 😅.
I'm working on some more adjustements to the heuristics and I'll make the changes for the samples this week.

@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 9, 2023

Classifier cross-validation is still failing for these:

BASIC/mandelbrot.bas BAD (Visual Basic 6.0)
FreeBasic/try_catch_throw.bas BAD (Visual Basic 6.0)

@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 9, 2023

Adding the following sample for BASIC fixed the issue with BASIC/mandelbrot.bas
https://github.com/raw/DualBrain/gw-basic/032cbf70bcddd5a971a81d7e56e795e61fe028f0/spacesc.bas
License: MIT

However, the issue with FreeBasic didn't improve by adding Picture.bas as a sample 😔

@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 11, 2023

@lildude, I'd be interested to know if you have any more suggestions regarding this issue with samples for FreeBasic (FB) and the failing classifier tests. Just to recap :

What was the original impact of the PR?:

  • Both samples already present in Linguist for FB started to classify as VB6 (try_catch_throw.bas ❌ and Plasma Generation.bas ❌). Note that when you made your previous comment, it was right after I had temporarily removed the sample Plasma Generation.bas to see if that would help.

Which samples I've tried to add:

  • ir.bas => is correctly classifed as FB ✅
  • Picture.bas => is classified as VB6 ❌

So, after adding 2 more samples, we only have 1/4 passing the test. What should be done in this case?

@XusinboyBekchanov
Copy link
Contributor

You have provided several different heuristics. They have Attribute VB_Name = "ClassOrModuleName" inside them too.
But each provided heuristic must contain this check.

@DecimalTurn
Copy link
Contributor Author

DecimalTurn commented Apr 11, 2023

@XusinboyBekchanov, I'm not sure I understand your comment. With the current heuristics, a file will need to match the vb-module pattern AND any of the heuristics listed inside the vba pattern to be labelled as VBA. Hence, the presence of Attribute VB_Name = is made mandatory by the use of the vb-module pattern for VBA (and VB6).

  - language: VBA
    and:
    - named_pattern: vb-module
    - named_pattern: vba
  - language: Visual Basic 6.0
    named_pattern: vb-module
vb-module: '^\s*Attribute VB_Name = '

@XusinboyBekchanov
Copy link
Contributor

Maybe OR?
In your case, the FreeBasic examples should not have been marked as VB6.
Then you can combine all heuristics into one heuristic and check.

@DecimalTurn
Copy link
Contributor Author

@Alhadis, I believe that the suggestions you've made were addressed. Can you please have a look and let me know if there is something I missed or something else that could be improved?

@Alhadis
Copy link
Collaborator

Alhadis commented May 12, 2023

Hey @DecimalTurn, sorry about the wait. 👋 You've been very patient.

Yup, this LGTM. @lildude, any thoughts?

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me. Thanks for your effort and patience.

Note: this PR will not be merged until close to when the next release is made. See here for more details.

@lildude lildude requested a review from a team as a code owner May 30, 2023 09:14
@lildude lildude added this pull request to the merge queue May 30, 2023
Merged via the queue into github-linguist:master with commit e9bd67d May 30, 2023
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants