Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an option to prefer ScanCode's file- over line-level findings #8152

Merged
merged 9 commits into from
Jan 29, 2024

Conversation

sschuberth
Copy link
Member

Please have a look at the individual commit messages for the details.

@sschuberth sschuberth requested a review from a team as a code owner January 22, 2024 12:32
Comment on lines +94 to 103
if (preferFileLicense && file is FileEntry.Version3 && file.detectedLicenseExpressionSpdx != null) {
licenseFindings += LicenseFinding(
license = file.detectedLicenseExpressionSpdx,
location = TextLocation(
path = file.path,
startLine = license.startLine,
endLine = license.endLine
startLine = licenses.minOf { it.startLine },
endLine = licenses.maxOf { it.endLine }
),
score = license.score
score = licenses.map { it.score }.average().toFloat()
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @Etsija, @lamppu, @mmurto: This should resemble DOS's logic.

Copy link

codecov bot commented Jan 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (47da430) 67.16% compared to head (ead5b94) 67.16%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #8152   +/-   ##
=========================================
  Coverage     67.16%   67.16%           
  Complexity     2049     2049           
=========================================
  Files           357      357           
  Lines         17158    17158           
  Branches       2461     2461           
=========================================
  Hits          11525    11525           
  Misses         4611     4611           
  Partials       1022     1022           
Flag Coverage Δ
funTest-docker 66.25% <ø> (ø)
funTest-non-docker 33.96% <ø> (ø)
test 36.98% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

),
score = license.score
score = licenses.map { it.score }.average().toFloat()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you know the rational DOS use when coming up with an avg score?
Also, I wonder in what way the CLI option --license-score influences the per file license? (and whether this should become a command line config option).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you know the rational DOS use when coming up with an avg score?

@Etsija and @lamppu probably know better, but the average is pretty much the only metric that somewhat makes sense and is reasonably easy to calculate. Min. does not make sense, max. does not make sense.

Of course, you could do something "crazy" like only taking the scores of those line findings into account whose licenses are also contained in the file finding, and maybe even weight the score according to the number of lines matched, but that's much more work for questionable benefit.

I wonder in what way the CLI option --license-score influences the per file license?

Let's ask the author's: @pombredanne, does ScanCode's --license-score option have an (implicit) impact on which licenses will be part of the per-file detected_license_expression(_spdx) fields?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, you could do something "crazy" like only taking the scores of those line findings into account whose licenses are also contained in the file finding, and maybe even weight the score according to the number of lines matched, but that's much more work for questionable benefit.

Another alternative would be to omit lines and score entirely, with the rational that they stop being meaningful.
The reason I'm picking this up is, because (1) I suspect these values are a bit more likely to exhibit some bit of flakyness at least across scancode version (2) the result of the heuristic gets backed into the stored scan results and is thus very expensive to change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative would be to omit lines and score entirely, with the rational that they stop being meaningful.

Making lines optional would be a bigger change in the data model, and it would prevent us from matching copyright to license findings, and probably have more implications.

However, I forgot that the scoreactually is optional already. Would you prefer to omit it instead of calculating the average?

I suspect these values are a bit more likely to exhibit some bit of flakyness at least across scancode version

More likely to be flaky than what, than the individual per-line findings? As the lines and score are calculated from the per-line findings, they can only be flaky if the per-line findings themselves are already flaky.

the result of the heuristic gets backed into the stored scan results and is thus very expensive to change.

Correct, but that's the way forward how ScanCode will operate, I guess. As this discussion shows, there seem to be no endeavors to enhance the quality of the per-line findings themselves, but to achieve better quality by post-processing these findings into the per-file finding.

Copy link
Member

@fviernau fviernau Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, but that's the way forward how ScanCode will operate, I guess. As this discussion shows, there seem to be no endeavors to enhance the quality of the per-line findings themselves, but to achieve better quality by post-processing these findings into the per-file finding.

I don't much about the processing inside of scancode, so I don't want to move this discussion off-topic,
but it's already interesting that this is actually a post-processing step of per-line findings.
If the only input of this post processing step is the a per-line findings, then it could
also / in the long run make more sense to have that postprocessing step applied by ORT (using ScanCodes logic), not by ScanCode (at least not in a single step) in order to make use of the per-line finding curations. The process could be:

  1. Compute per line findings with ScanCode
  2. Apply per line finding curations aka license finding curations
  3. Take result from 2. as input for the post processing to compute the per-file findings using ScanCode's heuristic
    feeding in the curated license findings.

I believe this would exhibit several advantages over the current solution.

Also, IMO it'd be nice that per line findings would no more be mutually exclusive from per file findings.
Curations would be possible on both levels of granularity.

@pombredanne - what do you think about this idea?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make more sense to have that postprocessing step applied by ORT (using ScanCodes logic)

I agree, if that's possible, and ScanCode does not require internal data for this that's not written to the result file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I forgot that the scoreactually is optional already. Would you prefer to omit it instead of calculating the average?

@fviernau I still need you input here. I believe that's the last thing to clarify before we could merge?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have slight tendency to omit it, but I also believe my knowledge is too limited in that area to make the call.
The reason I believe why its better to omit it, is because I believe the way the score would be defined would even harder to grasp and hardly be actionable. E.g. if an SPDX expression would consist of several 100% findings and a few 80% findings, the final score would have a relatively low score even though some parts of the expression apply a 100%. So,e.g. filtering out on score would not be feasible anymore, because one would have two different ways the score could have been computed.

So, all in all I prefer to omit it, but would be fine if you kept it. Up to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mnonnenmacher what do you think about #8152 (comment) ?

@@ -90,7 +97,14 @@ class ScanCode internal constructor(
}
}

override val configuration by lazy { config.commandLine.joinToString(" ") }
override val configuration by lazy {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the per file findings as entries into the text location based license findings which correspond to a line range an a specific file, also implies that the existing license finding curation mechanism becomes applicable to these per file entries.

So, from the perspective of creating the license finding curations, I wonder about the implication such as would we accept (e.g. in ort-config) also license finding curations which match such per file license findings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we accept (e.g. in ort-config) also license finding curations which match such per file license findings?

I think we should, mostly for two reasons:

  • there's hardly any negative impact compared to not doing so,
  • we should also accept curations for other supported scanners than ScanCode, and these other scanners could theoretically find exactly the file-level finding that ScanCode finds. If we'd accept such a curation for another scanner, we should accept it for ScanCode, too. (This somewhat reminds me of @mnonnenmacher's remark that our license finding curations should also include the scanner name and version.)

@fviernau
Copy link
Member

Can you help me understand whether this PR changes the "cache key" of ORT scan results? So that stored scan results would not match anymore with the new ORT version? (I have in mind changes like moving the strip-root option)

@sschuberth sschuberth force-pushed the scancode-per-file-findings branch 3 times, most recently from 0d397ff to 42d974d Compare January 24, 2024 07:34
@sschuberth
Copy link
Member Author

Can you help me understand whether this PR changes the "cache key" of ORT scan results?

First of all, this PR does not change any behavior of ORT when merged, as the option is disabled by default (String?.toBoolean() only returns true if the string is "true", and returns false in all other cases, including null).

Secondly, as the "fake" --prefer-file-license option becomes part of the scanner configuration if ort.scanner.config.ScanCode.options.preferFileLicense is true, then yes, the "cache key" changes, and results will not be "mixed up" compared to not using that options. Phrased differently, enabling that option for the first time causes rescans.

@sschuberth
Copy link
Member Author

I have in mind changes like moving the strip-root option

Ah, right, that would trigger rescans then, too. So I guess we should drop it, even if basically it's the right thing to do?

@fviernau
Copy link
Member

Ah, right, that would trigger rescans then, too. So I guess we should drop it, even if basically it's the right thing to do?

I'm not sure which option we'd choose. An alternative could be to migrate the scan result storage entries somehow.

@sschuberth
Copy link
Member Author

Ah, right, that would trigger rescans then, too. So I guess we should drop it, even if basically it's the right thing to do?

I'm not sure which option we'd choose. An alternative could be to migrate the scan result storage entries somehow.

I've anyway dropped the last fix(scancode): Move --strip-root to non-config options commit now to not make it delay the merge of this PR further.

@sschuberth sschuberth force-pushed the scancode-per-file-findings branch 6 times, most recently from 5162e4b to bfbda6c Compare January 24, 2024 18:04
Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
Move the knowledge about the default configuration from `ScanCode` to
`ScanCodeConfig`. This allows to make properties of the latter
non-nullable and more strongly typed (lists of strings instead of
unparsed strings). Give some related variables more appropriate names
in that context.

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
The chosen output does not impact the scanner results, so it should not
be considered as part of scanner configuration.

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
Make the output format option directly follow other command line options
and add a commend that it needs to precede the result file path. To
emphasize even more that the latter two belong together, put them into the
same line of code.

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
The output format was hard-coded, so the derived option name was
constant. Simplify the logic by inlining the whole output format
option.

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
This should not be part of the public API.

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
Make the property go first and group command line related code.

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
This is a preparation for highlighting the different number of findings
compared to when enabling a new scanner option introduced in the next
commit.

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
See [1] for discussions about the `detected_license_expression_spdx`, in
particular that it "is not merely the accumulation of the underlying
matches".

Optionally making use of this file-level license aligns ORT's behavior
with that of the Double Open Scanner (DOS), see [2], which is useful for
comparison of results.

[1]: aboutcode-org/scancode-toolkit#3458
[2]: https://github.com/doubleopen-project/dos/blob/616c582/apps/api/src/helpers/db_operations.ts#L55-L78

Signed-off-by: Sebastian Schuberth <sebastian@doubleopen.org>
@sschuberth
Copy link
Member Author

@oss-review-toolkit/kotlin-devs as there was no feedback on the topic of settings the score to null or keeping it at the average, I prefer to merge this as-is with the average to resemble what the Double Open Server (DOS) currently does.

@sschuberth sschuberth requested a review from a team January 29, 2024 10:19
@sschuberth sschuberth merged commit 4af6360 into main Jan 29, 2024
22 checks passed
@sschuberth sschuberth deleted the scancode-per-file-findings branch January 29, 2024 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants