Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an option to prefer ScanCode's file- over line-level findings #8152
Add an option to prefer ScanCode's file- over line-level findings #8152
Changes from all commits
9a35bd6
bf8a138
ae20031
7245f60
e6deb43
0046a2e
b405929
24e7163
ead5b94
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the per file findings as entries into the text location based license findings which correspond to a line range an a specific file, also implies that the existing license finding curation mechanism becomes applicable to these per file entries.
So, from the perspective of creating the license finding curations, I wonder about the implication such as would we accept (e.g. in ort-config) also license finding curations which match such per file license findings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should, mostly for two reasons:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you know the rational DOS use when coming up with an
avg
score?Also, I wonder in what way the CLI option
--license-score
influences the per file license? (and whether this should become a command line config option).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Etsija and @lamppu probably know better, but the average is pretty much the only metric that somewhat makes sense and is reasonably easy to calculate. Min. does not make sense, max. does not make sense.
Of course, you could do something "crazy" like only taking the scores of those line findings into account whose licenses are also contained in the file finding, and maybe even weight the score according to the number of lines matched, but that's much more work for questionable benefit.
Let's ask the author's: @pombredanne, does ScanCode's
--license-score
option have an (implicit) impact on which licenses will be part of the per-filedetected_license_expression(_spdx)
fields?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative would be to omit
lines
andscore
entirely, with the rational that they stop being meaningful.The reason I'm picking this up is, because (1) I suspect these values are a bit more likely to exhibit some bit of flakyness at least across scancode version (2) the result of the heuristic gets backed into the stored scan results and is thus very expensive to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making lines optional would be a bigger change in the data model, and it would prevent us from matching copyright to license findings, and probably have more implications.
However, I forgot that the
score
actually is optional already. Would you prefer to omit it instead of calculating the average?More likely to be flaky than what, than the individual per-line findings? As the lines and score are calculated from the per-line findings, they can only be flaky if the per-line findings themselves are already flaky.
Correct, but that's the way forward how ScanCode will operate, I guess. As this discussion shows, there seem to be no endeavors to enhance the quality of the per-line findings themselves, but to achieve better quality by post-processing these findings into the per-file finding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't much about the processing inside of scancode, so I don't want to move this discussion off-topic,
but it's already interesting that this is actually a post-processing step of per-line findings.
If the only input of this post processing step is the a per-line findings, then it could
also / in the long run make more sense to have that postprocessing step applied by ORT (using ScanCodes logic), not by ScanCode (at least not in a single step) in order to make use of the per-line finding curations. The process could be:
feeding in the curated license findings.
I believe this would exhibit several advantages over the current solution.
Also, IMO it'd be nice that per line findings would no more be mutually exclusive from per file findings.
Curations would be possible on both levels of granularity.
@pombredanne - what do you think about this idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, if that's possible, and ScanCode does not require internal data for this that's not written to the result file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fviernau I still need you input here. I believe that's the last thing to clarify before we could merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd have slight tendency to omit it, but I also believe my knowledge is too limited in that area to make the call.
The reason I believe why its better to omit it, is because I believe the way the score would be defined would even harder to grasp and hardly be actionable. E.g. if an SPDX expression would consist of several 100% findings and a few 80% findings, the final score would have a relatively low score even though some parts of the expression apply a 100%. So,e.g. filtering out on score would not be feasible anymore, because one would have two different ways the score could have been computed.
So, all in all I prefer to omit it, but would be fine if you kept it. Up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mnonnenmacher what do you think about #8152 (comment) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @Etsija, @lamppu, @mmurto: This should resemble DOS's logic.