Coolstore evaluation #294
Conversation
Signed-off-by: Pranav Gaikwad <pgaikwad@redhat.com>
if __name__ == "__main__":
    # replace the following path with the rules path on your machine
    input_directory = '/home/pranav/Projects/rulesets/default/generated/quarkus/'
Probably makes sense to make this a CLI argument or envvar
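One way to do that, as a sketch (the flag name, environment variable name, and default path here are illustrative assumptions, not part of the actual script):

```python
import argparse
import os


def get_input_directory(argv=None) -> str:
    """Resolve the rules directory from a CLI flag, falling back to an
    environment variable, then to a default. Names are illustrative."""
    parser = argparse.ArgumentParser(description="Evaluate rules in a directory")
    parser.add_argument(
        "--input-directory",
        default=os.environ.get("RULES_INPUT_DIR", "./rules"),
        help="Path to the generated rules (env: RULES_INPUT_DIR)",
    )
    args = parser.parse_args(argv)
    return args.input_directory
```

Taking `argv` as a parameter keeps the function testable without touching `sys.argv`.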
Ended up not using that file at all; removed it now.
Would be awesome to add the writeup in the PR / at the end of the notebook as a doc as a follow-up, but it doesn't need to happen now.
This is a series of experiments we performed on the Coolstore application, evaluating the responses we get from Kai for various migration issues under different conditions. We use an LLM to evaluate the responses themselves.
Conclusion
Zero Shot prompts may work for modernizing small, relatively easy, and isolated examples of source code. The model needs to understand the target technology to produce useful results. However, when larger or more complex codebases are involved, we observe that a smart model alone is not enough. It is essential to point the LLM at specific areas of the codebase and provide additional, specific information about the issues to produce high-quality responses.
Our results suggest that the approaches Kai uses to pinpoint issues, using analysis information and solved examples, improve the quality of responses. That being said, we also identified some problems with the current implementation of these approaches. We think that Kai must improve in the following areas:
We believe that a mix of both the diff and summary approaches might be better. More experimentation is needed to figure out exactly what that mix will look like.
Results
Here are some evaluation results we got for two different complexities of examples under three different approaches to forming prompts:
Easy Example
Hard Example
Learnings
About models
IBM Granite is very unpredictable, and we could not reliably determine the effect of adding analysis information on the quality of its responses. It was hard to identify a pattern and correlate the information.
Meta LLAMA 70b seemed to perform more consistently (across all experiments) than the others, with relatively good quality fixes.
GPT-4 performed better than the others in Zero Shot experiments. However, it seemed to fix more than necessary, at times introducing new issues.
GPT-3 and LLAMA 13b have been comparable in terms of quality of fixes. On the other hand, LLAMA 70b and GPT-4 are comparable to each other.
Mixtral very often had problems returning a complete, well-formatted response. As a result, in some instances we had to run an experiment multiple times to get results.
Zero Shot Prompts
Zero Shot prompts work as long as the underlying model is trained on code for the target technology. No matter the model, the accuracy of fixes degrades as files become larger and/or fixes become more complex.
Without any additional information from static code analysis, it was harder to get quality responses. Merely including the line numbers of incidents in the prompt did not help much. A text description of an issue that hints at a possible solution produced far better responses than line numbers alone. We believe that annotating the input file with line numbers will produce better output.
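One lightweight way to do that annotation is sketched below; the marker format is an assumption for illustration, not what Kai actually emits:

```python
def annotate_with_line_numbers(source: str, incident_lines=None) -> str:
    """Prefix each line of a source file with its 1-based line number so the
    model can tie analysis incidents to exact locations; lines listed in
    incident_lines get an extra marker. Formatting is purely illustrative."""
    incident_lines = set(incident_lines or [])
    out = []
    for i, line in enumerate(source.splitlines(), start=1):
        marker = "  # <-- incident" if i in incident_lines else ""
        out.append(f"{i:4}: {line}{marker}")
    return "\n".join(out)
```

The annotated text would then be embedded in the prompt in place of the raw file, so the line numbers referenced by the analysis are visible to the model.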
Zero Shot prompts may fix your issue, but the chances of them introducing new issues are higher because the issues are not pinpointed.
Few Shot with Solved Example Diffs
Prompts that contain diffs of actual solved examples produce much better results than any of the previous methods, as the evaluation charts above show. No matter how smart the underlying model, results improved with this strategy compared to the Zero Shot approach.
However, there are limitations. We observed two main problems with this approach:
With a big enough input file containing multiple issues, prompts exceed the input limits of most models. This is a blocking problem for Kai, as it ends up not producing a response at all.
The quality of the response depends heavily on the quality of the diff. A pinpointed, specific diff works better than one containing lines that span more than the issue in question. In fact, a larger diff with too many unrelated lines can hurt more than it helps. For this strategy to be helpful, Kai needs to pick the right examples, and there needs to be some pre-processing of the diffs to ensure only the relevant lines are kept.
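The kind of pre-processing described above could be sketched as a filter that keeps only the hunks of a unified diff whose changed lines mention the construct being migrated. This is an illustration of the idea, not Kai's implementation:

```python
def filter_diff_hunks(diff_text: str, keyword: str) -> str:
    """Keep only the hunks of a unified diff whose added/removed lines
    mention `keyword` (e.g. the annotation or import being migrated)."""
    kept, current, relevant = [], [], False
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            # flush the previous hunk if it touched the keyword
            if current and relevant:
                kept.extend(current)
            current, relevant = [line], False
        else:
            current.append(line)
            if line[:1] in "+-" and keyword in line:
                relevant = True
    if current and relevant:
        kept.extend(current)
    return "\n".join(kept)
```

In practice the filter criterion would come from the analysis incident (rule id, symbol name) rather than a literal keyword, but the effect is the same: the solved-example diff shrinks to just the lines that demonstrate the fix.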
Notice that for the easy example, the diff we included is very specific, containing only the two lines relevant to the issue.
For the hard example, we used a diff from Kai's default data. However, we could not include diffs for all 10 issues in the file because of input limits; we included only one diff, for the hardest fix in the file. Therefore, Kai as of today will perform worse with this approach on larger files.
Few Shot with Solved Example Summary
In most of our experiments, this proved to be the most successful method for producing a high-quality response. For some models we did see instances of the diff approach performing better, but the drop was small.
We think the main reason this is successful is that it addresses all the problems and shortcomings of the previous methods:
Like the diff-only approach, the quality of responses depends heavily on the quality of the generated summary. At the time of writing this notebook, Kai generates a summary of each incident individually. However, our experience running these experiments makes us believe that such a naive way of generating summaries won't work reliably. Kai should employ more logic to combine multiple incidents in a file and produce a summary that considers all of the issues, not just one.
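A minimal sketch of combining incidents per file rather than summarizing each in isolation; the incident dict shape (`file`, `line`, `message`) is an assumption for illustration:

```python
from collections import defaultdict


def combine_incidents(incidents):
    """Group analysis incidents by file and fold them into one summary
    block per file, so a downstream summarizer sees all issues together."""
    by_file = defaultdict(list)
    for inc in incidents:
        by_file[inc["file"]].append(inc)
    summaries = {}
    for path, incs in by_file.items():
        lines = [f"File {path} has {len(incs)} issue(s) to fix together:"]
        for inc in sorted(incs, key=lambda i: i["line"]):
            lines.append(f"- line {inc['line']}: {inc['message']}")
        summaries[path] = "\n".join(lines)
    return summaries
```

The combined block (or an LLM-generated summary of it) would then replace the per-incident summaries in the prompt, letting the model plan one coherent fix instead of ten independent ones.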
About evaluation
Evaluating code with LLMs is possible, but we need to improve our evaluation. The evaluator can be given supplemental information about the updated file to improve the accuracy of the evaluation score. The two main pieces of information include:
The "evaluator persona" can further be used to feed information back into the Kai fix to get even better results.