
Coolstore evaluation #294

Merged
merged 17 commits into konveyor:main from evaluation on Aug 21, 2024

Conversation


@pranavgaikwad pranavgaikwad commented Aug 13, 2024

This is a series of experiments we performed on the Coolstore application to evaluate the responses we get from Kai for a variety of migration issues under different conditions. We use an LLM to evaluate the responses themselves.

Conclusion

Zero Shot prompts may work for modernizing small, relatively easy, and isolated examples of source code. The model needs to understand the target technology to produce useful results. However, when larger or more complex codebases are involved, we observe that a capable model alone is not enough. We find that it is essential to point the LLM at specific areas of the codebase and to provide additional, issue-specific information in order to produce high-quality responses.

Our results suggest that the approaches Kai uses to pinpoint issues, using analysis information and solved examples, improve the quality of responses. That said, we also identified some problems with the current implementation of these approaches. We think Kai must improve in the following areas:

  1. Relevancy of solved examples: in the solved example diffs, do not include lines that are unrelated to the incident and may confuse the LLM.
  2. Fixes in the context of the whole file: in the solved example summaries, combine multiple incidents in a file into a single summary, so it is shorter and accounts for fixes that affect other issues in the same file.

We believe that a mix of both the diff and summary approaches might be better. More experimentation is needed to figure out what exactly that mix will look like.

Results

Here are evaluation results for two examples of different complexity under three different approaches to forming prompts:

Easy Example

[evaluation chart for the easy example]

Hard Example

[evaluation chart for the hard example]

Learnings

About models

  • IBM Granite was very unpredictable, and we could not reliably determine the effect of adding analysis information on the quality of its responses. It was hard to identify a pattern and correlate the information.

  • Meta LLAMA 70b performed more consistently than the others across all experiments, with relatively good quality of fixes.

  • GPT-4 performed better than the others in Zero Shot experiments. However, it tended to fix more than necessary, at times introducing new issues.

  • GPT-3 and LLAMA 13b were comparable in terms of quality of fixes. On the other hand, LLAMA 70b and GPT-4 were comparable to each other.

  • Mixtral very often had problems returning a complete, properly formatted response. As a result, in some instances we had to run an experiment multiple times to get results.

Zero Shot Prompts

  • Zero Shot prompts work as long as the underlying model is trained on code for the target technology. No matter the model, the accuracy of fixes degrades as files become larger and/or fixes become more complex.

  • Without any additional information from static code analysis, it was harder to get quality responses. Merely including the line numbers of the incidents in the prompt did not help much; a text description of an issue that hints at a possible solution produced far better responses than line numbers alone. We believe that annotating the input file with line numbers will produce better output (a sketch of such an annotated prompt follows this list).

  • Zero Shot prompts may fix your issue, but the chances of them introducing new issues are higher because the issues are not pinpointed.
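
As a concrete illustration of the analysis-informed prompting described above, here is a minimal sketch in Python. The helper name, the incident fields, and the prompt wording are assumptions for illustration only, not Kai's actual prompt template.

# Illustrative only: a hypothetical helper that annotates the input file with
# line numbers and appends analyzer-provided issue descriptions, instead of
# relying on a bare Zero Shot prompt.
def build_analysis_prompt(source: str, incidents: list[dict]) -> str:
    # Prefix every line with its number so the model can locate the incidents.
    numbered = "\n".join(
        f"{i + 1:4d}| {line}" for i, line in enumerate(source.splitlines())
    )
    # Each incident carries a line number and a description that hints at a fix.
    issues = "\n".join(
        f"- Line {inc['line']}: {inc['message']}" for inc in incidents
    )
    return (
        "You are migrating Java EE code to Quarkus.\n"
        "Fix ONLY the issues listed below; leave the rest of the file unchanged.\n\n"
        f"Issues reported by static analysis:\n{issues}\n\n"
        f"Source file (with line numbers):\n{numbered}\n"
    )


print(build_analysis_prompt(
    "import javax.ejb.Stateless;\n\n@Stateless\npublic class CartService {}",
    [{"line": 1, "message": "Replace the javax.ejb import with its Quarkus CDI equivalent."}],
))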

Few Shot with Solved Example Diffs

  • Prompts that contain diffs of actual solved examples produce much better results than any of the previous methods; you can observe this in the evaluation charts above. Regardless of how capable the underlying model is, results improved with this strategy compared to the Zero Shot approach.

  • However, this approach has limitations. We observed two main problems:

    1. With a big enough input file containing multiple issues, prompts exceed the input limits of most models. This is a blocking problem for Kai, as it ends up not producing a response at all.

    2. The quality of the response depends heavily on the quality of the diff. A pinpointed, issue-specific diff works better than one that contains lines spanning more than just the issue in question; in fact, a larger diff with too many unrelated lines can hurt more than it helps. For this strategy to be helpful, Kai needs to pick the right examples, and the diffs need some pre-processing to make sure only the relevant lines are saved (see the diff-trimming sketch after this list).

  • Notice that for the easy example, the diff we included is very specific and contains only the two lines that are relevant to the issue.

  • For the hard example, we did use a diff from Kai's default data. However, we could not include diffs for all 10 issues in the file because of input limits; we included only one diff, for the hardest fix in the file. Therefore, with this approach, Kai as of today will perform worse on larger files.
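
As a rough illustration of the pre-processing mentioned above, the sketch below trims a unified diff down to the hunk(s) that touch the incident's line, so only pinpointed changes end up in the prompt. The hunk parsing is deliberately simplified and the function is hypothetical, not Kai's actual diff handling.

import re

# Illustrative only: keep just the hunks of a unified diff whose target range
# covers the incident's line, so the solved example stays pinpointed.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def trim_diff_to_line(diff_text: str, incident_line: int) -> str:
    kept, current, keep_current = [], [], False
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            # Starting a new hunk: flush the previous one if it was relevant.
            if keep_current:
                kept.extend(current)
            new_start = int(match.group(1))
            new_len = int(match.group(2) or 1)
            keep_current = new_start <= incident_line < new_start + new_len
            current = [line]
        elif current:  # body lines of the current hunk; file headers are dropped
            current.append(line)
    if keep_current:
        kept.extend(current)
    return "\n".join(kept)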

Few Shot with Solved Example Summary

  • In most of the experiments we did, this proved to be the most successful method for ensuring a high-quality response. For some models we did see instances of the diff approach performing better, but the drop was not significant.

  • We think the main reason this is successful is that it addresses the problems and shortcomings of the previous methods:

    1. The prompts contain pinpointed information, unlike Zero Shot.
    2. The prompts are significantly smaller and do not contain much out-of-scope information.
  • Like the diff approach, the quality of responses depends heavily on the quality of the generated summary. At the time of writing this notebook, Kai generates a summary of each incident individually. However, our experience running these experiments makes us believe that such a naive way of generating summaries won't work reliably. Kai should employ more logic to combine multiple incidents in a file and produce a summary that considers all of the issues, not just one (a rough sketch of this per-file summarization follows).
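
A minimal sketch of the per-file summarization we have in mind is below; the function name, incident fields, and prompt wording are assumptions for illustration, not Kai's current implementation.

from collections import defaultdict

# Illustrative only: group incidents by file and request ONE summary per file,
# instead of summarizing each incident in isolation.
def build_summary_requests(incidents: list[dict]) -> dict[str, str]:
    by_file: dict[str, list[dict]] = defaultdict(list)
    for inc in incidents:
        by_file[inc["file"]].append(inc)

    requests = {}
    for path, incs in by_file.items():
        issue_lines = "\n".join(
            f"- line {i['line']}: {i['message']}"
            for i in sorted(incs, key=lambda i: i["line"])
        )
        requests[path] = (
            f"The solved example changed {path}, which had {len(incs)} migration issues:\n"
            f"{issue_lines}\n"
            "Write ONE short summary of how the example fixed this file, noting how "
            "the individual fixes interact with or affect each other."
        )
    return requests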

About evaluation

  • Evaluating code with LLMs is possible, but we need to improve our evaluation. The evaluator can be given supplemental information about the updated file to improve the accuracy of the evaluation score (a sketch of such an evaluator prompt follows this list). The two main pieces of information are:

    • Whether the issue was fixed - we can run static code analysis again to provide this piece of information
    • Syntactical issues - a simple linter should be enough
  • The "evaluator persona" can further be used to feed information back into Kai's fix generation to get even better results.

@pranavgaikwad pranavgaikwad marked this pull request as draft August 13, 2024 12:38

if __name__ == "__main__":
    # replace the following path with the rules path on your machine
    input_directory = '/home/pranav/Projects/rulesets/default/generated/quarkus/'
Contributor

Probably makes sense to make this a CLI argument or envvar
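
For reference, a minimal sketch of that suggestion (argparse with an environment-variable fallback); the flag and variable names are made up for illustration and are not part of the actual code.

import argparse
import os

# Illustrative only: read the rules directory from a CLI flag, falling back to
# an environment variable (the flag and variable names are hypothetical).
parser = argparse.ArgumentParser()
parser.add_argument(
    "--input-directory",
    default=os.environ.get("RULESETS_DIR", ""),
    help="Path to the generated Quarkus rulesets",
)
args = parser.parse_args()
input_directory = args.input_directory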

Contributor Author

ended up not using that file at all, removed it now

@fabianvf
Contributor

Would be awesome to add the writeup in the PR/at the end of the notebook as a doc as a followup, but doesn't need to happen now

@fabianvf fabianvf merged commit 9bc5781 into konveyor:main Aug 21, 2024
5 checks passed
@pranavgaikwad pranavgaikwad deleted the evaluation branch August 21, 2024 17:21