
Coolstore evaluation #294

Merged
merged 17 commits into konveyor:main from evaluation on Aug 21, 2024

Conversation


@pranavgaikwad pranavgaikwad commented Aug 13, 2024

This is a series of experiments we performed on the Coolstore application to evaluate the responses we get from Kai for a variety of migration issues under different conditions. We use an LLM to evaluate the responses themselves.

Conclusion

Zero Shot prompts may work for modernizing small, relatively easy, and isolated examples of source code. The model needs to understand the target technology to produce useful results. However, when larger or more complex codebases are involved, we observe that a capable model alone is not enough. We find that it is essential to point the LLM at specific areas of the codebase and to provide additional, issue-specific information in order to produce high-quality responses.

Our results suggest that the approaches Kai uses to pinpoint issues, using analysis information and solved examples, improve the quality of responses. That said, we also identified some problems with the current implementation of these approaches. We think Kai must improve in the following areas:

  1. Relevancy of solved examples: in the solved example diffs, do not include lines that are unrelated to the incident and may confuse the LLM.
  2. Fixes in the context of the whole file: in the solved example summaries, combine multiple incidents in a file into a single summary, so it is shorter and accounts for fixes that affect other issues in the same file.

We believe that a mix of both the diff and summary approaches might be better. More experimentation is needed to figure out what exactly that mix will look like.

Results

Here are evaluation results for two examples of different complexity under three different approaches to forming prompts:

Easy Example

[evaluation chart for the easy example]

Hard Example

[evaluation chart for the hard example]

Learnings

About models

  • IBM Granite was very unpredictable, and we could not reliably determine the effect of adding analysis information on the quality of its responses. It was hard to identify a pattern and correlate the information.

  • Meta LLAMA 70b performed more consistently than the others across all experiments, with relatively good quality of fixes.

  • GPT-4 performed better than the others in Zero Shot experiments. However, it tended to fix more than necessary, at times introducing new issues.

  • GPT-3 and LLAMA 13b were comparable in terms of quality of fixes. On the other hand, LLAMA 70b and GPT-4 were comparable to each other.

  • Mixtral very often had problems returning a complete, properly formatted response. As a result, in some instances we had to run an experiment multiple times to get results.

Zero Shot Prompts

  • Zero Shot prompts work as long as the underlying model is trained on code for the target technology. No matter the model, the accuracy of fixes degrades as files become larger and/or fixes become more complex.

  • Without any additional information from static code analysis, it was harder to get quality responses. Merely including the line numbers of the incidents in the prompt did not help much; a text description of an issue that hints at a possible solution produced far better responses than line numbers alone. We believe that annotating the input file with line numbers will produce better output (a sketch of such an annotated prompt follows this list).

  • Zero Shot prompts may fix your issue, but the chances of them introducing new issues are higher because the issues are not pinpointed.
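
As a concrete illustration of the analysis-informed prompting described above, here is a minimal sketch in Python. The helper name, the incident fields, and the prompt wording are assumptions for illustration only, not Kai's actual prompt template.

# Illustrative only: a hypothetical helper that annotates the input file with
# line numbers and appends analyzer-provided issue descriptions, instead of
# relying on a bare Zero Shot prompt.
def build_analysis_prompt(source: str, incidents: list[dict]) -> str:
    # Prefix every line with its number so the model can locate the incidents.
    numbered = "\n".join(
        f"{i + 1:4d}| {line}" for i, line in enumerate(source.splitlines())
    )
    # Each incident carries a line number and a description that hints at a fix.
    issues = "\n".join(
        f"- Line {inc['line']}: {inc['message']}" for inc in incidents
    )
    return (
        "You are migrating Java EE code to Quarkus.\n"
        "Fix ONLY the issues listed below; leave the rest of the file unchanged.\n\n"
        f"Issues reported by static analysis:\n{issues}\n\n"
        f"Source file (with line numbers):\n{numbered}\n"
    )


print(build_analysis_prompt(
    "import javax.ejb.Stateless;\n\n@Stateless\npublic class CartService {}",
    [{"line": 1, "message": "Replace the javax.ejb import with its Quarkus CDI equivalent."}],
))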

Few Shot with Solved Example Diffs

  • Prompts that contain diffs of actual solved examples produce much better results than any of the previous methods; you can observe this in the evaluation charts above. Regardless of how capable the underlying model is, results improved with this strategy compared to the Zero Shot approach.

  • However, this approach has limitations. We observed two main problems:

    1. With a big enough input file containing multiple issues, prompts exceed the input limits of most models. This is a blocking problem for Kai, as it ends up not producing a response at all.

    2. The quality of the response depends heavily on the quality of the diff. A pinpointed, issue-specific diff works better than one that contains lines spanning more than just the issue in question; in fact, a larger diff with too many unrelated lines can hurt more than it helps. For this strategy to be helpful, Kai needs to pick the right examples, and the diffs need some pre-processing to make sure only the relevant lines are saved (see the diff-trimming sketch after this list).

  • Notice that for the easy example, the diff we included is very specific and contains only the two lines that are relevant to the issue.

  • For the hard example, we did use a diff from Kai's default data. However, we could not include diffs for all 10 issues in the file because of input limits; we included only one diff, for the hardest fix in the file. Therefore, with this approach, Kai as of today will perform worse on larger files.
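
As a rough illustration of the pre-processing mentioned above, the sketch below trims a unified diff down to the hunk(s) that touch the incident's line, so only pinpointed changes end up in the prompt. The hunk parsing is deliberately simplified and the function is hypothetical, not Kai's actual diff handling.

import re

# Illustrative only: keep just the hunks of a unified diff whose target range
# covers the incident's line, so the solved example stays pinpointed.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def trim_diff_to_line(diff_text: str, incident_line: int) -> str:
    kept, current, keep_current = [], [], False
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            # Starting a new hunk: flush the previous one if it was relevant.
            if keep_current:
                kept.extend(current)
            new_start = int(match.group(1))
            new_len = int(match.group(2) or 1)
            keep_current = new_start <= incident_line < new_start + new_len
            current = [line]
        elif current:  # body lines of the current hunk; file headers are dropped
            current.append(line)
    if keep_current:
        kept.extend(current)
    return "\n".join(kept)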

Few Shot with Solved Example Summary

  • In most of the experiments we did, this proved to be the most successful method for ensuring a high-quality response. For some models we did see instances of the diff approach performing better, but the drop was not significant.

  • We think the main reason this is successful is that it addresses the problems and shortcomings of the previous methods:

    1. The prompts contain pinpointed information, unlike Zero Shot.
    2. The prompts are significantly smaller and do not contain much out-of-scope information.
  • Like the diff approach, the quality of responses depends heavily on the quality of the generated summary. At the time of writing this notebook, Kai generates a summary of each incident individually. However, our experience running these experiments makes us believe that such a naive way of generating summaries won't work reliably. Kai should employ more logic to combine multiple incidents in a file and produce a summary that considers all of the issues, not just one (a rough sketch of this per-file summarization follows).
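
A minimal sketch of the per-file summarization we have in mind is below; the function name, incident fields, and prompt wording are assumptions for illustration, not Kai's current implementation.

from collections import defaultdict

# Illustrative only: group incidents by file and request ONE summary per file,
# instead of summarizing each incident in isolation.
def build_summary_requests(incidents: list[dict]) -> dict[str, str]:
    by_file: dict[str, list[dict]] = defaultdict(list)
    for inc in incidents:
        by_file[inc["file"]].append(inc)

    requests = {}
    for path, incs in by_file.items():
        issue_lines = "\n".join(
            f"- line {i['line']}: {i['message']}"
            for i in sorted(incs, key=lambda i: i["line"])
        )
        requests[path] = (
            f"The solved example changed {path}, which had {len(incs)} migration issues:\n"
            f"{issue_lines}\n"
            "Write ONE short summary of how the example fixed this file, noting how "
            "the individual fixes interact with or affect each other."
        )
    return requests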

About evaluation

  • Evaluating code with LLMs is possible, but we need to improve our evaluation. The evaluator can be given supplemental information about the updated file to improve the accuracy of the evaluation score (a sketch of such an evaluator prompt follows this list). The two main pieces of information are:

    • Whether the issue was fixed - we can run static code analysis again to provide this piece of information
    • Syntactical issues - a simple linter should be enough
  • The "evaluator persona" can further be used to feed information back into Kai's fix generation to get even better results.

@pranavgaikwad pranavgaikwad marked this pull request as draft August 13, 2024 12:38

if __name__ == "__main__":
    # replace the following path with the rules path on your machine
    input_directory = '/home/pranav/Projects/rulesets/default/generated/quarkus/'
Contributor

Probably makes sense to make this a CLI argument or envvar
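
For reference, a minimal sketch of that suggestion (argparse with an environment-variable fallback); the flag and variable names are made up for illustration and are not part of the actual code.

import argparse
import os

# Illustrative only: read the rules directory from a CLI flag, falling back to
# an environment variable (the flag and variable names are hypothetical).
parser = argparse.ArgumentParser()
parser.add_argument(
    "--input-directory",
    default=os.environ.get("RULESETS_DIR", ""),
    help="Path to the generated Quarkus rulesets",
)
args = parser.parse_args()
input_directory = args.input_directory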

Contributor Author

ended up not using that file at all, removed it now

@fabianvf
Contributor

Would be awesome to add the writeup in the PR/at the end of the notebook as a doc as a followup, but doesn't need to happen now

@fabianvf fabianvf merged commit 9bc5781 into konveyor:main Aug 21, 2024
5 checks passed
@pranavgaikwad pranavgaikwad deleted the evaluation branch August 21, 2024 17:21