Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise the output to reduce token consumption #113

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

mihaistate
Copy link

Created two functions in the esbmc_util file that use regex patterns to reduce the generated output of ESBMC-AI and another one that removes patterns by using the nltk library.
Added tests for the patterns that use specific ESBMC-AI phrases as input strings.
Changed the content of a system instruction in the config.json file to tell the LLM to avoid providing some informations that are not useful for debbuging and development puroposes, such as time measurements, assertions, assumptions or warnings. This approach leads to a more consistent and efficient output reduction alongside the three functions that use specific patterns.
Added new scenarious in the config.json file for some bugs.
The implementation has been tested on various SV-Benchmarks multiple times and they lead to consistent token consumption reductions.

Implemented a function in esbmc_util that uses regex patterns to remove some output
lines to reduce token consmuption and to facilitate the reduction of the likelihood that
LLMs like GPT 3.5 turbo reach the token limit for some codebases that need
comprehensive explanations.

Addded in config.json four new cases that address common bugs in C programs: buffer overflow,
arithmetic overflow, array out-of-bounds and memory leaks.

Signed-off-by: mihai.state <mihaita.state@yahoo.com>
Created more regex patterns that remove some lines that can be ommitted
and provide shorter outputs.
Added a few more indications in the config.json file that
tell the LLM to avoid mentioning time measurements

Signed-off-by: mihai.state <mihaita.state@yahoo.com>
Created new output scenarious in the config.json file
for NaN check, struct fields check, deadlock check and data races check
Created the reduce_output field in the config.json file to tell the
LLM to produce a smaller output
Created new regex patterns in the reduce_output2() function
Created the remove_patterns_nltk that uses NLP methods to identify
patterns in a sequence of tokenized words. The advantage of it are
that it provides more granular control over text sentences, it can
be more precise in some situations.
Created the test_output_reducer file that tests the functionality of the esbmc_output_optimisation(),
reduce_output2() and remove_patterns_nltk() functions that covers all the string patterns
Instantiated the GPT_4_TURBO_PREVIEW model in the enum of the ai_models.py and used it for various
code bases.

Signed-off-by: mihai.state <mihaita.state@yahoo.com>
@@ -7,11 +7,11 @@
import sys

# Enables arrow key functionality for input(). Do not remove import.
import readline
#import readline
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you comment this out? It says clearly, "Do not remove import."

from . import config
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
nltk.download('punkt')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This downloads a library to your system without explicit user approval. This is bad design, there should be an option in the config to configure where it downloads this. We don't want to be downloading files to the user's systems without them knowing ;)

esbmc_output = re.sub(r'.*assignments.*|.*assignment.*', '', esbmc_output, flags=re.MULTILINE)
esbmc_output = re.sub(r'.*iteratively.*', '', esbmc_output, flags=re.MULTILINE)

esbmc_output = re.sub(r'.*branches.*', '', esbmc_output, flags=re.MULTILINE) #The informations about branches are not necessary relevant to the understanding of bugs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the following sample:

#include <string.h>
#include <assert.h>

int main() {
	char* value = "branches";
	assert(strcmp("branches", value) == 0);
	return 0;
}

Would produce the following ESBMC output:

ESBMC version 7.5.0 64-bit x86_64 linux
Target: 64-bit little-endian x86_64-unknown-linux with esbmclibc
Parsing sample.c
Converting
Generating GOTO Program
GOTO program creation time: 0.097s
GOTO program processing time: 0.002s
Starting Bounded Model Checking
Unwinding loop 8 iteration 1   file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 2   file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 3   file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 4   file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 5   file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 6   file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 7   file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 8   file string.c line 96 column 12 function strcmp
Symex completed in: 0.002s (92 assignments)
Slicing time: 0.000s (removed 90 assignments)
Generated 19 VCC(s), 1 remaining after simplification (2 assignments)
No solver specified; defaulting to Boolector
Encoding remaining VCC(s) using bit-vector/floating-point arithmetic
Encoding to solver time: 0.000s
Solving with solver Boolector 3.2.3
Runtime decision procedure: 0.000s
Building error trace

[Counterexample]


State 1 file sample.c line 6 column 2 function main thread 0
----------------------------------------------------
Violated property:
  file sample.c line 6 column 2 function main
  assertion strcmp("branches", value) == 0
  return_value$_strcmp$1 == 0


VERIFICATION FAILED

Putting that through a regex viewer removes from the Violated property: section the line assertion strcmp("branches", value) == 0.

There are many more regex that are too generic to be included. You filter out ESBMC output that is useful.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this even do? Why was it included? You included it without a commit message explaining why it should be included in the project. Arguably, this could just be included as a generic script on the system.

@Yiannis128
Copy link
Collaborator

So this is quite a messy pull request, the pull request has some big issues both on a technical level and an experimental level.

Technical Issues

  1. You have deleted .env.example which is used as a template for people setting up ESBMC-AI before we make a better config solution.
  2. Why is there a myenv folder with random and unexplained scripts inside? Also why is there a folder path/to/your/virtualenv also committed? The message committing them does not even explain what they're for.
  3. The commit has a lot of empty tab spaces. Please make sure to format the code before committing it as it appears in the diff and makes it harder to read.
  4. The source code files are not formatted.
  5. In your regex, you have used flags=re.MULTILINE for operations that occur only on one line. There is no need for this and it would slow down the operations... You use flags=re.MULTILINE only when you want to apply the regex to each line.
  6. Some regex operations could be done without regex. For example, filtering strings out would be faster without regex.
  7. With this amount of regex, how can you guarantee it won't remove anything unintentionally? You said that you wrote tests, but please see this comment I've left. There are many more like this. If the current PR is merged, we will have random parts of the ESBMC output removed.
  8. The tests you included have the data inside. Would be better to have the data outside the test code and load it in using fixtures.
  9. You should add an option in the config to enable/disable this feature.

You shouldn't include everything in a commit that you work on. Only things that you would like included in the project.

image

Practical Issue

Do you have any figures that show any improvement in speed/accuracy of successful repairs? If the accuracy of repairs improves, we should know by how much. If it becomes worse, we need to know why, also it cannot be accepted. Also, how slow does this become due to all the regex?

Also, there's already a merge that happened that can reduce the ESBMC output to a few tokens as it uses only the violated property section only, please look at: #117 Maybe this PR can be added as another mode of that?

@mihaistate
Copy link
Author

mihaistate commented Apr 19, 2024

"1.Some regex operations could be done without regex. For example, filtering strings out would be faster without regex.
"
The regex pattern may prove to be more efficient than any other string method, we need abstract patterns that are able to remove sentences that will be phrased differentlty, but refer to the same idea, that is why I would suggest to stick to regex patterns. I agree that the string methods are faster, but they can't match all the phrases that point to the same idea

@Yiannis128
Copy link
Collaborator

"1.Some regex operations could be done without regex. For example, filtering strings out would be faster without regex. " The regex pattern may prove to be more efficient than any other string method, we need abstract patterns that are able to remove sentences that will be phrased differentlty, but refer to the same idea, that is why I would suggest to stick to regex patterns. I agree that the string methods are faster, but they can't match all the phrases that point to the same idea

The problem I have been trying to explain is that REGEX patterns used carelessly, like in this case where you are nuking lines with the word “branches” is not good design. I showed you an example where the useful output was altered.

Also, REGEX is extremely slow. Direct string manipulation is more rigid but also miles faster. So I don't know what you mean by: "The regex pattern may prove to be more efficient", efficient in terms of what?

Removed the 'flags=re.MULTILINE' argument for the regex patterns that work properly without it
Used the re.compile method for some patterns to reduce time complexity

Signed-off-by: mihai.state <mihaita.state@yahoo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants