Optimise the output to reduce token consumption #113

mihaistate · 2024-03-26T11:02:59Z

Created two functions in the esbmc_util file that use regex patterns to reduce the generated output of ESBMC-AI, and a third that removes patterns using the nltk library.
Added tests for the patterns, using specific ESBMC-AI phrases as input strings.
Changed a system instruction in the config.json file to tell the LLM to avoid providing information that is not useful for debugging and development purposes, such as time measurements, assertions, assumptions or warnings. Combined with the three pattern-based functions, this approach leads to a more consistent and efficient output reduction.
Added new scenarios in the config.json file for some bugs.
The implementation has been tested multiple times on various SV-Benchmarks, and it led to consistent reductions in token consumption.

Implemented a function in esbmc_util that uses regex patterns to remove some output lines. This reduces token consumption and makes it less likely that LLMs like GPT 3.5 turbo reach the token limit for codebases that need comprehensive explanations. Added four new cases in config.json that address common bugs in C programs: buffer overflow, arithmetic overflow, array out-of-bounds and memory leaks.
Signed-off-by: mihai.state <mihaita.state@yahoo.com>

Created more regex patterns that remove lines that can be omitted, giving shorter outputs. Added a few more instructions in the config.json file that tell the LLM to avoid mentioning time measurements.
Signed-off-by: mihai.state <mihaita.state@yahoo.com>

Created new output scenarios in the config.json file for NaN check, struct fields check, deadlock check and data races check.
Created the reduce_output field in the config.json file to tell the LLM to produce a smaller output.
Created new regex patterns in the reduce_output2() function.
Created the remove_patterns_nltk function, which uses NLP methods to identify patterns in a sequence of tokenized words. Its advantages are that it provides more granular control over sentences and can be more precise in some situations.
Created the test_output_reducer file, which tests the esbmc_output_optimisation(), reduce_output2() and remove_patterns_nltk() functions and covers all the string patterns.
Added the GPT_4_TURBO_PREVIEW model to the enum in ai_models.py and used it for various code bases.
Signed-off-by: mihai.state <mihaita.state@yahoo.com>

Yiannis128 · 2024-04-16T14:48:11Z

esbmc_ai/__main__.py

@@ -7,11 +7,11 @@
 import sys

 # Enables arrow key functionality for input(). Do not remove import.
-import readline
+#import readline


Why did you comment this out? It says clearly, "Do not remove import."

Yiannis128 · 2024-04-16T14:50:53Z

esbmc_ai/esbmc_util.py

 from . import config
+import nltk
+from nltk.tokenize import word_tokenize,sent_tokenize
+nltk.download('punkt')


This downloads the punkt tokenizer data to the user's system at import time, without explicit user approval. This is bad design: there should be an option in the config to configure where it downloads this. We don't want to be downloading files to users' systems without them knowing ;)
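
One way to make the download opt-in is sketched below. It is only an illustration: `allow_download` and `download_dir` stand for hypothetical config values that do not exist in config.json, and `ensure_punkt` is a made-up helper name. `nltk.data.find` and `nltk.download(..., download_dir=...)` are the real NLTK calls, imported lazily so that nothing happens at module import.

```python
from typing import Callable, Optional


def ensure_punkt(
    allow_download: bool,
    download_dir: Optional[str] = None,
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[..., bool]] = None,
) -> bool:
    """Return True if the punkt tokenizer data is available.

    allow_download and download_dir are hypothetical config values.
    find/download default to nltk.data.find and nltk.download; they are
    parameters only so the logic can be exercised without NLTK installed.
    """
    if find is None or download is None:
        import nltk  # lazy import: no side effects when the module loads

        find = find or nltk.data.find
        download = download or nltk.download

    try:
        find("tokenizers/punkt")  # raises LookupError when the data is missing
        return True
    except LookupError:
        if not allow_download:
            return False  # caller can fall back to a path that needs no NLTK data
        return bool(download("punkt", download_dir=download_dir))
```

If a custom `download_dir` is used, it would also have to be appended to `nltk.data.path` so that later lookups can find the data.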

Yiannis128 · 2024-04-16T15:10:23Z

esbmc_ai/esbmc_util.py

+    esbmc_output = re.sub(r'.*assignments.*|.*assignment.*', '', esbmc_output, flags=re.MULTILINE)
+    esbmc_output = re.sub(r'.*iteratively.*', '', esbmc_output, flags=re.MULTILINE)
+
+    esbmc_output = re.sub(r'.*branches.*', '', esbmc_output, flags=re.MULTILINE) #The informations about branches are not necessary relevant to the understanding of bugs


Consider the following sample:

#include <string.h>
#include <assert.h>

int main() {
    char* value = "branches";
    assert(strcmp("branches", value) == 0);
    return 0;
}

It would produce the following ESBMC output:

ESBMC version 7.5.0 64-bit x86_64 linux
Target: 64-bit little-endian x86_64-unknown-linux with esbmclibc
Parsing sample.c
Converting
Generating GOTO Program
GOTO program creation time: 0.097s
GOTO program processing time: 0.002s
Starting Bounded Model Checking
Unwinding loop 8 iteration 1 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 2 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 3 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 4 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 5 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 6 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 7 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 8 file string.c line 96 column 12 function strcmp
Symex completed in: 0.002s (92 assignments)
Slicing time: 0.000s (removed 90 assignments)
Generated 19 VCC(s), 1 remaining after simplification (2 assignments)
No solver specified; defaulting to Boolector
Encoding remaining VCC(s) using bit-vector/floating-point arithmetic
Encoding to solver time: 0.000s
Solving with solver Boolector 3.2.3
Runtime decision procedure: 0.000s
Building error trace

[Counterexample]

State 1 file sample.c line 6 column 2 function main thread 0
----------------------------------------------------
Violated property:
  file sample.c line 6 column 2 function main
  assertion strcmp("branches", value) == 0
  return_value$_strcmp$1 == 0

VERIFICATION FAILED

Putting that through a regex viewer removes the line assertion strcmp("branches", value) == 0 from the Violated property: section.

Many more of the regexes are too generic to be included. They filter out ESBMC output that is useful.
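
The effect can be reproduced with the pattern from the diff. The sketch below runs it on an excerpt of the output above, then shows one possible mitigation: applying filters only to the text before the Violated property: marker. `filter_before_property` is a hypothetical helper, not something in the PR.

```python
import re

# Excerpt of the ESBMC output quoted in this review comment.
ESBMC_EXCERPT = """\
Symex completed in: 0.002s (92 assignments)
Building error trace
Violated property:
  file sample.c line 6 column 2 function main
  assertion strcmp("branches", value) == 0
  return_value$_strcmp$1 == 0
"""

# Pattern from the PR: blanks every line containing "branches" anywhere,
# including the assertion that tells us what actually failed.
generic_result = re.sub(r".*branches.*", "", ESBMC_EXCERPT)


def filter_before_property(output: str, pattern: str) -> str:
    """Apply the pattern only to the text before 'Violated property:'.

    If the marker is absent, partition() returns the whole output as the
    head, so the pattern is applied to everything.
    """
    head, marker, tail = output.partition("Violated property:")
    return re.sub(pattern, "", head) + marker + tail


protected_result = filter_before_property(
    ESBMC_EXCERPT, r".*branches.*|.*assignments.*"
)
```

Here `generic_result` no longer contains the assertion line, while `protected_result` drops the Symex statistics but keeps the whole property section.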

Yiannis128 · 2024-04-16T15:13:54Z

myenv/bin/Activate.ps1

What does this even do? Why was it included? You included it without a commit message explaining why it should be included in the project. Arguably, this could just be included as a generic script on the system.

Yiannis128 · 2024-04-16T15:28:51Z

So this is quite a messy pull request. It has some big issues on both a technical level and an experimental level.

Technical Issues

You have deleted .env.example which is used as a template for people setting up ESBMC-AI before we make a better config solution.
Why is there a myenv folder with random and unexplained scripts inside? Also why is there a folder path/to/your/virtualenv also committed? The message committing them does not even explain what they're for.
The commit has a lot of empty tabs and spaces. Please make sure to format the code before committing it, as this whitespace appears in the diff and makes it harder to read.
The source code files are not formatted.
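In your regex, you have used flags=re.MULTILINE on patterns that do not need it. The flag only changes ^ and $ so that they match at the start and end of every line instead of only at the start and end of the whole string, so it does nothing for patterns such as .*branches.* that use neither anchor. Use flags=re.MULTILINE only when you want an anchored pattern to apply to each line.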
Some regex operations could be done without regex. For example, filtering strings out would be faster without regex.
With this amount of regex, how can you guarantee it won't remove anything unintentionally? You said that you wrote tests, but please see this comment I've left. There are many more like this. If the current PR is merged, we will have random parts of the ESBMC output removed.
The tests you included have the data inside. It would be better to keep the data outside the test code and load it using fixtures.
You should add an option in the config to enable/disable this feature.
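
Two of the points above (filtering without regex, and a config switch) could be combined roughly as follows. This is a sketch under assumptions: `drop_timing_lines` is a hypothetical name, `enabled` stands for a config option that does not exist yet, and the prefixes are taken from the sample ESBMC output quoted earlier in this review.

```python
from typing import Tuple

# Line prefixes copied from the sample ESBMC output in this review.
# Anchoring on the start of the line avoids matching user code or
# assertions that merely contain the same words.
TIMING_PREFIXES: Tuple[str, ...] = (
    "GOTO program creation time:",
    "GOTO program processing time:",
    "Slicing time:",
    "Encoding to solver time:",
    "Runtime decision procedure:",
)


def drop_timing_lines(
    esbmc_output: str,
    enabled: bool,
    prefixes: Tuple[str, ...] = TIMING_PREFIXES,
) -> str:
    """Remove lines that start with a known timing prefix.

    enabled would come from a (hypothetical) config flag; when it is
    False the output is returned unchanged.
    """
    if not enabled:
        return esbmc_output
    kept = [
        line
        for line in esbmc_output.splitlines()
        if not line.startswith(prefixes)  # str.startswith accepts a tuple
    ]
    return "\n".join(kept)
```

Because only whole-line prefixes are matched, a line such as the assertion in the counterexample is left alone even if it mentions one of these phrases later in the line.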

You shouldn't include everything you work on in a commit, only the things that you would like included in the project.

Practical Issue

Do you have any figures that show any improvement in speed/accuracy of successful repairs? If the accuracy of repairs improves, we should know by how much. If it becomes worse, we need to know why, and the PR cannot be accepted. Also, how slow does this become due to all the regex?
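
Part of this could be measured with a small harness like the sketch below. It only covers output size and filtering time, not repair accuracy, which needs full repair runs. `measure_reduction` is a hypothetical helper, and whitespace-separated words are used as a crude stand-in for real LLM tokens.

```python
import time
from typing import Callable, Iterable, Tuple


def approx_tokens(text: str) -> int:
    """Crude proxy for token count: whitespace-separated words."""
    return len(text.split())


def measure_reduction(
    reducer: Callable[[str], str], outputs: Iterable[str]
) -> Tuple[int, int, float]:
    """Return (tokens before, tokens after, seconds spent in reducer)."""
    before = after = 0
    elapsed = 0.0
    for output in outputs:
        start = time.perf_counter()
        reduced = reducer(output)
        elapsed += time.perf_counter() - start  # time only the reducer call
        before += approx_tokens(output)
        after += approx_tokens(reduced)
    return before, after, elapsed
```

Running it over a fixed set of saved ESBMC outputs, once per candidate reducer, would give comparable numbers for each approach.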

Also, a PR has already been merged that can reduce the ESBMC output to a few tokens by using only the violated property section. Please look at #117. Maybe this PR can be added as another mode of that?
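
For illustration only, extracting that section might look like the sketch below. This is not the implementation from #117. `violated_property_section` is a hypothetical name, and the code assumes the section starts at the Violated property: marker and ends at the first blank line or the end of the output, as in the sample quoted earlier.

```python
def violated_property_section(esbmc_output: str) -> str:
    """Return only the 'Violated property:' section of the output.

    Assumes the section runs from the marker to the first blank line
    (or the end of the output). Returns the input unchanged when the
    marker is missing, e.g. for a successful verification.
    """
    marker = "Violated property:"
    start = esbmc_output.find(marker)
    if start == -1:
        return esbmc_output

    section = []
    for line in esbmc_output[start:].splitlines():
        if not line.strip():
            break  # a blank line ends the section
        section.append(line)
    return "\n".join(section)
```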

mihaistate · 2024-04-19T09:32:13Z

"1. Some regex operations could be done without regex. For example, filtering strings out would be faster without regex."

Regex patterns may prove to be more efficient than any other string method: we need abstract patterns that can remove sentences that are phrased differently but refer to the same idea, which is why I would suggest sticking to regex patterns. I agree that string methods are faster, but they can't match all the phrases that point to the same idea.

Yiannis128 · 2024-04-19T11:34:06Z

"1. Some regex operations could be done without regex. For example, filtering strings out would be faster without regex." "Regex patterns may prove to be more efficient than any other string method: we need abstract patterns that can remove sentences that are phrased differently but refer to the same idea, which is why I would suggest sticking to regex patterns. I agree that string methods are faster, but they can't match all the phrases that point to the same idea."

The problem I have been trying to explain is that using REGEX patterns carelessly, like in this case where you are nuking every line that contains the word “branches”, is not good design. I showed you an example where the useful output was altered.

Also, REGEX is extremely slow. Direct string manipulation is more rigid but also miles faster. So I don't know what you mean by: "The regex pattern may prove to be more efficient", efficient in terms of what?
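
The speed question can be checked rather than argued. Below is a minimal sketch that times a precompiled version of the `.*iteratively.*` pattern from the diff against a plain substring test on the same lines. The sample lines are partly made up (marked in the code), the function names are hypothetical, and no timing figures are claimed here since they depend on the machine and the input.

```python
import re
import timeit

# Two lines from the sample ESBMC output plus one made-up line that
# contains "iteratively", repeated to get a measurable workload.
LINES = [
    "Symex completed in: 0.002s (92 assignments)",
    "Solving iteratively (made-up line for illustration)",
    '  assertion strcmp("branches", value) == 0',
] * 1000

ITERATIVELY = re.compile(r".*iteratively.*")  # compiled once, reused


def filter_with_regex(lines):
    """Drop lines matched by the compiled pattern."""
    return [line for line in lines if not ITERATIVELY.match(line)]


def filter_with_str(lines):
    """Drop lines containing the literal word, without regex."""
    return [line for line in lines if "iteratively" not in line]


if __name__ == "__main__":
    for fn in (filter_with_regex, filter_with_str):
        seconds = timeit.timeit(lambda: fn(LINES), number=100)
        print(f"{fn.__name__}: {seconds:.4f}s for 100 runs")
```

For a literal word like this, both functions keep exactly the same lines, so the choice between them comes down to speed and readability. Regex only adds something when the pattern really needs alternatives or wildcards.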

Removed the 'flags=re.MULTILINE' argument for the regex patterns that work properly without it. Used the re.compile method for some patterns to reduce time complexity.
Signed-off-by: mihai.state <mihaita.state@yahoo.com>

mihaistate05 added 5 commits February 17, 2024 17:37

Fix minor changes

d8f6704

Added test for the GPT-4-turbo-preview AI model

c7ded6f

Yiannis128 reviewed Apr 16, 2024

Optimized the regex patterns to be more time efficient

36798da

mihaistate force-pushed the master branch from c7969d0 to 36798da Compare June 27, 2024 18:56
