FEAT add many-shot jailbreaking #130
Hi @romanlutz, I'd like to help out with implementing the many-shot jailbreaking feature. I'll read the paper, and if your suggestion about needing 256+ Q/A pairs seems to be correct, I'll start with that. Since this will be my first time contributing to an open-source project, could you please provide some guidance on the general steps for contributing? Thanks!
Hi @KutalVolkan ! Thanks for reaching out! We'd love to collaborate on this one 🙂 I see this as two tasks really:
For the former, we have prompt templates under PyRIT/datasets/prompt_templates. Perhaps it's possible to write one that has one placeholder for where the examples would go, but then have a new subclass of PromptTemplate that can insert all the examples rather than just one? Something like
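A minimal sketch of what such a subclass might look like. The class and method names here (ManyShotTemplate, apply_parameters) are illustrative assumptions, not actual PyRIT API; the point is just a template with a single placeholder that expands into many Q/A pairs:

```python
from dataclasses import dataclass


@dataclass
class ManyShotTemplate:
    # Hypothetical sketch, not PyRIT's real PromptTemplate class.
    # The template has one {{ examples }} placeholder for all Q/A pairs.
    template: str

    def apply_parameters(self, prompt: str, examples: list[dict]) -> str:
        # Render each Q/A pair as a faked dialogue turn and join them.
        rendered = "\n".join(
            f"User: {ex['question']}\nAssistant: {ex['answer']}"
            for ex in examples
        )
        filled = self.template.replace("{{ examples }}", rendered)
        return filled.replace("{prompt}", prompt)


template = ManyShotTemplate(
    template="{{ examples }}\nUser: {prompt}\nAssistant:"
)
result = template.apply_parameters(
    prompt="final question",
    examples=[
        {"question": "Q1", "answer": "A1"},
        {"question": "Q2", "answer": "A2"},
    ],
)
```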
Where examples would be the Q&A pairs. And then a simple orchestrator like PromptSendingOrchestrator could handle sending it to targets. For the latter, we don't really want to become the place where all the bad stuff from the internet is collected 😄 Ideally, we would want to find these in another repository and just have an import function. Plus, people can always generate or write their own set, of course. Regarding contributing guidelines there should be plenty in the doc folder. Please let me know if you have questions or want to chat about any of these points! I may very well have skipped something...
Hi @romanlutz, I'll start by reading the paper and then implement the many-shot jailbreaking feature as you described. I'll keep you updated on my progress. Thanks!
Fantastic! I guess I made an assumption here that the "many shots" are just in one prompt. Another option would be to "fake" the conversation history, which is possible with some plain model endpoints but rather unlikely with full generative AI systems (which should prevent you from doing that). So I think I'd go with the single prompt, and hence the prompt template makes sense. Happy to discuss options, though!
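To make the "faked conversation history" alternative concrete, here is a small sketch of how the Q/A pairs could be injected as prior user/assistant turns against a raw chat endpoint. The message schema shown is the common OpenAI-style chat format; no real endpoint is called, and the function name is a made-up placeholder:

```python
def build_faked_history(examples: list[dict], final_prompt: str) -> list[dict]:
    # Hypothetical helper: each Q/A pair becomes a prior user/assistant
    # exchange, so the model "sees" many compliant turns before the
    # final prompt. Only works on endpoints that accept arbitrary
    # assistant-role messages in the history.
    messages = []
    for ex in examples:
        messages.append({"role": "user", "content": ex["question"]})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user", "content": final_prompt})
    return messages


history = build_faked_history(
    [{"question": "Q1", "answer": "A1"}], "final question"
)
```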
Hello @romanlutz, I just wanted to inform you that, according to the paper, we can use this uncensored model: WizardLM-13B-Uncensored. We can use it to provide answers to the following questions in the "behavior" column of this dataset: harmbench_behaviors_text_all.csv. I tried to run the model locally and encountered an issue:
This issue is likely not solvable according to this discussion: GitHub Issue. Therefore, I thought about using the inference endpoints from Hugging Face instead. P.S. Your approach of using a single prompt definitely makes sense, and I will go with that.
We usually use model endpoints in Azure, so I can't comment much on running locally. Maybe using an earlier version of torch helps? PyRIT shouldn't be too opinionated on which one you use. The list of prompts you found makes sense. Still, we'd have to check in the responses somewhere. As mentioned before, I'd prefer to avoid making PyRIT the place where all the bad stuff on the internet is collected. Maybe it makes sense to put that Q&A dataset in a separate repo from where we can import it? Just thinking out loud here...
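A rough sketch of what such an import function might look like. The JSON schema and function name are assumptions (the real dataset's format hadn't been settled at this point); parsing is shown against an inline sample so the sketch runs without network access:

```python
import json


def parse_many_shot_examples(raw_json: str) -> list[dict]:
    # Hypothetical parser for a Q/A dataset fetched from a separate
    # repo. Assumes a top-level "examples" list with "question" and
    # "answer" keys; the real schema may differ.
    data = json.loads(raw_json)
    return [
        {"question": item["question"], "answer": item["answer"]}
        for item in data["examples"]
    ]


# Inline sample standing in for a downloaded file.
sample = '{"examples": [{"question": "Q1", "answer": "A1"}]}'
examples = parse_many_shot_examples(sample)
```

In practice the raw JSON would be fetched from the external repo (and ideally cached locally) before parsing.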
Hey @romanlutz, The dataset is ready, and I will place the Q&A dataset in a separate repo. However, I will need some time to implement everything. I have a deadline on June 20, so I aim to have it all (implementation and dataset) completed by the end of June. Thanks for your patience! |
Amazing, @KutalVolkan ! No pressure, of course. I'll try to provide timely feedback as usual. If you have questions please feel free to reach out. |
Hi @romanlutz I have completed the code implementation for the many-shot jailbreaking feature as discussed earlier. Code Integration
Dataset Integration
Important Links
Testing Phase
The feature is currently undergoing testing. One major challenge encountered during testing is the rate limit imposed by OpenAI when a large number of examples (e.g., 100) are used.
Known Issue
Next Steps:
Could you please provide feedback on these steps? Any suggestions or improvements are welcome. |
Amazing, @KutalVolkan! I'll take a closer look by Monday at the latest. Don't worry about 429s. This sort of attack requires a larger context window, but most recent models have one. GPT-4-32k has a 32k-token context, for example. For the rate limit, we've recently added retries; you can search the codebase for the retry logic. If you create a draft PR it's easier to give detailed feedback. Just a suggestion, depending on whether you have everything on a branch yet. Again, great progress, and I'll get back to you shortly.
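For illustration, here is a minimal sketch of retrying on a 429 with exponential backoff, similar in spirit to the retry logic mentioned above. The RateLimitError class and the send function are stand-ins, not PyRIT or OpenAI classes:

```python
import time


class RateLimitError(Exception):
    # Stand-in for an HTTP 429 error from the target endpoint.
    pass


def send_with_retries(send_fn, max_retries=5, base_delay=1.0):
    # Retry the call with exponential backoff on rate-limit errors;
    # re-raise once the retry budget is exhausted.
    for attempt in range(max_retries):
        try:
            return send_fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Simulated flaky endpoint: fails twice with a 429, then succeeds.
calls = {"n": 0}


def flaky_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


result = send_with_retries(flaky_send, base_delay=0.0)
```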
…zure#267)
- Updated 7_many_shot_jailbreak.py to use the new fetch_many_shot_jailbreaking_examples function and added a scorer.
- Updated 8_test_seclists_bias_testing.py to use the new fetch_seclists_bias_testing_examples function and added a scorer.
- Implemented caching logic in fetch_examples.py to enhance efficiency.
- Added pytest tests for fetch_examples and many_shot_template.
- Improved code readability and maintainability.
Is your feature request related to a problem? Please describe.
Many-shot jailbreaking as described in https://www.anthropic.com/research/many-shot-jailbreaking is not yet available in PyRIT.
Describe the solution you'd like
From a first look, it seems like all we'd need to support this is a set of (let's say 256 or more) Question/Answer pairs like in the paper.
Describe alternatives you've considered, if relevant
It's worth checking if they made it available somewhere or if there's such a Q/A dataset already.
Additional context