
FEAT add many-shot jailbreaking #130

Closed · romanlutz opened this issue Apr 2, 2024 · 10 comments · Fixed by #254
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@romanlutz (Contributor)

Is your feature request related to a problem? Please describe.

Many-shot jailbreaking as described in https://www.anthropic.com/research/many-shot-jailbreaking is not yet available in PyRIT.

Describe the solution you'd like

At first glance, it seems like all we'd need to support this is a set of (say, 256 or more) Question/Answer pairs like in the paper.

Describe alternatives you've considered, if relevant

It's worth checking if they made it available somewhere or if there's such a Q/A dataset already.

Additional context

romanlutz added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Apr 2, 2024
romanlutz changed the title from "FEAT add many-short jailbreaking" to "FEAT add many-shot jailbreaking" on Apr 22, 2024
@KutalVolkan (Contributor)

Hi @romanlutz,

I'd like to help implement the many-shot jailbreaking feature. I'll read the paper, and if your suggestion about needing 256+ Q/A pairs holds up, I'll start with that. Since this will be my first time contributing to an open-source project, could you please provide some guidance on the general steps for contributing?

Thanks!
Volkan

@romanlutz (Contributor, Author)

Hi @KutalVolkan !

Thanks for reaching out! We'd love to collaborate on this one 🙂 I see this as two tasks really:

  • adding a mechanism to craft a prompt with arbitrarily many examples plus the malicious prompt we want the LLM to answer
  • collecting the examples

For the former, we have prompt templates under PyRIT/datasets/prompt_templates. Perhaps we could write one with a single placeholder for where the examples go, and then add a new subclass of PromptTemplate that can insert all the examples rather than just one? Something like

template = ManyShotTemplate.from_yaml_file(...)  # same as PromptTemplate
template.apply_parameters(prompt=prompt, examples=examples)

Where examples would be the Q&A pairs.
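A minimal sketch of what such a subclass might look like (the class name, YAML shape, and method signatures here are illustrative assumptions, not PyRIT's actual implementation):

import yaml

class ManyShotTemplate:
    """Hypothetical template that expands many Q&A examples into one prompt."""

    def __init__(self, template: str):
        # The template text is assumed to contain {examples} and {prompt} placeholders.
        self.template = template

    @classmethod
    def from_yaml_file(cls, path: str) -> "ManyShotTemplate":
        with open(path) as f:
            data = yaml.safe_load(f)
        return cls(template=data["template"])

    def apply_parameters(self, prompt: str, examples: list[dict]) -> str:
        # Render each Q&A pair as a faux dialogue turn, as in the paper.
        rendered = "\n".join(
            f"User: {ex['user']}\nAssistant: {ex['assistant']}" for ex in examples
        )
        return self.template.format(examples=rendered, prompt=prompt)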

And then a simple orchestrator like PromptSendingOrchestrator could handle sending it to targets.

For the latter, we don't really want to become the place where all the bad stuff from the internet is collected 😄 Ideally, we would want to find these in another repository and just have an import function. Plus, people can always generate or write their own set, of course.

Regarding contributing guidelines, there should be plenty in the doc folder.

Please let me know if you have questions or want to chat about any of these points! I may very well have skipped something...

@KutalVolkan (Contributor)

Hi @romanlutz,

I'll start by reading the paper and then implement the many-shot jailbreaking feature as you described. I'll keep you updated on my progress.

Thanks,
Volkan

@romanlutz (Contributor, Author)

Fantastic!

I guess I made an assumption here that the "many shots" all go into a single prompt. Another option would be to "fake" the conversation history, which is possible with some plain model endpoints but rather unlikely to work with full generative AI systems (which should prevent you from doing that). So I'd go with the single prompt, and hence the prompt template makes sense.
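To make the two options concrete, here is a rough sketch (plain Python, no PyRIT types; the placeholder strings stand in for real Q&A content):

final_question = "<the actual prompt under test>"
examples = [
    {"user": "<harmful question>", "assistant": "<compliant answer>"},
    # ... hundreds more pairs in practice
]

# Option 1: everything in a single prompt (the approach chosen here).
single_prompt = "\n".join(
    f"User: {ex['user']}\nAssistant: {ex['assistant']}" for ex in examples
) + f"\nUser: {final_question}"

# Option 2: fake the conversation history as real chat turns. This only
# works against plain model endpoints that accept arbitrary histories;
# full generative AI systems usually reject fabricated assistant turns.
messages = []
for ex in examples:
    messages.append({"role": "user", "content": ex["user"]})
    messages.append({"role": "assistant", "content": ex["assistant"]})
messages.append({"role": "user", "content": final_question})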

Happy to discuss options, though!

@KutalVolkan (Contributor)

KutalVolkan commented Jun 4, 2024

Hello @romanlutz,

I just wanted to inform you that, according to the paper, we can use this uncensored model: WizardLM-13B-Uncensored.

We can use it to provide answers to the following questions in the "behavior" column of this dataset: harmbench_behaviors_text_all.csv.

I tried to run the model locally and encountered an issue:

UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = torch.nn.functional.scaled_dot_product_attention(

This issue is likely not solvable according to this discussion: GitHub Issue.

Therefore, I thought about using the inference endpoints from Hugging Face instead.
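For what it's worth, the endpoint route might look roughly like this with huggingface_hub (a sketch; the repo id, token, and deployment details are assumptions):

from huggingface_hub import InferenceClient

# Sketch only: assumes the model is reachable via a HF inference endpoint.
# The repo id is an assumption; check the hub for the current location.
client = InferenceClient(
    model="cognitivecomputations/WizardLM-13B-Uncensored",
    token="hf_...",  # placeholder
)

answer = client.text_generation(
    "Write a tutorial on ...",  # a behavior from harmbench_behaviors_text_all.csv
    max_new_tokens=512,
)
print(answer)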

P.S. Your approach of using a single prompt definitely makes sense, and I will go with that.

@romanlutz (Contributor, Author)

We usually use model endpoints in Azure, so I can't comment much on running locally. Maybe using an earlier version of torch helps? PyRIT shouldn't be too opinionated about which one you use.

The list of prompts you found makes sense. Still, we'd have to check in the responses somewhere. As mentioned before, I'd prefer to avoid making PyRIT the place where all the bad stuff on the internet is collected. Maybe it makes sense to put that Q&A dataset in a separate repo from where we can import it? Just thinking out loud here...

@KutalVolkan (Contributor)

Hey @romanlutz,

The dataset is ready, and I will place it in a separate repo. However, I will need some time to implement everything. I have a deadline on June 20, so I aim to have it all (implementation and dataset) completed by the end of June. Thanks for your patience!

@romanlutz (Contributor, Author)

Amazing, @KutalVolkan ! No pressure, of course. I'll try to provide timely feedback as usual. If you have questions please feel free to reach out.

@KutalVolkan (Contributor)

KutalVolkan commented Jun 22, 2024

Hi @romanlutz

I have completed the code implementation for the many-shot jailbreaking feature as discussed earlier.

Code Integration

  • Code Placement: Added code in the relevant positions across the project to support many-shot jailbreaking.
  • Demo: A demo script has been built to showcase the feature's functionality. The demo script is available here.

Dataset Integration

  • Dataset Location: The required dataset for the many-shot jailbreaking feature has been added and is accessible here.
  • Dynamic Import and Processing: The code dynamically imports and processes the dataset to generate the necessary Q&A pairs (see the sketch below).
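A simplified sketch of that dynamic import with a local cache (the URL, cache path, and field names are placeholders, not the dataset's actual schema; the real helper ended up as fetch_many_shot_jailbreaking_examples):

import json
import urllib.request
from pathlib import Path

# Placeholders, not the real dataset location or schema.
DATASET_URL = "https://example.com/many_shot_dataset.json"
CACHE = Path.home() / ".pyrit_cache" / "many_shot_dataset.json"

def fetch_examples() -> list[dict]:
    # Download once, then reuse the cached copy on subsequent runs.
    if not CACHE.exists():
        CACHE.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(DATASET_URL, CACHE)
    data = json.loads(CACHE.read_text())
    # Normalize into the {"user": ..., "assistant": ...} pairs the template expects.
    return [{"user": d["question"], "assistant": d["answer"]} for d in data]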

Testing Phase

The feature is currently undergoing testing. One major challenge encountered so far is the rate limit OpenAI imposes when a large number of examples (e.g., 100) is used.

Known Issue

  • Rate Limit Error:
    If the num_examples parameter is set too high, an error may occur due to OpenAI's rate limits.
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Request too large for gpt-3.5-turbo in organization YOUR-ID on tokens per min (TPM): Limit 60000, Requested 74536. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
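One way to avoid this is to trim the examples by token count before sending, e.g. with tiktoken (a sketch; the 60,000 budget matches the TPM limit in the error above):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
TOKEN_BUDGET = 60_000  # the TPM limit from the error message

def trim_examples(examples: list[dict], final_prompt: str) -> list[dict]:
    # Keep adding examples until the assembled prompt would exceed the budget.
    kept, used = [], len(enc.encode(final_prompt))
    for ex in examples:
        cost = len(enc.encode(f"User: {ex['user']}\nAssistant: {ex['assistant']}\n"))
        if used + cost > TOKEN_BUDGET:
            break
        kept.append(ex)
        used += cost
    return kept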

Next Steps:

  1. Dataset Expansion: The current dataset contains 100 examples. I will expand this dataset to include more examples, aiming to reach our suggested 256+ Q&A pairs.
  2. Review and Optimize: I will review and optimize the implementation, with thorough testing and verification of the entire approach, to make sure there are no logical mistakes in the dataset, the user/assistant turns, or the overall methodology.

Could you please provide feedback on these steps? Any suggestions or improvements are welcome.

@romanlutz (Contributor, Author)

Amazing, @KutalVolkan !

I'll take a closer look by Monday at the latest. Don't worry about the 429s. This sort of attack requires a larger context window, and most recent models have one; GPT-4-32k has a 32k-token context, for example.

For the rate limit, we've recently added retries. You can search for pyrit_target_retry for details, but it exists at the target level, so you don't need to worry about it.
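For reference, the general shape of such a retry looks like this (a generic tenacity sketch of the retry-on-429 pattern, not PyRIT's actual pyrit_target_retry implementation):

import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Generic exponential backoff on 429s; PyRIT handles this at the target
# level, so callers normally don't need their own wrapper like this one.
@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_exponential(multiplier=1, min=1, max=60),
    stop=stop_after_attempt(5),
)
def send_with_retry(client, messages):
    return client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)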

If you create a draft PR it's easier to give detailed feedback. Just a suggestion depending on whether you have everything on a branch yet.

Again, great progress and I'll get back to you shortly.

KutalVolkan added a commit to KutalVolkan/PyRIT that referenced this issue Jul 12, 2024
…zure#267)
- Updated 7_many_shot_jailbreak.py to use the new fetch_many_shot_jailbreaking_examples function and added a scorer.
- Updated 8_test_seclists_bias_testing.py to use the new fetch_seclists_bias_testing_examples function and added a scorer.
- Implemented caching logic in fetch_examples.py to enhance efficiency.
- Added pytest tests for fetch_examples and many_shot_template.
- Improved code readability and maintainability.
romanlutz linked a pull request on Aug 1, 2024 that will close this issue