FEAT add many-shot jailbreaking #130
Hi @romanlutz, I'd like to help out with implementing the many-shot jailbreaking feature. I'll read the paper, and if your suggestion about needing 256+ Q/A pairs seems to be correct, I'll start with that. Since this will be my first time contributing to an open-source project, could you please provide some guidance on the general steps for contributing? Thanks!
Hi @KutalVolkan ! Thanks for reaching out! We'd love to collaborate on this one 🙂 I see this as two tasks really:
For the former, we have prompt templates under PyRIT/datasets/prompt_templates. Perhaps it's possible to write one that has one placeholder for where the examples would go, but then have a new subclass of PromptTemplate that can insert all the examples rather than just one? Something like
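A minimal sketch of what such a subclass might look like. The class and method names here (ManyShotTemplate, apply_parameters) are illustrative assumptions, not actual PyRIT API; the point is just a template with a single placeholder that expands into many Q/A pairs:

```python
from dataclasses import dataclass


@dataclass
class ManyShotTemplate:
    # Hypothetical sketch, not PyRIT's real PromptTemplate class.
    # The template has one {{ examples }} placeholder for all Q/A pairs.
    template: str

    def apply_parameters(self, prompt: str, examples: list[dict]) -> str:
        # Render each Q/A pair as a faked dialogue turn and join them.
        rendered = "\n".join(
            f"User: {ex['question']}\nAssistant: {ex['answer']}"
            for ex in examples
        )
        filled = self.template.replace("{{ examples }}", rendered)
        return filled.replace("{prompt}", prompt)


template = ManyShotTemplate(
    template="{{ examples }}\nUser: {prompt}\nAssistant:"
)
result = template.apply_parameters(
    prompt="final question",
    examples=[
        {"question": "Q1", "answer": "A1"},
        {"question": "Q2", "answer": "A2"},
    ],
)
```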
Where examples would be the Q&A pairs. And then a simple orchestrator like PromptSendingOrchestrator could handle sending it to targets. For the latter, we don't really want to become the place where all the bad stuff from the internet is collected 😄 Ideally, we would want to find these in another repository and just have an import function. Plus, people can always generate or write their own set, of course. Regarding contributing guidelines there should be plenty in the doc folder. Please let me know if you have questions or want to chat about any of these points! I may very well have skipped something...
Hi @romanlutz, I'll start by reading the paper and then implement the many-shot jailbreaking feature as you described. I'll keep you updated on my progress. Thanks!
Fantastic! I guess I made an assumption here that the "many shots" are just in one prompt. Another option would be to "fake" the conversation history, which is possible with some plain model endpoints but rather unlikely with full generative AI systems (which should prevent you from doing that). So I think I'd go with the single prompt, and hence the prompt template makes sense. Happy to discuss options, though!
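To make the "faked conversation history" alternative concrete, here is a small sketch of how the Q/A pairs could be injected as prior user/assistant turns against a raw chat endpoint. The message schema shown is the common OpenAI-style chat format; no real endpoint is called, and the function name is a made-up placeholder:

```python
def build_faked_history(examples: list[dict], final_prompt: str) -> list[dict]:
    # Hypothetical helper: each Q/A pair becomes a prior user/assistant
    # exchange, so the model "sees" many compliant turns before the
    # final prompt. Only works on endpoints that accept arbitrary
    # assistant-role messages in the history.
    messages = []
    for ex in examples:
        messages.append({"role": "user", "content": ex["question"]})
        messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append({"role": "user", "content": final_prompt})
    return messages


history = build_faked_history(
    [{"question": "Q1", "answer": "A1"}], "final question"
)
```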
Hello @romanlutz, I just wanted to inform you that, according to the paper, we can use this uncensored model: WizardLM-13B-Uncensored. We can use it to provide answers to the following questions in the "behavior" column of this dataset: harmbench_behaviors_text_all.csv. I tried to run the model locally and encountered an issue:
This issue is likely not solvable according to this discussion: GitHub Issue. Therefore, I thought about using the inference endpoints from Hugging Face instead. P.S. Your approach of using a single prompt definitely makes sense, and I will go with that.
We usually use model endpoints in Azure, so I can't comment much on running locally. Maybe using an earlier version of torch helps? PyRIT shouldn't be too opinionated on which one you use. The list of prompts you found makes sense. Still, we'd have to check in the responses somewhere. As mentioned before, I'd prefer to avoid making PyRIT the place where all the bad stuff on the internet is collected. Maybe it makes sense to put that Q&A dataset in a separate repo from where we can import it? Just thinking out loud here...
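A rough sketch of what such an import function might look like. The JSON schema and function name are assumptions (the real dataset's format hadn't been settled at this point); parsing is shown against an inline sample so the sketch runs without network access:

```python
import json


def parse_many_shot_examples(raw_json: str) -> list[dict]:
    # Hypothetical parser for a Q/A dataset fetched from a separate
    # repo. Assumes a top-level "examples" list with "question" and
    # "answer" keys; the real schema may differ.
    data = json.loads(raw_json)
    return [
        {"question": item["question"], "answer": item["answer"]}
        for item in data["examples"]
    ]


# Inline sample standing in for a downloaded file.
sample = '{"examples": [{"question": "Q1", "answer": "A1"}]}'
examples = parse_many_shot_examples(sample)
```

In practice the raw JSON would be fetched from the external repo (and ideally cached locally) before parsing.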
Hey @romanlutz, The dataset is ready, and I will place the Q&A dataset in a separate repo. However, I will need some time to implement everything. I have a deadline on June 20, so I aim to have it all (implementation and dataset) completed by the end of June. Thanks for your patience! |
Amazing, @KutalVolkan ! No pressure, of course. I'll try to provide timely feedback as usual. If you have questions please feel free to reach out. |
Hi @romanlutz I have completed the code implementation for the many-shot jailbreaking feature as discussed earlier. Code Integration
Dataset Integration
Important Links
Testing Phase
The feature is currently undergoing testing. One major challenge encountered during testing is the rate limit imposed by OpenAI when a large number of examples (e.g., 100) are used.
Known Issue
Next Steps:
Could you please provide feedback on these steps? Any suggestions or improvements are welcome. |
Amazing, @KutalVolkan! I'll take a closer look by Monday at the latest. Don't worry about 429s. This sort of attack requires a larger context window, but most recent models have one. GPT-4-32k has a 32k-token context, for example. For the rate limit, we've recently added retries; you can search the codebase for the retry logic. If you create a draft PR it's easier to give detailed feedback. Just a suggestion, depending on whether you have everything on a branch yet. Again, great progress, and I'll get back to you shortly.
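For illustration, here is a minimal sketch of retrying on a 429 with exponential backoff, similar in spirit to the retry logic mentioned above. The RateLimitError class and the send function are stand-ins, not PyRIT or OpenAI classes:

```python
import time


class RateLimitError(Exception):
    # Stand-in for an HTTP 429 error from the target endpoint.
    pass


def send_with_retries(send_fn, max_retries=5, base_delay=1.0):
    # Retry the call with exponential backoff on rate-limit errors;
    # re-raise once the retry budget is exhausted.
    for attempt in range(max_retries):
        try:
            return send_fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Simulated flaky endpoint: fails twice with a 429, then succeeds.
calls = {"n": 0}


def flaky_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


result = send_with_retries(flaky_send, base_delay=0.0)
```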
…zure#267)
- Updated 7_many_shot_jailbreak.py to use the new fetch_many_shot_jailbreaking_examples function and added a scorer.
- Updated 8_test_seclists_bias_testing.py to use the new fetch_seclists_bias_testing_examples function and added a scorer.
- Implemented caching logic in fetch_examples.py to enhance efficiency.
- Added pytest tests for fetch_examples and many_shot_template.
- Improved code readability and maintainability.
Is your feature request related to a problem? Please describe.
Many-shot jailbreaking as described in https://www.anthropic.com/research/many-shot-jailbreaking is not yet available in PyRIT.
Describe the solution you'd like
From a first look, it seems like all we'd need to support this is a set of (let's say 256 or more) Question/Answer pairs like in the paper.
Describe alternatives you've considered, if relevant
It's worth checking if they made it available somewhere or if there's such a Q/A dataset already.
Additional context