Add red teaming orchestrators to replace RedTeamingBot #84

romanlutz · 2024-03-05T07:37:56Z

Description

This PR adds the BaseRedTeamingOrchestrator abstract class as well as two implementation classes ScoringRedTeamingOrchestrator and EndTokenRedTeamingOrchestrator. With these changes, a few others were necessary and are therefore part of this PR:

Introduced the AttackStrategy that encapsulates the functionality previously spread over 3 input args of RedTeamingBot (attack_strategy, attack_strategy_kwargs, and conversation_objective) and allows for just string inputs as strategies, too.
Every reference to RedTeamingBot needed to be updated and all the logic replaced. This includes several notebooks and documentation files.
Added GandalfTarget which does not support async.

What is not included here?

Detection of a second conversation (an orchestrator should only be used for 1 right now)
support for 1-to-many converters (we currently raise an exception in such a case)

Tests

no new tests required
new tests added
existing tests adjusted

Documentation

no documentation changes needed
documentation added or edited
example notebook added or updated

…l need updates

pyrit/orchestrator/red_teaming_orchestrator.py

…redteamingbot_orchestrator

pyrit/prompt_target/aml_online_endpoint_chat_target.py

…ng and end-token support.

…redteamingbot_orchestrator

rlundeen2

Great work @romanlutz! Love the architecture and separation. Really good design

Only thing missing is tests, but everything looks good to me

pyrit/orchestrator/base_red_teaming_orchestrator.py

pyrit/orchestrator/end_token_red_teaming_orchestrator.py

pyrit/orchestrator/base_red_teaming_orchestrator.py

romanlutz · 2024-03-09T03:22:15Z

Great work @romanlutz! Love the architecture and separation. Really good design

Only thing missing is tests, but everything looks good to me

Thanks! I've been looking into tests the last couple of days. There are a few things that we still need to settle on, including

The memory now includes two conversations (the one with RT bot and the one with the target) with separate conversation IDs. When we run get_memory() on the orchestrator, what do we actually expect in return? Just the conversation with the target? Or both?
If you provide converters that add more prompts, then we suddenly have multiple conversations to manage. What is the expected outcome? For example, the red teaming orchestrator starts with 1 prompt, the converter makes it 2 (the original and a rewritten one). Now we have two conversations with the target. Should it continue both conversations? If so, then on the next go-around we have 4. This could potentially explode if there are too many turns. I'm leaning towards adding a validation of input converters. It would run a prompt through the converters and raise an exception if they generate more than 1. Anyway, we need to talk about this one...
Do we expect people to reuse their orchestrator objects? If so, then the conversation ID will not get reset (at least in the current implementation), meaning that we can't rely on the memory to tell us if we're at the start of a conversation (as it contains prior ones). This is important for deciding what to send as the initial prompt, but also more generally.

rlundeen2 · 2024-03-11T20:53:44Z

Great work @romanlutz! Love the architecture and separation. Really good design
Only thing missing is tests, but everything looks good to me

Thanks! I've been looking into tests the last couple of days. There are a few things that we still need to settle on, including

The memory now includes two conversations (the one with RT bot and the one with the target) with separate conversation IDs. When we run get_memory() on the orchestrator, what do we actually expect in return? Just the conversation with the target? Or both?

I would say get_memory() is not a required function, so depends on the orchestrator. But to me it makes the most sense to just return the conversation with the target?

If you provide converters that add more prompts, then we suddenly have multiple conversations to manage. What is the expected outcome? For example, the red teaming orchestrator starts with 1 prompt, the converter makes it 2 (the original and a rewritten one). Now we have two conversations with the target. Should it continue both conversations? If so, then on the next go-around we have 4. This could potentially explode if there are too many turns. I'm leaning towards adding a validation of input converters. It would run a prompt through the converters and raise an exception if they generate more than 1. Anyway, we need to talk about this one...

Good point. We may want to add a constraint on certain types of converters so they can only do one. But let's design it.

Do we expect people to reuse their orchestrator objects? If so, then the conversation ID will not get reset (at least in the current implementation), meaning that we can't rely on the memory to tell us if we're at the start of a conversation (as it contains prior ones). This is important for deciding what to send as the initial prompt, but also more generally.

I think this is okay. If we run across use cases where folks want to reuse it, we can add that functionality, but I'd expect it to be manual e.g. reset_target_conversation or something

…redteamingbot_orchestrator

pyrit/memory/file_memory.py

pyrit/models.py

pyrit/orchestrator/base_red_teaming_orchestrator.py

pyrit/prompt_converter/ascii_art_converter.py

pyrit/prompt_converter/string_join_converter.py

pyrit/memory/file_memory.py

pyrit/orchestrator/base_red_teaming_orchestrator.py

…redteamingbot_orchestrator

Will need to address these at a later point and will follow up with Gary. Would prefer to unblock the team by merging this tonight if possible.

working commit with notebooks; tests, docs, and other references stil…

c8722e5

…l need updates

rlundeen2 reviewed Mar 5, 2024

View reviewed changes