FEAT: Crescendo part 1 (scorers) #203
Conversation
@@ -2,7 +2,7 @@
"cells": [
{
can remove this file for now
@@ -113,3 +115,10 @@ def _validate(self, scorer_type, score_value):
            raise ValueError(f"Float scale scorers must have a score value between 0 and 1. Got {score_value}")
        except ValueError:
            raise ValueError(f"Float scale scorers require a numeric score value. Got {score_value}")
    elif scorer_type == "severity":
        try:
I think we can use float_scale for this; there is a method to normalize any value to one between 0.0 and 1.0
IMO we should delete this
I don't think we should. The documented severities are specific int values. While we can represent them as floats (e.g. 2 -> 0.02), I think it would be confusing for a caller.
# These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms.
# However, a recent line of attacks, known as "jailbreaks", seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow
# the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo.
# Unlike existing jailbreak methods, Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner.
This looks like a great start!
But can we remove all non-scoring pieces from this PR? E.g. this, and anything with the orchestrator.
E.g. remove this, the orchestrator, etc. (you can keep the code for future PRs; just branch off of your current branch and remove them from the current PR)
description: |
  A variant of the crescendo attack technique
harm_category: NA
author: Ahmed Salem
I'd remove these for now
@@ -264,6 +264,13 @@ def __str__(self):
        return self.strategy.apply_custom_metaprompt_parameters(**self.kwargs)


class ToolCall(BaseModel):
it's not clear to me what this is
        self._harm_category = harm_category

        if azure_content_safety_key is not None and azure_content_safety_endpoint is not None:
Can we add env keys for this and get the default values? See most of our targets for examples of how to do this. It should also be added to our env_example :)
e.g.
self._api_key = default_values.get_required_value(
    env_var_name=self.API_KEY_ENVIRONMENT_VARIABLE, passed_value=api_key
)
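A minimal sketch of the pattern this comment describes, under stated assumptions: prefer an explicitly passed value, fall back to an environment variable, and fail loudly when neither is set. The helper name mirrors the snippet above but is an assumption, not PyRIT's actual implementation.

```python
import os

# Hypothetical sketch of a get_required_value-style helper: an explicitly
# passed value wins, otherwise the named environment variable is consulted,
# otherwise a ValueError is raised.
def get_required_value(*, env_var_name: str, passed_value: str = "") -> str:
    if passed_value:
        return passed_value
    env_value = os.environ.get(env_var_name, "")
    if env_value:
        return env_value
    raise ValueError(f"No value passed and environment variable {env_var_name} is not set")
```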
request = AnalyzeTextOptions(text=request_response.converted_value, output_type="EightSeverityLevels")
# Analyze the text and get the results for the specified category
we should use AnalyzeTextAsync and await the method
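A minimal sketch of the async pattern being suggested, with a stub standing in for the SDK client (the Python SDK exposes an async client under `azure.ai.contentsafety.aio`; the request/response shapes below are illustrative assumptions, not the SDK's types):

```python
import asyncio

# Stub standing in for an async Content Safety client; a real client would
# perform a network call in analyze_text.
class FakeAsyncContentSafetyClient:
    async def analyze_text(self, request: dict) -> dict:
        return {"categories_analysis": [{"category": "Hate", "severity": 2}]}

async def score_text(client, text: str, harm_category: str) -> int:
    # Await the analysis rather than blocking the event loop on a sync call.
    request = {"text": text, "output_type": "EightSeverityLevels"}
    response = await client.analyze_text(request)
    result = next(
        (item for item in response["categories_analysis"] if item["category"] == harm_category),
        None,
    )
    return result["severity"] if result is not None else 0

severity = asyncio.run(score_text(FakeAsyncContentSafetyClient(), "example text", "Hate"))
```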
"""
self.validate(request_response)

request = AnalyzeTextOptions(text=request_response.converted_value, output_type="EightSeverityLevels")
It would be nice to do images here too, and just use AnalyzeImage depending on the request_response data type
(not a blocker, but nice to have)
# Analyze the text and get the results for the specified category

response = self._azureCFClient.analyze_text(request)
result = next((item for item in response.categories_analysis if item.category == self._harm_category), None)
Set score_type to "float_scale"
And for score_value call this (I think they range from 1 to 8?)
score_value = self.scale_value_float(float(parsed_response["score_value"]), 1, 8)
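A sketch of what a `scale_value_float`-style normalizer does, assuming a simple linear mapping (note the reviewer's "1 to 8" is a guess; Azure's eight-level text output actually uses severities 0 to 7):

```python
# Linearly map a raw value from [min_value, max_value] onto [0.0, 1.0],
# which is the range a float_scale scorer expects. Illustrative sketch,
# not PyRIT's actual helper.
def scale_value_float(value: float, min_value: float, max_value: float) -> float:
    if max_value == min_value:
        return 0.0
    return (value - min_value) / (max_value - min_value)

scale_value_float(7, 0, 7)  # a maximal severity normalizes to 1.0
```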
score = Score(
    score_type="severity",
    score_value=result.severity,
    score_value_description="severity",
description can be "none" or, to be consistent, whatever Azure defines as the specific description for each of the levels
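A sketch of the description mapping this comment suggests, assuming Azure's documented four-band labels for the eight-level scale (0-1 Safe, 2-3 Low, 4-5 Medium, 6-7 High); this is based on the public docs, not a constant exposed by the SDK:

```python
# Assumed mapping of Azure Content Safety's eight-level severities (0-7)
# onto the four labels Azure documents. Treat this as a sketch, not the
# SDK's own mapping.
SEVERITY_DESCRIPTIONS = {
    0: "Safe", 1: "Safe",
    2: "Low", 3: "Low",
    4: "Medium", 5: "Medium",
    6: "High", 7: "High",
}

def describe_severity(severity: int) -> str:
    return SEVERITY_DESCRIPTIONS.get(severity, "Unknown")
```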
from pyrit.models import PromptRequestPiece


class AzureContentFilter(Scorer):
I'd rename to AzureContentFilterScorer (and rename this class)
For this PR, you may want to just have "azure_content_filter_scorer". That is a concrete piece of work that is close to done and would immediately provide value. Then you could title this PR "Azure Content Filter Scorer"
In the second PR I would include the other three scorers: eval_prompt, meta_judge_prompt, and refuse_judge_prompt
I really like these three prompts eval_prompt, meta_judge_prompt, and refuse_judge_prompt. Broadly, I think they should be broken up into two separate scorers
- SelfAskVerifyScore, which contains the meta_judge prompt and is a TrueFalse scorer
- SelfAskConversationObjectiveScorer, which I would also probably make a TrueFalse scorer, putting the value from 0 to 1 in metadata. I like that result_percentage; we may add something like that to all TrueFalse scorers, but for now I think metadata will allow you to do operations on it.
IMO this would allow us to reuse these, and each of them will be useful in other scoring situations.
There are some nits too, like the prompts should go in datasets.
I also really like the various crescendo scorers. I would make that the second PR. Broadly, I think they should be broken up into three separate scorers
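A minimal sketch of the TrueFalse-scorer-with-metadata idea above, under stated assumptions (`TrueFalseScore` and `result_percentage` are illustrative names, not PyRIT's API):

```python
from dataclasses import dataclass, field

# A true/false score that keeps the continuous 0-1 result in metadata,
# so callers can still compute with the raw value.
@dataclass
class TrueFalseScore:
    score_value: bool
    score_metadata: dict = field(default_factory=dict)

def make_objective_score(result_percentage: float, threshold: float = 0.5) -> TrueFalseScore:
    # Collapse the continuous result to True/False while preserving the raw value.
    return TrueFalseScore(
        score_value=result_percentage >= threshold,
        score_metadata={"result_percentage": result_percentage},
    )
```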
Closing in favor of #275
Description
Implements the first part of the Crescendo attack strategy. So far this adds the relevant scorers and prompt templates.
Tests and Documentation
Added unit tests and scoring notebook sections to test out the scorers.