FEAT: Crescendo part 1 (scorers) #203
Conversation
@@ -2,7 +2,7 @@
"cells": [
{
can remove this file for now
@@ -113,3 +115,10 @@ def _validate(self, scorer_type, score_value):
            raise ValueError(f"Float scale scorers must have a score value between 0 and 1. Got {score_value}")
        except ValueError:
            raise ValueError(f"Float scale scorers require a numeric score value. Got {score_value}")
    elif scorer_type == "severity":
        try:
I think we can use float_scale for this; there is a method to normalize any value to one between 0.0 and 1.0
IMO we should delete this
I don't think we should. The documented severities are specific int values. While we can represent them as floats (e.g. 2 -> 0.02), I think it would be confusing for a caller.
# These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms.
# However, a recent line of attacks, known as "jailbreaks", seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow
# the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo.
# Unlike existing jailbreak methods, Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner.
This looks like a great start!
But can we remove all non-scoring pieces from this PR? E.g. this, and anything with the orchestrator.
E.g. remove this, the orchestrator, etc. (you can keep the code for future PRs; just branch off of your current branch and remove them from the current PR)
description: |
  A variant of the crescendo attack technique
harm_category: NA
author: Ahmed Salem
I'd remove these for now
@@ -264,6 +264,13 @@ def __str__(self):
        return self.strategy.apply_custom_metaprompt_parameters(**self.kwargs)


class ToolCall(BaseModel):
it's not clear to me what this is
        self._harm_category = harm_category

        if azure_content_safety_key is not None and azure_content_safety_endpoint is not None:
Can we add env keys for this and get the default values? See most of our targets for examples of how to do this. It should also be added to our env_example :)
e.g.
self._api_key = default_values.get_required_value(
    env_var_name=self.API_KEY_ENVIRONMENT_VARIABLE, passed_value=api_key
)
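A minimal sketch of the pattern this comment describes, under stated assumptions: prefer an explicitly passed value, fall back to an environment variable, and fail loudly when neither is set. The helper name mirrors the snippet above but is an assumption, not PyRIT's actual implementation.

```python
import os

# Hypothetical sketch of a get_required_value-style helper: an explicitly
# passed value wins, otherwise the named environment variable is consulted,
# otherwise a ValueError is raised.
def get_required_value(*, env_var_name: str, passed_value: str = "") -> str:
    if passed_value:
        return passed_value
    env_value = os.environ.get(env_var_name, "")
    if env_value:
        return env_value
    raise ValueError(f"No value passed and environment variable {env_var_name} is not set")
```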
request = AnalyzeTextOptions(text=request_response.converted_value, output_type="EightSeverityLevels")
# Analyze the text and get the results for the specified category
we should use AnalyzeTextAsync and await the method
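A minimal sketch of the async pattern being suggested, with a stub standing in for the SDK client (the Python SDK exposes an async client under `azure.ai.contentsafety.aio`; the request/response shapes below are illustrative assumptions, not the SDK's types):

```python
import asyncio

# Stub standing in for an async Content Safety client; a real client would
# perform a network call in analyze_text.
class FakeAsyncContentSafetyClient:
    async def analyze_text(self, request: dict) -> dict:
        return {"categories_analysis": [{"category": "Hate", "severity": 2}]}

async def score_text(client, text: str, harm_category: str) -> int:
    # Await the analysis rather than blocking the event loop on a sync call.
    request = {"text": text, "output_type": "EightSeverityLevels"}
    response = await client.analyze_text(request)
    result = next(
        (item for item in response["categories_analysis"] if item["category"] == harm_category),
        None,
    )
    return result["severity"] if result is not None else 0

severity = asyncio.run(score_text(FakeAsyncContentSafetyClient(), "example text", "Hate"))
```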
"""
self.validate(request_response)

request = AnalyzeTextOptions(text=request_response.converted_value, output_type="EightSeverityLevels")
It would be nice to do images here too, and just use AnalyzeImage depending on the request_response data type
(not a blocker, but nice to have)
# Analyze the text and get the results for the specified category

response = self._azureCFClient.analyze_text(request)
result = next((item for item in response.categories_analysis if item.category == self._harm_category), None)
Set score_type to "float_scale"
And for score_value call this (I think they range from 1 to 8?)
score_value = self.scale_value_float(float(parsed_response["score_value"]), 1, 8)
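A sketch of what a `scale_value_float`-style normalizer does, assuming a simple linear mapping (note the reviewer's "1 to 8" is a guess; Azure's eight-level text output actually uses severities 0 to 7):

```python
# Linearly map a raw value from [min_value, max_value] onto [0.0, 1.0],
# which is the range a float_scale scorer expects. Illustrative sketch,
# not PyRIT's actual helper.
def scale_value_float(value: float, min_value: float, max_value: float) -> float:
    if max_value == min_value:
        return 0.0
    return (value - min_value) / (max_value - min_value)

scale_value_float(7, 0, 7)  # a maximal severity normalizes to 1.0
```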
score = Score(
    score_type="severity",
    score_value=result.severity,
    score_value_description="severity",
description can be "none" or, to be consistent, whatever Azure defines as the specific description for each of the levels
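A sketch of the description mapping this comment suggests, assuming Azure's documented four-band labels for the eight-level scale (0-1 Safe, 2-3 Low, 4-5 Medium, 6-7 High); this is based on the public docs, not a constant exposed by the SDK:

```python
# Assumed mapping of Azure Content Safety's eight-level severities (0-7)
# onto the four labels Azure documents. Treat this as a sketch, not the
# SDK's own mapping.
SEVERITY_DESCRIPTIONS = {
    0: "Safe", 1: "Safe",
    2: "Low", 3: "Low",
    4: "Medium", 5: "Medium",
    6: "High", 7: "High",
}

def describe_severity(severity: int) -> str:
    return SEVERITY_DESCRIPTIONS.get(severity, "Unknown")
```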
from pyrit.models import PromptRequestPiece


class AzureContentFilter(Scorer):
I'd rename to AzureContentFilterScorer (and rename this class)
For this PR, you may want to just have "azure_content_filter_scorer". That is a concrete piece of work that is close to done and would immediately provide value. Then you could title this PR "Azure Content Filter Scorer"
In the second PR I would include the other three scorers: eval_prompt, meta_judge_prompt, and refuse_judge_prompt
I really like these three prompts eval_prompt, meta_judge_prompt, and refuse_judge_prompt. Broadly, I think they should be broken up into two separate scorers
- SelfAskVerifyScore, which contains the meta_judge prompt and is a TrueFalse scorer
- SelfAskConversationObjectiveScorer, which I would also probably make a TrueFalse scorer, putting the value from 0 to 1 in metadata. I like that result_percentage; we may add something like that to all TrueFalse scorers, but for now I think metadata will allow you to do operations on it.
IMO this would allow us to reuse these, and each of them will be useful in other scoring situations.
There are some nits too, like the prompts should go in datasets.
I also really like the various crescendo scorers. I would make that the second PR. Broadly, I think they should be broken up into three separate scorers
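A minimal sketch of the TrueFalse-scorer-with-metadata idea above, under stated assumptions (`TrueFalseScore` and `result_percentage` are illustrative names, not PyRIT's API):

```python
from dataclasses import dataclass, field

# A true/false score that keeps the continuous 0-1 result in metadata,
# so callers can still compute with the raw value.
@dataclass
class TrueFalseScore:
    score_value: bool
    score_metadata: dict = field(default_factory=dict)

def make_objective_score(result_percentage: float, threshold: float = 0.5) -> TrueFalseScore:
    # Collapse the continuous result to True/False while preserving the raw value.
    return TrueFalseScore(
        score_value=result_percentage >= threshold,
        score_metadata={"result_percentage": result_percentage},
    )
```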
Closing in favor of #275
Description
Implements the first part of the Crescendo attack strategy. So far this adds the relevant scorers and prompt templates.
Tests and Documentation
Added unit tests and scoring notebook sections to test out the scorers.