[New Features] Multi-modal Jailbreaking Attack on LLaVA #587

Merged
merged 62 commits into from
May 3, 2024

Conversation

DavidLee528
Contributor

@DavidLee528 DavidLee528 commented Apr 9, 2024

Thanks for your review.

This PR includes:

  • New Generator: LLaVA (Image+Text->Text)
  • New multi-modal jailbreak attack probe
  • New multi-modal jailbreak attack detector

Please let me know if there is anything I can do better.

@DavidLee528 DavidLee528 changed the title [New Feature] Multi-modal Jailbreaking Attack on LLaVA [New Features] Multi-modal Jailbreaking Attack on LLaVA Apr 9, 2024
@leondz leondz added the probes (Content & activity of LLM probes), detectors (work on code that inherits from or manages Detector), generators (Interfaces with LLMs), and new plugin (Describes an entirely new probe, detector, generator or harness) labels Apr 9, 2024
@leondz
Owner

leondz commented Apr 9, 2024

Wow, thanks for this! We'll get it reviewed

.gitignore (outdated, resolved)
Collaborator

@erickgalinkin erickgalinkin left a comment

Looks pretty good generally. I think we want to put the generator into huggingface.py and add some checks to the probe so we can ensure we're operating against the right generator type.
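
A minimal sketch of the kind of check suggested here, assuming generators advertise the input modalities they accept. The class names and the `modality` attribute below are hypothetical stand-ins for illustration, not a statement of garak's actual API.

```python
class TextOnlyGenerator:
    """Hypothetical generator that accepts text prompts only."""

    modality = {"in": {"text"}, "out": {"text"}}


class ImageTextGenerator:
    """Hypothetical generator that accepts image+text prompts (e.g. LLaVA-style)."""

    modality = {"in": {"text", "image"}, "out": {"text"}}


class VisualProbeSketch:
    """Hypothetical probe that needs both text and image input."""

    required_inputs = {"text", "image"}

    def check_generator(self, generator) -> None:
        # Read the declared input modalities; treat a missing attribute as "none".
        supported = getattr(generator, "modality", {}).get("in", set())
        missing = self.required_inputs - supported
        if missing:
            raise ValueError(
                f"{type(generator).__name__} lacks input modalities: {sorted(missing)}"
            )
```

Calling `check_generator` before sending any prompts lets the probe fail early with a clear message instead of passing image prompts to a text-only model.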

garak/detectors/visual_jailbreak.py (outdated, resolved)
garak/generators/llava.py (outdated, resolved)
garak/probes/visual_jailbreak.py (resolved)
@DavidLee528
Contributor Author

I think the class name should be FigStep rather than VisualJailbreak in garak/detectors/visual_jailbreak.py and garak/probes/visual_jailbreak.py, because FigStep is only one kind of visual jailbreak attack, and I expect more visual jailbreak attacks to be added in the future. (65f94cc)

DavidLee528 and others added 6 commits April 30, 2024 22:31
Co-authored-by: Leon Derczynski <leonderczynski@gmail.com>
Signed-off-by: Tianhao Li <35065046+DavidLee528@users.noreply.github.com>
@DavidLee528 DavidLee528 requested a review from leondz April 30, 2024 14:49
Owner

@leondz leondz left a comment

Looks awesome, great idea to include FigStep/SafeBench.

The image collection is a little large, and also I want to be careful about including another project's code. Could we implement a pattern where:

  • The image files are not distributed with garak
  • When first run, the probe checks for the image files (perhaps in the resources/ directory)
  • If the image files are not there, we download from https://github.com/ThuCCSLab/FigStep/tree/main/data/images/SafeBench
  • The default probe uses 80 images
  • Optionally, a "Full" probe uses more images, but is deactivated by default. This can inherit from the 80 image version (or vice versa) - that's a pretty common pattern in garak
  • The FigStep paper needs to be cited/recognised somewhere in the garak code, maybe in a docstring for the FigStep probe? With a reference and paper link? This seems appropriate enough
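
The check-then-download steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was merged: `ensure_images`, `fetch_url`, and their parameters are hypothetical names, and `base_url` is assumed to point at a location serving raw files (the GitHub `tree/` link above is a browsing page, not a direct download URL).

```python
from pathlib import Path
from typing import Callable, Iterable, List
import urllib.request


def fetch_url(url: str, dest: Path) -> None:
    """Download `url` to `dest` using only the standard library."""
    with urllib.request.urlopen(url) as resp:
        dest.write_bytes(resp.read())


def ensure_images(
    cache_dir: Path,
    filenames: Iterable[str],
    base_url: str,
    fetch: Callable[[str, Path], None] = fetch_url,
) -> List[Path]:
    """Return local paths for `filenames`, downloading only the missing ones."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in filenames:
        dest = cache_dir / name
        if not dest.exists():
            # First run (or a deleted file): fetch from the external repo.
            fetch(f"{base_url}/{name}", dest)
        paths.append(dest)
    return paths
```

Passing the downloader in as `fetch` keeps the cache logic testable without network access, and a second call with the same arguments performs no downloads.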

Thanks so much for working on this, this is very close and will be a big change when it hits. Really appreciate your contributions!

garak/resources/visual_jailbreak/SafeBench/screenshots.zip (outdated, resolved)
@DavidLee528
Contributor Author

The FigStep paper needs to be cited/recognised somewhere in the garak code, maybe in a docstring for the FigStep probe? With a reference and paper link? This seems appropriate enough

The FigStep paper title, arXiv link, and an ACM-format reference were added to the docstring in 39c57a2.

@DavidLee528
Contributor Author

  • The default probe uses 80 images
  • Optionally, a "Full" probe uses more images, but is deactivated by default. This can inherit from the 80 image version (or vice versa) - that's a pretty common pattern in garak

Full (size 500) and small (size 80) versions of the FigStep probe class were added in de6c6ef.

Additionally, size-check logic for self.prompts was added to those two classes in 6784255 and ce23255 respectively.
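
One way the small/full pairing with a size check might look, as a rough sketch only. The class names, the `active` and `expected_size` attributes, and the "take the first N prompts" selection rule are assumptions made for illustration; the commits referenced above may do this differently.

```python
class FigStepSketch:
    """Hypothetical default variant: 80 prompts, enabled by default."""

    active = True
    expected_size = 80

    def __init__(self, all_prompts):
        self.prompts = self.select(all_prompts)
        # Size check: fail loudly if the dataset is incomplete.
        if len(self.prompts) != self.expected_size:
            raise ValueError(
                f"expected {self.expected_size} prompts, got {len(self.prompts)}"
            )

    def select(self, all_prompts):
        # Assumption: the subset is simply the first `expected_size` prompts.
        return list(all_prompts)[: self.expected_size]


class FigStepFullSketch(FigStepSketch):
    """Hypothetical full variant: 500 prompts, disabled by default."""

    active = False
    expected_size = 500
```

Because the full variant only overrides two class attributes, the loading and size-check logic lives in one place.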

@leondz leondz self-requested a review May 2, 2024 10:45
Owner

@leondz leondz left a comment

Detector needs improvement, but this is probably best done with LLM-as-a-judge + the FigStep "instruction" columns (cf e.g. https://github.com/ThuCCSLab/FigStep/blob/main/data/question/SafeBench-Tiny.csv)
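
A rough sketch of what an LLM-as-a-judge detector could look like, assuming the judge is any callable that maps a prompt string to a text verdict. `JUDGE_TEMPLATE`, `judge_outputs`, and the YES/NO convention are hypothetical choices for illustration; nothing here is taken from garak or the FigStep repository.

```python
from typing import Callable, List

# Hypothetical judge prompt; the wording is an assumption, not a tested template.
JUDGE_TEMPLATE = (
    "Instruction given to the model:\n{instruction}\n\n"
    "Model response:\n{response}\n\n"
    "Does the response comply with the instruction? Answer YES or NO."
)


def judge_outputs(
    instruction: str,
    outputs: List[str],
    judge: Callable[[str], str],
) -> List[float]:
    """Score each output 1.0 if the judge says the model complied, else 0.0."""
    scores = []
    for response in outputs:
        verdict = judge(
            JUDGE_TEMPLATE.format(instruction=instruction, response=response)
        )
        complied = verdict.strip().upper().startswith("YES")
        scores.append(1.0 if complied else 0.0)
    return scores
```

Here `instruction` would come from the dataset's "instruction" column, and `judge` would wrap a call to whichever model is used as the judge.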

Owner

We need to work on this, but I'm happy for that to be tracked in a separate issue/PR

Owner

pinging llm-as-a-judge issue: #419

Contributor Author

Yes! We are working on this now.

Contributor Author

@DavidLee528 DavidLee528 left a comment

We can now download two different versions (tiny and full) of the SafeBench dataset from the external repo!

garak/probes/visual_jailbreak.py (resolved)
garak/probes/visual_jailbreak.py (outdated, resolved)
@leondz leondz merged commit bf57db6 into leondz:main May 3, 2024
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators May 3, 2024
Labels
detectors (work on code that inherits from or manages Detector), generators (Interfaces with LLMs), new plugin (Describes an entirely new probe, detector, generator or harness), probes (Content & activity of LLM probes)
4 participants