
RLLMv3 Dataset for Jailbreak Experiment

This repository contains the datasets used in the RLLMv3 experiments on GPT2XL, Phi-1.5, and Falcon-RW-1B, an experimental project focused on the alignment problem in language models. The project applies the BetterDAN jailbreak to various state-of-the-art (SOTA) language models, shows how effective it is at eliciting harmful responses, and introduces a potential mitigation: a modified GPT2XL model (GPT2XL_RLLMv3) that resists an average of 67.8% of jailbreak attempts. The mitigation relies on "Layered Morphology," a training approach aimed at aligning language models more closely with human values.

Warning

This repository contains content that many readers may find harmful or offensive. Reader discretion is advised; please proceed with caution.

Overview

  • Why Jailbreaks?: Exploring the importance of assessing language model safety and the effectiveness of jailbreaks as a method to test these models.
  • The BetterDAN Jailbreak: Details on how the BetterDAN Jailbreak method works and its application in testing SOTA models.
  • SOTA Models Tested: A list of the state-of-the-art models that were subjected to jailbreak attempts, including outcomes.
  • Jailbreak Attacks on GPT2XL: Specific examples of how GPT2XL responded to jailbreak attempts and the modifications that led to improved resistance.
  • Layered Morphology and RLLM: An introduction to Reinforcement Learning with Layered Morphology (RLLM) and its role in improving model resilience.

For more information, feel free to browse this Visual Map.
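If you want to inspect the dataset files locally before running your own experiments, a minimal sketch along these lines may help. The file name below is hypothetical; substitute the actual file paths in this repository, and adjust the parsing if a file is stored as JSON or CSV rather than plain text.

```python
# Minimal sketch for inspecting a plain-text dataset file from this repository.
# NOTE: "rllmv3_dataset.txt" is a hypothetical file name -- replace it with the
# actual path of the file you want to examine.
from pathlib import Path


def load_examples(path: str) -> list[str]:
    """Read a plain-text dataset file and return its non-empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    examples = load_examples("rllmv3_dataset.txt")
    print(f"Loaded {len(examples)} examples")
    print(examples[0][:200])  # preview the beginning of the first example
```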

Contributions and Feedback

I extend my apologies to the AI labs and owners of the language models affected in the course of this research. The alignment problem is a critical area of study, and your understanding of the motivations behind these experiments is greatly appreciated. Contributions and feedback are welcome, especially in discussions about improving ethical behavior and strengthening defenses against jailbreaks in language models.

For further discussions, insights, or questions, feel free to leave a comment or reach out through the repository's issues section.

Thank you! 😊
