
RLLMv3 Dataset for Jailbreak Experiment

This repository contains the datasets used in the RLLMv3 experiments on GPT2XL, Phi-1.5, and Falcon-RW-1B, an experimental project focused on the alignment problem in language models. The project applies the BetterDAN jailbreak to various state-of-the-art (SOTA) language models, shows how effective it is at eliciting harmful responses, and introduces a potential mitigation: a modified GPT2XL model (GPT2XL_RLLMv3) that resists an average of 67.8% of jailbreak attempts. The mitigation relies on "Layered Morphology," a training approach aimed at aligning language models more closely with human values.

Warning

This repository contains content that many readers may find harmful or offensive. Reader discretion is advised; please proceed with caution.

Overview

  • Why Jailbreaks?: Exploring the importance of assessing language model safety and the effectiveness of jailbreaks as a method to test these models.
  • The BetterDAN Jailbreak: Details on how the BetterDAN Jailbreak method works and its application in testing SOTA models.
  • SOTA Models Tested: A list of the state-of-the-art models that were subjected to jailbreak attempts, including outcomes.
  • Jailbreak Attacks on GPT2XL: Specific examples of how GPT2XL responded to jailbreak attempts and the modifications that led to improved resistance.
  • Layered Morphology and RLLM: An introduction to Reinforcement Learning with Layered Morphology (RLLM) and its role in improving model resilience.

For more information, feel free to browse this Visual Map.
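If you want to inspect the dataset files locally before running your own experiments, a minimal sketch along these lines may help. The file name below is hypothetical; substitute the actual file paths in this repository, and adjust the parsing if a file is stored as JSON or CSV rather than plain text.

```python
# Minimal sketch for inspecting a plain-text dataset file from this repository.
# NOTE: "rllmv3_dataset.txt" is a hypothetical file name -- replace it with the
# actual path of the file you want to examine.
from pathlib import Path


def load_examples(path: str) -> list[str]:
    """Read a plain-text dataset file and return its non-empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    examples = load_examples("rllmv3_dataset.txt")
    print(f"Loaded {len(examples)} examples")
    print(examples[0][:200])  # preview the beginning of the first example
```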

Contributions and Feedback

I extend my apologies to the AI labs and owners of the language models affected in the course of this research. The alignment problem is a critical area of study, and your understanding of the motivations behind these experiments is greatly appreciated. Contributions and feedback are welcome, especially in discussions about improving ethical behavior and strengthening defenses against jailbreaks in language models.

For further discussions, insights, or questions, feel free to leave a comment or reach out through the repository's issues section.

Thank you! 😊
