
[P2] Add Sparse Autoencoder Interventions #164

Merged
merged 2 commits into from
Jul 4, 2024

Conversation

explanare
Collaborator

Description

Add an AutoencoderLayer and an AutoencoderIntervention to support interpretability methods that use autoencoders to learn an interpretable feature space, including Sparse Autoencoders.

  • The AutoencoderLayer defines an autoencoder with a single-layer encoder and a single-layer decoder. Users can additionally define custom autoencoders by extending the base class AutoencoderLayerBase.
  • The AutoencoderIntervention defines an intervention that performs interchange interventions in the latent space of the autoencoder.
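To illustrate the structure described above, here is a minimal numpy sketch of a single-layer autoencoder (one linear encoder layer with a ReLU, one linear decoder layer). The class and variable names are hypothetical, for illustration only, and are not pyvene's actual implementation.

```python
# Hypothetical toy autoencoder, illustrating the single-layer
# encoder/decoder structure described for AutoencoderLayer.
import numpy as np

class ToyAutoencoder:
    def __init__(self, input_dim, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Single-layer encoder and single-layer decoder weights.
        self.W_enc = rng.normal(scale=0.1, size=(input_dim, latent_dim))
        self.b_enc = np.zeros(latent_dim)
        self.W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))
        self.b_dec = np.zeros(input_dim)

    def encode(self, x):
        # ReLU keeps latent activations non-negative (and, with an
        # appropriate training objective, sparse).
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def decode(self, z):
        return z @ self.W_dec + self.b_dec

    def forward(self, x):
        return self.decode(self.encode(x))

ae = ToyAutoencoder(input_dim=8, latent_dim=16)
x = np.ones((2, 8))
z = ae.encode(x)
recon = ae.forward(x)
print(z.shape, recon.shape)  # (2, 16) (2, 8)
```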

The AutoencoderIntervention supports loading pre-trained autoencoders trained outside the pyvene framework, using the get_intervenable_with_autoencoder function below:

def get_intervenable_with_autoencoder(
    model, autoencoder, intervention_dimensions, layer):
  intervention = pv.AutoencoderIntervention(
      embed_dim=autoencoder.input_dim,
      latent_dim=autoencoder.latent_dim)
  # Copy the pretrained autoencoder.
  intervention.autoencoder.load_state_dict(autoencoder.state_dict())
  intervention.set_interchange_dim(intervention_dimensions)
  inv_config = pv.IntervenableConfig(
      model_type=type(model),
      representations=[
          pv.RepresentationConfig(
              layer,  # layer
              "block_output",  # intervention repr
              "pos",  # intervention unit
              1,  # max number of unit
              intervention=intervention,
              latent_dim=autoencoder.latent_dim)
      ],
      intervention_types=pv.AutoencoderIntervention,
  )
  intervenable = pv.IntervenableModel(inv_config, model)
  intervenable.set_device("cuda")
  intervenable.disable_model_gradients()
  return intervenable
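Conceptually, the intervention above swaps selected latent dimensions of the base representation with those of a source representation and then decodes back. A simplified numpy sketch of that latent-space interchange (illustrative only; `latent_interchange` is a hypothetical helper, not pyvene internals):

```python
import numpy as np

def latent_interchange(base, source, encode, decode, interchange_dims):
    """Swap the given latent dims of `base` with `source`'s, then decode."""
    z_base = encode(base)
    z_source = encode(source)
    # Interchange intervention: overwrite the chosen latent dimensions.
    z_base[..., interchange_dims] = z_source[..., interchange_dims]
    return decode(z_base)

# Identity encode/decode keeps the example easy to inspect.
encode = lambda x: x.copy()
decode = lambda z: z
base = np.array([[0.0, 0.0, 0.0, 0.0]])
source = np.array([[1.0, 2.0, 3.0, 4.0]])
out = latent_interchange(base, source, encode, decode, [1, 3])
print(out)  # [[0. 2. 0. 4.]]
```

With a trained sparse autoencoder in place of the identity maps, the swapped dimensions correspond to learned interpretable features rather than raw activation coordinates.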

The resulting intervenable, including the intervention dimensions and the autoencoder, can be saved as:

intervenable.save("path/to/save/dir")

Fix #77

Testing Done

[internal only] https://colab.research.google.com/drive/1_fxM7JUqkMy6Erz6K1JV0NwQBw1r8g0k?usp=sharing

Will add this Colab as a tutorial.

Checklist:

  • My PR title strictly follows the format: [Your Priority] Your Title
  • I have attached the testing log above
  • I have provided enough comments in my code
  • I have updated the documentation
  • I have added tests for my changes

@explanare explanare requested a review from frankaging July 4, 2024 20:03
@frankaging
Collaborator

Thanks! Merging this despite the failed check. The failure is due to a versioning issue with huggingface-hub; I will take care of it after this change.

@frankaging frankaging merged commit 5b35936 into main Jul 4, 2024
1 check failed