
[P2] Add Sparse Autoencoder Interventions #164

Merged
merged 2 commits into from
Jul 4, 2024

Conversation

explanare
Collaborator

Description

Add an AutoencoderLayer and an AutoencoderIntervention to support interpretability methods that use autoencoders to learn an interpretable feature space, including Sparse Autoencoders.

  • The AutoencoderLayer defines an autoencoder with a single-layer encoder and a single-layer decoder. Users can additionally define custom autoencoders by extending the base class AutoencoderLayerBase.
  • The AutoencoderIntervention defines an intervention that performs interchange interventions in the latent space of the autoencoder.
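To illustrate the structure described above, here is a minimal numpy sketch of a single-layer autoencoder (one linear encoder layer with a ReLU, one linear decoder layer). The class and variable names are hypothetical, for illustration only, and are not pyvene's actual implementation.

```python
# Hypothetical toy autoencoder, illustrating the single-layer
# encoder/decoder structure described for AutoencoderLayer.
import numpy as np

class ToyAutoencoder:
    def __init__(self, input_dim, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Single-layer encoder and single-layer decoder weights.
        self.W_enc = rng.normal(scale=0.1, size=(input_dim, latent_dim))
        self.b_enc = np.zeros(latent_dim)
        self.W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))
        self.b_dec = np.zeros(input_dim)

    def encode(self, x):
        # ReLU keeps latent activations non-negative (and, with an
        # appropriate training objective, sparse).
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def decode(self, z):
        return z @ self.W_dec + self.b_dec

    def forward(self, x):
        return self.decode(self.encode(x))

ae = ToyAutoencoder(input_dim=8, latent_dim=16)
x = np.ones((2, 8))
z = ae.encode(x)
recon = ae.forward(x)
print(z.shape, recon.shape)  # (2, 16) (2, 8)
```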

The AutoencoderIntervention supports loading pre-trained autoencoders trained outside the pyvene framework, using the get_intervenable_with_autoencoder function below:

def get_intervenable_with_autoencoder(
    model, autoencoder, intervention_dimensions, layer):
  intervention = pv.AutoencoderIntervention(
      embed_dim=autoencoder.input_dim,
      latent_dim=autoencoder.latent_dim)
  # Copy the pretrained autoencoder.
  intervention.autoencoder.load_state_dict(autoencoder.state_dict())
  intervention.set_interchange_dim(intervention_dimensions)
  inv_config = pv.IntervenableConfig(
      model_type=type(model),
      representations=[
          pv.RepresentationConfig(
              layer,  # layer
              "block_output",  # intervention repr
              "pos",  # intervention unit
              1,  # max number of unit
              intervention=intervention,
              latent_dim=autoencoder.latent_dim)
      ],
      intervention_types=pv.AutoencoderIntervention,
  )
  intervenable = pv.IntervenableModel(inv_config, model)
  intervenable.set_device("cuda")
  intervenable.disable_model_gradients()
  return intervenable
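Conceptually, the intervention above swaps selected latent dimensions of the base representation with those of a source representation and then decodes back. A simplified numpy sketch of that latent-space interchange (illustrative only; `latent_interchange` is a hypothetical helper, not pyvene internals):

```python
import numpy as np

def latent_interchange(base, source, encode, decode, interchange_dims):
    """Swap the given latent dims of `base` with `source`'s, then decode."""
    z_base = encode(base)
    z_source = encode(source)
    # Interchange intervention: overwrite the chosen latent dimensions.
    z_base[..., interchange_dims] = z_source[..., interchange_dims]
    return decode(z_base)

# Identity encode/decode keeps the example easy to inspect.
encode = lambda x: x.copy()
decode = lambda z: z
base = np.array([[0.0, 0.0, 0.0, 0.0]])
source = np.array([[1.0, 2.0, 3.0, 4.0]])
out = latent_interchange(base, source, encode, decode, [1, 3])
print(out)  # [[0. 2. 0. 4.]]
```

With a trained sparse autoencoder in place of the identity maps, the swapped dimensions correspond to learned interpretable features rather than raw activation coordinates.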

The resulting intervenable, including the intervention dimensions and the autoencoder, can be saved as:

intervenable.save("path/to/save/dir")

Fix #77

Testing Done

[internal only] https://colab.research.google.com/drive/1_fxM7JUqkMy6Erz6K1JV0NwQBw1r8g0k?usp=sharing

Will add this Colab as a tutorial.

Checklist:

  • My PR title strictly follows the format: [Your Priority] Your Title
  • I have attached the testing log above
  • I have provided enough comments in my code
  • I have updated the documentation
  • I have added tests for my changes

@explanare explanare requested a review from frankaging July 4, 2024 20:03
@frankaging
Collaborator

Thanks! Merging this despite the failed check. The failure is due to a versioning issue with huggingface-hub; I will take care of it after this change.

@frankaging frankaging merged commit 5b35936 into main Jul 4, 2024
1 check failed