
Cottention Transformer

This repository contains the official implementation of the paper "Cottention: Linear Transformers With Cosine Attention".

Paper found here: https://arxiv.org/abs/2409.18747

Requirements

To install the necessary dependencies and the custom CUDA kernel, follow these steps:

  1. Install the required Python packages:

    pip install -r requirements.txt
  2. Navigate to the CUDA kernel directory and install the kernel:

    cd cuda_kernel
    python -m pip uninstall FastAttention -y
    python setup.py install
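
To confirm the kernel built and installed correctly, a quick sanity check is to import it from Python. This assumes the extension installs under the package name FastAttention (the same name used in the uninstall step above); adjust if your setup.py registers a different name:

# Sanity check: import torch first, then the compiled extension.
python -c "import torch; import FastAttention; print('FastAttention kernel loaded')"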

Training

BERT

To train the BERT Cottention model, complete the following steps (a single-GPU example is sketched after this list):

  1. Create the dataset with BERT_Trainer/create_hf_datasets.py
  2. Pre-tokenize the dataset with BERT_Trainer/map_hf_dataset.py
  3. Train the model with BERT_Trainer/train.py
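
For a quick single-GPU run, the three steps can be executed in order from the repository root. This is only a sketch; it assumes the scripts take their configuration from constants inside the files (as infer.py does) rather than from required command-line arguments:

# 1. Build the Hugging Face dataset
python BERT_Trainer/create_hf_datasets.py

# 2. Pre-tokenize the dataset
python BERT_Trainer/map_hf_dataset.py

# 3. Train on a single GPU
torchrun --nproc_per_node 1 BERT_Trainer/train.py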

To run train.py with multiple GPUs, refer to the provided training script. Here is an example of how to run the training using a job scheduler (e.g., SLURM):

#!/bin/sh

#SBATCH -J training_job
#SBATCH -p node_to_run_on
#SBATCH --exclusive
#SBATCH -o bert_train.out
#SBATCH --gres=gpu:8
#SBATCH --mem=500G

cd path/to/cottention_transformer
/path/to/venv/bin/torchrun --nproc_per_node 8 --master-port 29503 BERT_Trainer/train.py

Make sure to replace /path/to/cottention_transformer with the actual path to your Cottention Transformer repository and /path/to/venv/bin/torchrun with the path to your Python virtual environment's torchrun executable.

This script launches the training with torchrun for distributed training across multiple GPUs.

Note: The provided code assumes you are using a SLURM job scheduler. If you are using a different job scheduler or running the training script directly, you may need to modify the script accordingly.
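
For example, to launch the same single-node, multi-GPU run directly without SLURM (assuming 8 local GPUs and an activated virtual environment), something along these lines should work:

cd /path/to/cottention_transformer
torchrun --nproc_per_node 8 --master-port 29503 BERT_Trainer/train.py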

GPT

To train the GPT Cottention model, complete the following:

  1. Train the model with GPT_Trainer/train.py

To run train.py with multiple machines, refer to the provided training script. Here's an example of how to run the training using a job scheduler (e.g., SLURM):

#!/bin/bash

#SBATCH --job-name=training_job
#SBATCH -p batch
#SBATCH --exclusive
#SBATCH -o train.out
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --mem=500G

# No. Nodes
nnodes=1

# No. Tasks Per Node
nproc_per_node=8

# Determine the head node and its IP address to use as the torchrun rendezvous endpoint
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip

export LOGLEVEL=INFO

cd /path/to/cottention_transformer

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 srun /path/to/venv/bin/torchrun \
--nnodes $nnodes \
--nproc_per_node $nproc_per_node \
--rdzv_id $RANDOM \
--rdzv_backend c10d \
--rdzv_endpoint $head_node_ip:29500 \
GPT_Trainer/train.py

As with the BERT script, replace /path/to/cottention_transformer with the actual path to your Cottention Transformer repository and /path/to/venv/bin/torchrun with the path to your Python virtual environment's torchrun executable. The script determines the head node, sets up a rendezvous endpoint, and launches distributed training with torchrun across multiple nodes and GPUs.

Note: The provided code assumes you are using a SLURM job scheduler. If you are using a different job scheduler or launching torchrun directly, you may need to modify the script accordingly.
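
As a sketch of the equivalent launch without SLURM, the same torchrun command is run on every participating node, with each node pointing at the same rendezvous endpoint. HEAD_NODE_IP below is a placeholder for the head node's address:

# Run this on every node; --nnodes must match the number of participating machines.
export HEAD_NODE_IP=10.0.0.1   # placeholder: reachable address of the head node
torchrun \
    --nnodes 2 \
    --nproc_per_node 8 \
    --rdzv_id cottention_gpt \
    --rdzv_backend c10d \
    --rdzv_endpoint $HEAD_NODE_IP:29500 \
    GPT_Trainer/train.py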

Finetuning BERT

To finetune the BERT Cottention model, complete the following:

  1. Create the dataset with BERT_Trainer/create_hf_ft_datasets.py
  2. Tokenize the dataset with BERT_Trainer/map_hf_ft_datasets.py
  3. Finetune with BERT_Trainer/finetune.py

The multi-GPU launch is no different from training; see the example below. All relevant code for BERT is provided. The GPT datasets are withheld for anonymity and will be provided upon release.
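
For example, multi-GPU finetuning can be launched with the same torchrun invocation used for pretraining, pointed at finetune.py instead:

cd /path/to/cottention_transformer
torchrun --nproc_per_node 8 BERT_Trainer/finetune.py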

Inference

An inference playground for pretrained models is available in BERT_Trainer/infer.py. Set model_path to the path of the model checkpoint. In the same file, set attention_type to soft for softmax models or to cos for cosine models.
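
After those two values are set inside the script, running the playground is simply:

# model_path and attention_type ("soft" or "cos") are edited inside the script itself.
python BERT_Trainer/infer.py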

Pre-trained Models

You can download pretrained models here:

  • Hugging Face: all models.
  • Google Drive: the Cottention BERT model, trained on the Wikipedia and BookCorpus datasets.
  • Google Drive: the Cottention GPT model, trained on The Pile dataset.
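
As an example, a Hugging Face checkpoint can be fetched with the huggingface_hub CLI; the repository id below is a placeholder for whichever Cottention model you want to download:

pip install -U huggingface_hub
# Placeholder repo id; substitute the actual model repository.
huggingface-cli download <namespace>/<model-name> --local-dir ./checkpoints/cottention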

Results

BERT

Our models achieve the following performance on the GLUE benchmark:

| Model | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|---|---|---|---|---|---|---|---|---|---|
| BERT_softmax | 81.8/82.5 | 86.5 | 89.9 | 90.5 | 80.5 | 78.3 | 90.0 | 67.9 | 83.1 |
| BERT_cosine | 80.6/81.1 | 86.2 | 89.3 | 90.1 | 77.8 | 76.5 | 88.6 | 66.4 | 81.8 |

Compared with the published results of:

| Model | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|---|---|---|---|---|---|---|---|---|---|
| BERT_BASE | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |

GPT

Our GPT models obtain the following perplexity (PPL) scores:

  • Cosine 300M: 11.6
  • Cosine 1.2B: 9.9
  • Softmax 300M: 10.3
  • Softmax 1.2B: 9.1

Contributing

We welcome contributions to the Cottention Transformer repository. If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request. Make sure to follow the established coding style and guidelines.
