Transformer Resources

books

Natural Language Processing with Transformers, Lewis Tunstall, Leandro von Werra, Thomas Wolf, 2022

github repo with source code: https://github.com/nlp-with-transformers/notebooks

articles

Adam: A Method for Stochastic Optimization, D. Kingma et al, 2014
Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning, Roemmele et al, 2011
Catastrophic Interference In Connectionist Networks: The Sequential Learning Problem, McCloskey, Cohen, 1989
Attention Is All You Need, Vaswani et al, Google Brain, 2017
The Annotated Transformer - delving into Vaswani's paper "Attention Is All You Need", 2018
The Illustrated Transformer, Jay Alammar's blog, 2021
Attention in Natural Language Processing, Galassi et al., 2020
Deriving Machine Attention from Human Rationales, Y. Bao et al, 2018
HyperAttention: Long-context Attention in Near-Linear Time, Insu Han et al, 2023
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., Google AI, 2019
FAIRSEQ: A Fast, Extensible Toolkit for Sequence Modeling, Ott et al., 2019
Autoencoders, Dor Bank et al, 2021
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al., 2014
On the Properties of Neural Machine Translation: Encoder-Decoder Approaches, Cho et al., U de Montreal, 2014
Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network, A. Sherstinsky, 2021
Bidirectional Recurrent Neural Networks, Mike Schuster, Kuldip Paliwal, 1997
Neural Networks for Pattern Recognition, C. M. Bishop, 1995
Translation Modeling with Bidirectional Recurrent Neural Networks, M. Sundermeyer, et al, 2014
Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations, E. Kiperwasser, Y. Goldberg, 2016
A Decomposable Attention Model for Natural Language Inference, Parikh et al., Google Research, 2016
Sequence to Sequence Learning with Neural Networks, Sutskever et al, Google Research, 2014
Transforming Auto-encoders, G. Hinton, A. Krizhevsky, et al., 2011
A Neural Probabilistic Language Model, Y. Bengio et al, 2003
Learning to combine foveal glimpses with a third-order Boltzmann machine, H. Larochelle and G. Hinton, 2010
Long Short-Term Memory, Sepp Hochreiter et al., 1997
LSTM Can Solve Hard Long Time Lag Problems, Sepp Hochreiter, Juergen Schmidhuber, NIPS, 1996
End-to-End Continuous Speech Recognition using Attention-based Recurrent NN: First Results, Jan Chorowski, Dzmitry Bahdanau et al, 2014
Recurrent Continuous Translation Models, Nal Kalchbrenner, Phil Blunsom, Oxford U, 2013
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, K. Cho, Dzmitry Bahdanau, et al, 2014
Understanding LSTM: a tutorial into Long Short-Term Memory, R. Staudemeyer et al., 2019
Generating Sequences with Recurrent Neural Networks, Alex Graves, U of Toronto, 2014
DeLighT: Deep and Light-weight Transformer, S. Mehta et al, 2020
Meta-Transformer: A Unified Framework for Multimodal Learning, Zhang, Y, et al, 2023
Small-scale proxies for large-scale Transformer training instabilities, M. Wortsman et al, 2023
Formal Algorithms for Transformers, M. Phuong et al, DeepMind, 2022
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, Tri Dao et al, Stanford U., 2022
Boolformer: Symbolic Regression of Logic Functions with Transformers, d'Ascoli et al, 2023
Transformer-Based Direct Hidden Markov Model for Machine Translation, W. Wang et al, Aachen U, 2021
Simplifying Transformer Blocks, Bobby He et al, 2023
Introduction to Transformers: an NLP Perspective, T. Xiao et al, 2023
xLSTM: Extended Long Short-Term Memory, Maximilian Beck et al, 2024
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, Tri Dao, 2023

Understanding Transformers, Interpretability of Transformers, Mathematical Models of Transformers

A Mathematical Framework for Transformer Circuits, Nelson Elhage et al, Anthropic, 2021

the full article online: here
Understanding Transformer Reasoning Capabilities via Graph Algorithms, Clayton Sanford et al, 2024
Transformers need glasses! Information over-squashing in language tasks, Federico Barbero et al, Google DeepMind, 2024
Understanding Transformers via N-gram Statistics, Timothy Nguyen, Google DeepMind, 2024
When Can Transformers Count to n? G. Yehudai et al, NYU, DeepMind, 2024

Embeddings

Embeddings in Natural Language Processing: Theory and Advances in Vector Representation of Meaning, M. Pilehvar, J. Camacho-Collados, 2021
Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov et al, Google, 2013
Factors Influencing the Surprising Instability of Word Embeddings, L. Wendlandt et al, U. Michigan Ann Arbor, 2018
Is Cosine-Similarity of Embeddings Really About Similarity? Harald Steck, Chaitanya Ekanadham, 2024

In-context learning with Transformers

Hierarchical Attention Networks for Document Classification, Z. Yang et al, CMU, 2016
Attention using Context Vector: Hierarchical Attention Networks for Document Classification, DataScience.StackExchange, 2017
What is the difference between positional vector and attention vector used in transformer model? DataScience.StackExchange, 2019
Why does an attention layer in a transformer learn context?, DataScience.StackExchange, 2020
Data Distributional Properties Drive Emergent In-Context Learning in Transformers, S. Chan et al, DeepMind, NeurIPS, 2022
In-Context Learning with Transformer-Based Neural Sequence Models, Jair Ribeiro, Towards AI publication, 2023
In-context Learning and Induction Heads, Catherine Olsson et al, Anthropic, 2023
Transformers Learn In-Context by Gradient Descent, Johannes von Oswald et al, 2023
Transformers as Algorithms: Generalization and Stability in In-context Learning, Y. Li et al, 2023
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes, S. Garg, 2023

related repo: https://github.com/dtsip/in-context-learning

related youtube presentation: https://www.youtube.com/watch?v=DiJsg93zQDc
The Transient Nature of Emergent In-Context Learning in Transformers, A. Singh et al, UCL, 2023
Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions, S. Bhattamishra, Oxford U., 2023
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models, D. Fu et al, USC, 2023
Learning linear models in-context with transformers with Spencer Frei (UC Davis), Imperial College London, youtube video
In-context Learning in Transformers - SLT Seminar 46, youtube video

Cross-Layer Attention in Transformers

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, William Brandon et al, MIT CSAIL, 2024

Reinforcement Learning in Transformers

Decision Transformer: Reinforcement Learning via Sequence Modeling, Lili Chen et al, UC Berkeley, 2021
Stanford CS 25: Lecture 4 Decision Transformer: Reinforcement Learning via Sequence Modeling, youtube video
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions, Yevgen Chebotar et al, DeepMind, 2023

related repo: https://qtransformer.github.io/

Hyper-Networks, MotherNet and PFNs (Prior-Data Fitted Networks)

HyperNetworks, David Ha, Google Brain, 2017
MotherNet: A Foundational Hypernetwork for Tabular Classification, A.C. Mueller et al, Microsoft Research, 2023
Transformers Can Do Bayesian Inference, Samuel Mueller et al, U of Freiburg, 2023
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second, Noah Hollmann et al, 2022

Sequential Decision Modeling and Predictive Sequence Models

Stanford CS 25: Lecture 4 Decision Transformer: Reinforcement Learning via Sequence Modeling, youtube video
Decision Transformer: Reinforcement Learning via Sequence Modeling, Lili Chen et al, UC Berkeley, 2021
Decision Transformer: Reinforcement Learning via Sequence Modeling (Research Paper Explained), youtube video
Using Sequences of Life-events to Predict Human Lives, Germans Savcisens et al, 2023

... More articles on Transformers

Vision Transformers

Visualizing Attention in Vision Transformer with Aryan Jadon
Vision Transformers, Explained, Skylar Jean Callis, Towards Data Science, Feb, 2024
Comparison of Convolutional Neural Networks and Vision Transformers (ViTs) with Ilias Papastratis
Do Vision Transformers See Like Convolutional Neural Networks? M. Raghu, Google Brain, 2022
How Do Vision Transformers Work? N. Park et al, 2022
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, A. Dosovitskiy, 2021
Transformers for Image Recognition at Scale, Neil Houlsby and Dirk Weissenborn, Dec 2020, blog
Why Transformers are Slowly Replacing CNNs in Computer Vision? Pranoy Radhakrishnan, Aug 2021, Becoming Human: Artificial Intelligence Magazine
Vision Transformers (ViT) in Image Recognition – 2024 Guide, Gaudenz Boesch, viso.ai blog

Long Short Term Memory (the precursor of Transformers)

xLSTM: Extended Long Short-Term Memory, Maximilian Beck et al, 2024
Long Short Term Memory, Sepp Hochreiter, 1997
LSTM Can Solve Hard Long Time Lag Problems, Sepp Hochreiter et al, 1996
Understanding LSTM: a Tutorial into Long Short Term Memory Recurrent Networks, Ralf C. Staudemeyer et al, 2019
Understanding LSTM: Colah's Blog, 2015
Fundamentals of RNN and LSTM, Alex Sherstinsky, MIT, 2020

State Space Models (an alternative to Transformers)

Multi-Head State Space Model for Speech Recognition, Yassir Fathullah et al, U of Cambridge, 2023
HiPPO: Recurrent Memory with Optimal Polynomial Projections, A. Gu et al, Stanford U., 2020
Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers, A. Gu et al, 2021
Diagonal State Spaces are as Effective as Structured State Spaces, A. Gupta, A. Gu et al, 2022
It’s Raw! Audio Generation with State-Space Models, K. Goel, A. Gu, et al, 2022
How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections, A. Gu et al, Stanford U., 2022
Efficiently Modeling Long Sequences with Structured State Spaces, K. Goel, A. Gu et al, 2022
Hungry Hungry Hippos: Towards Language Modeling with State Space Models, D. Fu, T. Dao, 2023
Mamba: Linear-Time Sequence Modeling with Selective State Spaces, A. Gu, T. Dao, CMU, 2023
Mamba repo: https://github.com/state-spaces/mamba
Mamba Explained, Kola Ayonrinde, The Gradient, March 2024
Mamba: Can it replace Transformers? Vishal Rajput, Medium Jan 8, 2024
as a pdf file: here
Why Mamba was rejected? Joe El Khoury, Medium, Feb 28, 2024
as a pdf file: here
Zamba: A Compact 7B SSM Hybrid Model, Paolo Glorioso et al, 2024

related repo: https://github.com/kyegomez/Zamba

Time Series Forecasting

Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot Forecasting of Multivariate Time Series, V. Ekambaram et al, IBM, 2024
iTransformer: The Latest Breakthrough in Time Series Forecasting, Marco Peixeiro, Towards Data Science, April 2024

relevant paper: iTransformer: Inverted Transformers Are Effective for Time Series Forecasting, Yong Liu et al, 2023
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel et al, Google, 2023

relevant repo: https://github.com/google-research/text-to-text-transfer-transformer
TimesFM: Google's Foundation Model For Time-Series Forecasting, Nikos Kafritsas, 2023, AI Horizon Forecast
MOIRAI: Salesforce's Foundation Transformer For Time-Series Forecasting, Nikos Kafritsas, 2023, AI Horizon Forecast

relevant paper: Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting, K. Rasul et al, 2023

relevant paper: A decoder-only foundation model for time-series forecasting, A. Das et al, 2023

relevant paper: Chronos: Learning the Language of Time Series, AF Ansari et al, 2024

relevant paper: Unified Training of Universal Time Series Forecasting Transformers, Woo, G et al, 2024

Medium

The A-Z of Transformers: Everything You Need to Know with François Porcher

related paper: Neural Machine Translation by Jointly Learning to Align and Translate, D. Bahdanau et al, 2015

related repo: Transformers from Scratch
Transformers — Intuitively and Exhaustively Explained with Daniel Warfield
De-coded: Transformers explained in plain English with Chris Hughes
Transformers Explained Visually — Not Just How, but Why They Work So Well, Ketan Doshi, Jun 2, 2021
Transformers Explained Visually (Part 1): Overview of Functionality with Ketan Doshi, Dec 13, 2020
Transformers Explained Visually (Part 2): How it works, step-by-step with Ketan Doshi, Jan 2, 2021
Transformers Explained Visually (Part 3): Multi-head Attention, deep dive, Jan 16, 2021
Explainable AI: Visualizing Attention in Transformers with Abby Morgan
Transformers - The Bigger The Better? with Jordi Torres
How to Take Advantage of the New Disruptive AI Technology Called Transformers with Jordi Torres
Transformers: The New Gem of Deep Learning with Jordi Torres
Transfer Learning: The Democratization of Transformers with Jordi Torres
Visualizing Attention in Vision Transformer with Aryan Jadon
Vision Transformers, Explained, Skylar Jean Callis, Towards Data Science, Feb, 2024
The Transformer Architecture of GPT Models with Beatriz Stollnitz
Learning Transformers Code First Part 1 - The Setup with Lily Hughes-Robinson
Learning Transformers Code First Part 2 - GPT Up Close and Personal with Lily Hughes-Robinson
Were Abstract Painters The First Encoders with Wouter van Heeswijk
Comparison of Convolutional Neural Networks and Vision Transformers (ViTs) with Ilias Papastratis

related paper: Do Vision Transformers See Like Convolutional Neural Networks? M. Raghu, Google Brain, 2022

related paper: How Do Vision Transformers Work? N. Park et al, 2022

related paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, A. Dosovitskiy, 2021

related blog: https://blog.research.google/2020/12/transformers-for-image-recognition-at.html

related blog: https://becominghuman.ai/transformers-in-vision-e2e87b739feb

related blog: https://viso.ai/deep-learning/vision-transformer-vit/
Understanding Temporal Fusion Transformer with Mouna Labiadh

related article: Temporal Fusion Transformer for Interpretable Multi-horizon Time Series Forecasting
Forecasting book sales with Temporal Fusion Transformer with Mouna Labiadh
Personalized Recommendations with Transformers with Enis Teper
Hidden Markov Models Simplified with Sanjay Dorairaj
Rubik’s cubes and Markov chains with Eduardo Teste
Implementing Seq2Seq Models for Efficient Time Series Forecasting with Max Brenner
Fine-Tune Smaller Transformer Models: Text Classification, Ida Silfverskiöld, 2024

Classes and Lectures on Transformers
