
Various fundamental reinforcement learning algorithms implemented from scratch


aylint/rl-algorithms


RL Algorithm Implementations


This project contains implementations of some basic reinforcement learning algorithms. I use a simple toy text environment to minimize resource requirements. The algorithms will also work on more complex environments and learn the task given enough time and CPU.

Temporal-Difference Learning

Temporal-Difference (TD) learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. It lies at the heart of reinforcement learning. In a single sentence, I would define TD as re-evaluating your prediction after taking a step. You have a prediction for your destination. When you take a step, re-estimate. Rinse and repeat. The motto is, "why wait until the goal is reached to start learning?".
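To make that concrete, here is a minimal TD(0) state-value prediction sketch. It is not the repository's code; it assumes a tabular environment with a Gymnasium-style API, and the function and variable names are illustrative.

```python
import numpy as np

def td0_prediction(env, policy, n_episodes=500, alpha=0.1, gamma=0.99):
    """Estimate V(s) for a fixed policy with one-step TD updates."""
    V = np.zeros(env.observation_space.n)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Re-estimate V(state) immediately, using the bootstrapped target.
            td_target = reward + gamma * V[next_state] * (not terminated)
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```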

SARSA: On Policy TD Control

Consider an episode consisting of states, actions and rewards: S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, ... Sound familiar? Look again: it is SARSA.

[Figure: SARSA pseudocode, from Sutton & Barto]

Implementing the algorithm is straightforward. Check out my implementation here.
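As a rough illustration (not the repository's code), a tabular SARSA loop can look like the sketch below, again assuming a Gymnasium-style discrete environment; the names are mine.

```python
import numpy as np

def sarsa(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def eps_greedy(state):
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        state, _ = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = eps_greedy(next_state)
            # On-policy target: bootstrap from the action actually chosen next.
            td_target = reward + gamma * Q[next_state, next_action] * (not terminated)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state, action = next_state, next_action
    return Q
```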

Q-learning: Off-policy TD Control

In Q-learning, the learned action-value function Q directly approximates the optimal action-value function, independent of the policy being followed. The Q-values are usually stored in a table, called the Q-table, for easy access.

[Figure: Q-learning pseudocode, from Sutton & Barto]

Check my implementation here
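For comparison with SARSA, here is a minimal off-policy Q-learning sketch under the same assumptions (Gymnasium-style environment, illustrative names; not the repository's code). The only essential difference is that the target bootstraps from the greedy action.

```python
import numpy as np

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy.
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Off-policy target: bootstrap from the greedy action,
            # regardless of what the behaviour policy does next.
            td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q
```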

Expected SARSA

Instead of using the maximum over next state-action pairs, use the expected value under the current policy; then you have Expected SARSA.

[Figure: Expected SARSA pseudocode, from Sutton & Barto]

My implementation is here
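Only the target changes relative to SARSA and Q-learning. A small sketch of the expected value under an epsilon-greedy policy (illustrative names, not the repository's code):

```python
import numpy as np

def expected_q(Q, next_state, epsilon, n_actions):
    """E_pi[Q(s', A')] under an epsilon-greedy policy pi."""
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[next_state])] += 1.0 - epsilon
    return float(np.dot(probs, Q[next_state]))

# Used inside the usual TD update:
# td_target = reward + gamma * expected_q(Q, next_state, epsilon, n_actions)
# Q[state, action] += alpha * (td_target - Q[state, action])
```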

Planning

Model-based methods try to predict the future and rely on planning, unlike model-free algorithms such as TD. A common application area is maze problems.

Dyna Q

The Dyna-Q algorithm integrates planning, acting and learning. The same reinforcement learning update is applied for both direct learning and planning.

[Figure: Dyna-Q pseudocode, from Sutton & Barto]

My implementation
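The sketch below shows how the three parts fit together in tabular Dyna-Q: direct RL from real experience, model learning, and planning from simulated experience. It assumes a deterministic model and a Gymnasium-style environment; it is not the repository's code and the names are illustrative.

```python
import random
import numpy as np

def dyna_q(env, n_episodes=200, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    model = {}  # (state, action) -> (reward, next_state, terminated)

    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # (1) Direct RL: the usual Q-learning update from real experience.
            Q[state, action] += alpha * (
                reward + gamma * np.max(Q[next_state]) * (not terminated) - Q[state, action]
            )
            # (2) Model learning: remember what the environment did.
            model[(state, action)] = (reward, next_state, terminated)
            # (3) Planning: the same update, applied to simulated experience.
            for _ in range(n_planning):
                s, a = random.choice(list(model.keys()))
                r, s_next, term = model[(s, a)]
                Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not term) - Q[s, a])

            state = next_state
    return Q
```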

Prioritized Sweeping

Instead of planning forward toward the goal, the prioritized sweeping algorithm works backward from the goal. Instead of selecting state-action pairs uniformly at random during planning, we maintain a priority queue ordered by how much each pair's value would change.

[Figure: Prioritized sweeping pseudocode, from Sutton & Barto]

My implementation
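A compact sketch of the planning step only, assuming a deterministic tabular model; the priority queue is a min-heap keyed on the negated priority, predecessors are tracked so updates sweep backward, and all names here are mine rather than the repository's.

```python
import heapq
import numpy as np

def prioritized_sweeping_step(Q, model, predecessors, s, a, r, s_next,
                              pqueue, alpha=0.1, gamma=0.95, theta=1e-4, n_planning=10):
    # Record the transition and remember (s, a) as a predecessor of s_next.
    model[(s, a)] = (r, s_next)
    predecessors.setdefault(s_next, set()).add((s, a))

    # Priority = magnitude of the potential change to Q(s, a).
    p = abs(r + gamma * np.max(Q[s_next]) - Q[s, a])
    if p > theta:
        heapq.heappush(pqueue, (-p, (s, a)))

    # Planning: process the highest-priority pairs first, then push
    # their predecessors so the sweep propagates backward.
    for _ in range(n_planning):
        if not pqueue:
            break
        _, (s_p, a_p) = heapq.heappop(pqueue)
        r_p, s_p_next = model[(s_p, a_p)]
        Q[s_p, a_p] += alpha * (r_p + gamma * np.max(Q[s_p_next]) - Q[s_p, a_p])
        for (s_pre, a_pre) in predecessors.get(s_p, set()):
            r_pre, _ = model[(s_pre, a_pre)]
            p = abs(r_pre + gamma * np.max(Q[s_p]) - Q[s_pre, a_pre])
            if p > theta:
                heapq.heappush(pqueue, (-p, (s_pre, a_pre)))
```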

References

The algorithm pseudo-codes and figures are from "the RL book" of Sutton and Barto.

Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.