The goal of the project was to simulate collaborative behaviour in a custom multi-agent reinforcement learning (MARL) environment. We look at two types of algorithms: Independent Q-Learning (IQL) and Value Decomposition Network (VDN). The environment used is a Food Collector env detailed in the environment description section. The results are shown below. We can see clear collaborative behaviour from VDN, while the independent learning approach leads to more individualistic, greedy behaviour:
VDN | IQL |
---|---|
The motivation for this project comes mainly from two sources:
- Simulating Green Beard Altruism, which explores the effects of natural selection on behaviour
- Emergent Tool Use from Multi-Agent Interaction by OpenAI, which explores complex collaborative behaviour in RL agents.
Multi-Agent Reinforcement Learning, custom Gym environment, Value Decomposition Network, Independent Q-Learning, Food Collector env
The environment is implemented as an 11x11 grid world. The agents are colored red and orange for easier identification, the food is colored green, home is colored blue and walls are colored grey. The environment is based on and fully compatible with OpenAI Gym.
The objective of the game is simple: one of the agents needs to eat the food and then they both need to return home. The game only ends if the food is eaten and both agents are in the home area. Moreover, the agents get bonus points if they are both next to the food when it is eaten. Therefore, the expected optimal behaviour is:
- Both agents get close to the food (agent that spawns closer to food waits for the other agent)
- One of them eats the food
- They both return home straight after
By contrast, greedy behaviour would be for the agent closer to the food to eat it immediately, without waiting for the other agent.
For simplicity and faster training, we use a feature vector for the state representation. The observable state space for each agent consists of a 7-element vector:
- Two elements to describe the relative x and y distance to the food
- Two elements to describe the relative x and y distance to the home
- Two elements to describe the relative x and y distance to the other agent
- A binary element describing whether food has been eaten yet or not
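The 7-element vector described above could be assembled roughly as follows. This is a minimal sketch, not the project's actual code: the function name `build_observation`, the `(x, y)` tuple representation, the sign convention (target minus agent), and the use of a single reference point for the home area are all assumptions made for illustration.

```python
from typing import List, Tuple

Position = Tuple[int, int]  # (x, y) grid coordinates; an assumed representation


def build_observation(
    agent: Position,
    other: Position,
    food: Position,
    home: Position,
    food_eaten: bool,
) -> List[int]:
    """Return a 7-element feature vector for one agent.

    Assumption: relative distances are signed offsets (target minus agent),
    and `home` is a single reference cell standing in for the home area.
    """

    def rel(target: Position) -> Tuple[int, int]:
        return target[0] - agent[0], target[1] - agent[1]

    food_dx, food_dy = rel(food)
    home_dx, home_dy = rel(home)
    other_dx, other_dy = rel(other)
    # Order follows the list above: food, home, other agent, food-eaten flag.
    return [food_dx, food_dy, home_dx, home_dy, other_dx, other_dy, int(food_eaten)]
```

Each agent would receive its own vector, with `agent` and `other` swapped, so the two observations differ even though they describe the same grid.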
Each agent has 4 actions to choose from: move up, down, left or right. If an illegal move is chosen, such as moving into a wall or colliding with the other agent, the agent stays in place.
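The movement rule above can be sketched as a small helper. The action indices in `MOVES`, the helper name `apply_move`, the convention that `y` grows downwards, and the treatment of the grid edge as a wall are assumptions for illustration; the sketch also resolves one agent at a time and does not cover how simultaneous moves into the same cell are handled.

```python
from typing import Dict, Set, Tuple

Position = Tuple[int, int]

# Hypothetical action encoding: 0 = up, 1 = down, 2 = left, 3 = right.
# Assumes y increases downwards, as is common for grid rendering.
MOVES: Dict[int, Position] = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}


def apply_move(
    agent: Position,
    action: int,
    other: Position,
    walls: Set[Position],
    size: int = 11,
) -> Position:
    """Return the agent's new position, or its old one if the move is illegal."""
    dx, dy = MOVES[action]
    target = (agent[0] + dx, agent[1] + dy)
    in_bounds = 0 <= target[0] < size and 0 <= target[1] < size
    # Illegal moves (off-grid, into a wall, into the other agent) leave the agent in place.
    if not in_bounds or target in walls or target == other:
        return agent
    return target
```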
To observe collaborative behaviour we had to set the reward system so that it promotes collaboration. Therefore, we introduced bonus rewards for close proximity to the food when it is eaten.
Positive rewards | Negative rewards |
---|---|
*Reward given and game ends only if the food is eaten
The state and action spaces in our environment were simple enough that we could implement a tabular solution such as IQL. This method works by training each agent separately and treating the other agent as part of the environment. In IQL each agent tries to maximize its own reward and is optimized using its own objective function.
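A tabular learner of this kind can be sketched as below: each agent owns one instance and applies the standard Q-learning update to its own table using only its own observation and reward. The class name `IndependentQLearner`, the epsilon-greedy policy, and the hyperparameter values are assumptions for illustration, not the settings used in this project.

```python
import random
from collections import defaultdict
from typing import DefaultDict, Hashable, List

N_ACTIONS = 4  # up, down, left, right


class IndependentQLearner:
    """One tabular Q-learner per agent; the other agent is just part of the environment."""

    def __init__(self, alpha: float = 0.1, gamma: float = 0.99, epsilon: float = 0.1):
        # Hyperparameter defaults are placeholders, not the project's values.
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q: DefaultDict[Hashable, List[float]] = defaultdict(
            lambda: [0.0] * N_ACTIONS
        )

    def act(self, state: Hashable) -> int:
        # Epsilon-greedy action selection over this agent's own table.
        if random.random() < self.epsilon:
            return random.randrange(N_ACTIONS)
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action: int, reward: float, next_state, done: bool) -> None:
        # Standard Q-learning target, computed from this agent's own reward only.
        target = reward if done else reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

The state passed in must be hashable, so the 7-element observation would be converted to a tuple before it is used as a table key.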
At a lower level, VDN is similar to IDQL. There are multiple DQL agents, each with its own network and its own state representation input. The key difference is that the networks are optimized using a joint value function (Figure below). VDN backpropagates the total team reward signal to each of the individual networks. As a result, the agents optimize their behaviour towards the benefit of all agents, promoting collaboration.
Source: https://arxiv.org/pdf/1706.05296.pdf
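In the VDN paper the joint value is the sum of the per-agent values, Q_tot = Q_1(o_1, a_1) + Q_2(o_2, a_2), and a single TD error on Q_tot is used to train all networks. The sketch below computes that error from plain lists of Q-values; the function name `vdn_td_error` and the default `gamma` are assumptions, and in a real implementation the Q-values would come from the agents' networks (with the next-step values typically taken from target networks) and the squared error would be backpropagated by an autodiff framework.

```python
from typing import List, Sequence


def vdn_td_error(
    q_values: Sequence[Sequence[float]],       # per agent: Q_i(o_i, .) for every action
    actions: Sequence[int],                    # per agent: the action actually taken
    next_q_values: Sequence[Sequence[float]],  # per agent: Q_i(o_i', .) at the next step
    team_reward: float,
    done: bool,
    gamma: float = 0.99,
) -> float:
    """TD error on the summed (joint) value used by VDN."""
    # Joint value of the chosen joint action: sum of each agent's chosen-action value.
    q_tot = sum(q[a] for q, a in zip(q_values, actions))
    # Greedy joint value at the next step decomposes into per-agent maxima.
    next_q_tot = 0.0 if done else sum(max(q) for q in next_q_values)
    target = team_reward + gamma * next_q_tot
    return target - q_tot
```

Because Q_tot is a plain sum, its gradient with respect to each agent's chosen Q-value is 1, so every network receives the same team-level error signal. This is what distinguishes the update from IQL, where each agent's error depends only on its own reward.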
For IQL we observed greedy behaviour, where the agent closer to the food went straight for it without waiting for the other agent. Interestingly, the other agent anticipated this and moved towards home, ignoring the position of the food. This behaviour was agent-invariant: we did not end up with one agent that always stays near home and another that always goes for the food. The only case in which both agents went for the food was when it spawned at a similar distance from both of them. In this case both moved towards it; however, given the previous observations, this behaviour is better interpreted as competitive than collaborative.
In the case of VDN we observed full collaboration. In each episode, the agents waited for each other before eating the food, thus obtaining all the bonus rewards. This result is expected since, by definition, VDN aims to optimize the team reward. It also shows the contrast between independent learning and more interconnected methods: in the case of IQL each agent prioritizes itself, while in the case of VDN the agents prioritize the team as a whole.
The key issue with independent learning (IQL) is that the behaviour of the other agent changes over time, so from each agent's point of view the environment is non-stationary and we have no convergence guarantees. This issue was prominent in our case: the agents learned particular values for each state-action pair, but because the other agent's behaviour kept changing, those values quickly became outdated. The agents therefore had to learn and re-learn the state-action values several times, leading to convergence only after around 500,000 episodes.
On the other hand, when using VDN we could see good results after as few as 10,000 episodes. The only caveat here was tuning the memory size: if it was too small or too large, it led to unstable learning (agents not converging at all).
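Assuming the "memory" above refers to an experience replay buffer, as is usual for DQN-style methods, the size being tuned corresponds to a `capacity` parameter like the one in this minimal sketch. The class name `ReplayMemory` and its interface are hypothetical and not taken from the project.

```python
import random
from collections import deque
from typing import Any, Deque, List


class ReplayMemory:
    """Fixed-size buffer of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity: int):
        # `capacity` is the memory size: how many recent transitions are kept.
        self.buffer: Deque[Any] = deque(maxlen=capacity)

    def push(self, transition: Any) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int) -> List[Any]:
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)
```

A small capacity means batches are drawn from very recent, highly correlated experience, while a large one keeps transitions generated by much older policies. Either extreme is a plausible source of the instability described above, although the source does not say which effect dominated.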