
Add support for pretraining [feature request] #27

Closed
skervim opened this issue May 22, 2020 · 17 comments
Labels: documentation (Improvements or additions to documentation)


skervim commented May 22, 2020

First: I'm very happy to see the new PyTorch SB3 version! Great job!

My question is whether pretraining support is planned for SB3 (as in SB: https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html). I couldn't find it mentioned in the roadmap.

In my opinion it is a very valuable feature!

@araffin added the enhancement label on May 22, 2020

araffin commented May 22, 2020

> My question is whether pretraining support is planned for SB3 (as in SB: https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html). I couldn't find it mentioned in the roadmap.

As mentioned in the design choices (see hill-a/stable-baselines#576), everything related to imitation learning (including GAIL and pretraining via behavior cloning) will be done outside SB3 (most likely in this repo: https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.).

In the meantime, you can check this repo https://github.com/joonaspu/video-game-behavioural-cloning by @Miffyli et al., where pretraining is done using PyTorch.

We may add an example though (and maybe include it in the zoo), as it is simple to implement in some cases.

@araffin added the documentation and help wanted labels on May 22, 2020

araffin commented May 22, 2020

@skervim we would be happy if you could provide such an example ;) (maybe as a colab notebook)


Miffyli commented May 22, 2020

With SB3, I think this should indeed be off-loaded to users. SB's pretrain function was promising but somewhat limiting. With SB3 we could provide interfaces to obtain a policy of the right shape given an environment; the user can then take this policy, do their own imitation learning (e.g. supervised learning on some dataset of demonstrations), and load those parameters back into the policy.


araffin commented May 22, 2020

> With SB3 we could provide interfaces to obtain a policy of the right shape given an environment,

This is already the case, no?


Miffyli commented May 22, 2020

> > With SB3 we could provide interfaces to obtain a policy of the right shape given an environment,
>
> This is already the case, no?

Fair point: it is not hidden per se, one just needs to know what to access to obtain this policy. Example code for this in the docs should do the trick :)


skervim commented May 22, 2020

I'm not completely sure I'm following. In the case of behavioral cloning, you two suggest something like the following?

"""
Example code for behavioral cloning
"""
from stable_baselines3 import PPO
import gym

# Initialize environment and agent
env = gym.make("MountainCarContinuous-v0")
ppo = PPO("MlpPolicy", env)

# Extract initial policy
policy = ppo.policy

# Perform behavioral cloning with external code
pretrained_policy = external_supervised_learning(policy, external_dataset)

# Insert pretrained policy back into agent
ppo.policy = pretrained_policy

# Perform training
ppo.learn(total_timesteps=int(1e6))


araffin commented May 22, 2020

> I'm not completely sure I'm following. In the case of behavioral cloning, you two suggest something like the following?

Yes. In practice, because ppo.policy is an object, it can be modified by reference, so policy = ppo.policy and ppo.policy = pretrained_policy could be removed (even though it is cleaner written the way you did).
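For illustration, a minimal sketch of that in-place variant (the behavioral-cloning step itself is left as a placeholder for your own supervised-learning code):

```python
import gym

from stable_baselines3 import PPO

env = gym.make("MountainCarContinuous-v0")
ppo = PPO("MlpPolicy", env)

policy = ppo.policy
# `policy` and `ppo.policy` refer to the same torch.nn.Module, so any
# supervised update applied to `policy` is immediately visible to the agent.
assert policy is ppo.policy

# ... run your own behavioral-cloning loop on `policy` here ...

# No re-assignment needed before continuing with RL training
ppo.learn(total_timesteps=10_000)
```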


skervim commented May 22, 2020

FYI, my use case is that I have a custom environment and would like to pretrain an SB3 PPO agent on an expert dataset that I have created for that environment, in a simple behavioral-cloning fashion. Then I would like to continue training the pretrained agent.

I would gladly provide an example, as suggested by @araffin, but I'm not completely sure what it should look like.

Is @AdamGleave's https://github.com/HumanCompatibleAI/imitation going to support SB3 soon? In that case, should the part:

```python
# Perform behavioral cloning with external code
pretrained_policy = external_supervised_learning(policy, external_dataset)
```

be implemented there and then an example should be created in the SB3 documentation?

Which parts are needed for such an implementation? As far as I can tell:

- Code to create an expert data set by simulating an environment (with some agent/policy) and storing observations and actions
- Code to represent an expert data set, and to provide batches, shuffling etc.
- PyTorch code to perform supervised learning.

Am I missing anything? I would like to contribute back to the repository and try to work on this task; however, I think I need some hints on how to start and could benefit from guidance from those who have already worked on this problem.


araffin commented May 22, 2020

> be implemented there and then an example should be created in the SB3 documentation?

@AdamGleave is busy with the NeurIPS deadline... so better to just create a stand-alone example as a colab notebook here (SB3 branch).

> Code to create an expert data set by simulating an environment (with some agent/policy) and storing observations and actions

Usually people have their own format, but yes, the dataset creation code from SB2 can be reused (it does not depend on TF at all).

> Code to represent an expert data set, and to provide batches, shuffling etc.

Yes, but this will normally be contained in the training loop (the SB2 code can be simplified as we don't support GAIL). I'm not sure we need a class for that in stand-alone code.

> PyTorch code to perform supervised learning.

Your 2nd and 3rd points can be merged into one, I think.
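For concreteness, here is a rough sketch of those pieces combined into a single behavior-cloning script. It is only an illustration (not the notebook's code), assuming a discrete-action environment, the gym API of the time, and SB3's ActorCriticPolicy.evaluate_actions for the log-probabilities:

```python
import gym
import numpy as np
import torch as th
from torch.utils.data import DataLoader, TensorDataset

from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

# 1) Create an expert dataset by simulating the environment with some
#    agent/policy and storing observations and actions.
expert = PPO("MlpPolicy", env).learn(total_timesteps=10_000)
observations, actions = [], []
obs = env.reset()
for _ in range(5_000):
    action, _ = expert.predict(obs, deterministic=True)
    observations.append(obs)
    actions.append(action)
    obs, _, done, _ = env.step(action)
    if done:
        obs = env.reset()

# 2) Represent the dataset and provide shuffled batches
#    (a plain PyTorch TensorDataset, no dedicated class needed).
dataset = TensorDataset(
    th.as_tensor(np.array(observations), dtype=th.float32),
    th.as_tensor(np.array(actions), dtype=th.long),  # discrete actions
)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# 3) Supervised learning: maximize the log-probability of the expert
#    actions under the student's policy (behavior cloning).
student = PPO("MlpPolicy", env)
policy = student.policy
optimizer = th.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(5):
    for obs_batch, act_batch in loader:
        obs_batch = obs_batch.to(policy.device)
        act_batch = act_batch.to(policy.device)
        _, log_prob, _ = policy.evaluate_actions(obs_batch, act_batch)
        loss = -log_prob.mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Continue regular RL training from the pretrained policy.
student.learn(total_timesteps=10_000)
```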


araffin commented May 22, 2020

Last thing, it is not documented yet, but policies can be saved and loaded without a model now ;).

EDIT: model = PPO("MlpPolicy", "MountainCarContinuous-v0") works too
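Putting those two remarks together, a small illustrative sketch (the policy save/load calls were undocumented at the time, so treat the exact API as an assumption):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy

# The environment can be passed as a string, no gym.make() needed
model = PPO("MlpPolicy", "MountainCarContinuous-v0")

# Save only the policy (e.g. after pretraining it) ...
model.policy.save("pretrained_policy.pth")

# ... and load it back later, without the surrounding model
loaded_policy = ActorCriticPolicy.load("pretrained_policy.pth")
```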


skervim commented May 22, 2020

Alright, thanks for the clarifications.
I will try to implement a simple standalone example, and PR it as a colab notebook to the SB3 branch when I have it working!


araffin commented May 26, 2020

@skervim I updated the notebook and added support for discrete actions + SAC/TD3

You can try the notebook online here

We just need to update the documentation and we can close this issue.

@araffin removed the enhancement and help wanted labels on May 26, 2020

skervim commented May 28, 2020

@araffin: Glad that I could contribute, and happy to have learned something new from your improvements to the notebook :)

flint-xf-fan commented

I want to ask something related to this. Instead of generating "expert data" after the teacher has been trained, how do I directly save the trajectory of the teacher during training as the "expert data", and use that data to train my student?


flint-xf-fan commented Aug 15, 2020

> @skervim I updated the notebook and added support for discrete actions + SAC/TD3
>
> You can try the notebook online here
>
> We just need to update the documentation and we can close this issue.

I downloaded the notebook and ran it on an RTX 2070 GPU with CUDA 10.1 on Ubuntu 18.04. The whole notebook works fine, except that the last cell, which evaluates the policy, gives the following error. Any hints?

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
 in
----> 1 mean_reward, std_reward = evaluate_policy(a2c_student, env, n_eval_episodes=10)
      2
      3 print(f"Mean reward = {mean_reward} +/- {std_reward}")

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/evaluation.py in evaluate_policy(model, env, n_eval_episodes, deterministic, render, callback, reward_threshold, return_episode_rewards)
     37         episode_length = 0
     38         while not done:
---> 39             action, state = model.predict(obs, state=state, deterministic=deterministic)
     40             obs, reward, done, _info = env.step(action)
     41             episode_reward += reward

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/base_class.py in predict(self, observation, state, mask, deterministic)
    287             (used in recurrent policies)
    288         """
--> 289         return self.policy.predict(observation, state, mask, deterministic)
    290
    291     @classmethod

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/policies.py in predict(self, observation, state, mask, deterministic)
    155         observation = observation.reshape((-1,) + self.observation_space.shape)
    156
--> 157         observation = th.as_tensor(observation).to(self.device)
    158         with th.no_grad():
    159             actions = self._predict(observation, deterministic=deterministic)

RuntimeError: CUDA error: an illegal memory access was encountered
```


Miffyli commented Aug 15, 2020

> I want to ask something related to this. Instead of generating "expert data" after the teacher has been trained, how do I directly save the trajectory of the teacher during training as the "expert data", and use that data to train my student?

Easiest way to do this would be to save states and actions in the environment, e.g. some kind of wrapper that keeps track of states and actions and saves them into a file once done is encountered.
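A minimal sketch of such a wrapper (names and the save format are illustrative, using the gym API of the time):

```python
import gym
import numpy as np


class ExpertRecorderWrapper(gym.Wrapper):
    """Record (observation, action) pairs and dump them to disk at episode end."""

    def __init__(self, env, save_path="expert_data.npz"):
        super().__init__(env)
        self.save_path = save_path
        self.observations = []
        self.actions = []
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        # Store the observation the agent acted on, together with its action
        self.observations.append(self._last_obs)
        self.actions.append(action)
        obs, reward, done, info = self.env.step(action)
        self._last_obs = obs
        if done:
            np.savez(
                self.save_path,
                observations=np.array(self.observations),
                actions=np.array(self.actions),
            )
        return obs, reward, done, info


# Usage: wrap the training env, then train the teacher as usual, e.g.
# env = ExpertRecorderWrapper(gym.make("CartPole-v1"))
# teacher = PPO("MlpPolicy", env).learn(total_timesteps=50_000)
```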

> I downloaded the notebook and ran it on an RTX 2070 GPU with CUDA 10.1 on Ubuntu 18.04. The whole notebook works fine, except that the last cell, which evaluates the policy, gives the following error. Any hints?

I have no idea what could cause that, sorry :/

flint-xf-fan commented

> Easiest way to do this would be to save states and actions in the environment, e.g. some kind of wrapper that keeps track of states and actions and saves them into a file once done is encountered.

Thanks.

> I have no idea what could cause that, sorry :/

Ah, no problem. It seems to be on PyTorch's side.
