RTBGym: A configurative reinforcement learning environment for real-time bidding research




RTBGym is an open-source simulation platform for real-time bidding (RTB) in display advertising, written in Python. The simulator is designed for reinforcement learning algorithms and follows an OpenAI Gym and Gymnasium-like interface. RTBGym is a configurative environment: researchers and practitioners can customize its environmental modules, including WinningPriceDistribution, ClickThroughRate, and ConversionRate.

Note that RTBGym is published as part of the SCOPE-RL repository, which facilitates the implementation of the offline reinforcement learning procedure.

Basic Setting

In RTB, the objective of the RL agent is to maximize a KPI (the number of clicks or conversions) within an episode under a given budget constraint.
A common way to achieve this goal is to adjust a parameter $\alpha$ that controls the bid price as follows.

$bid_{t,i} = \alpha \cdot r^{\ast}$,

where $bid_{t,i}$ is the bid price for the $i$-th auction at timestep $t$ and $r^{\ast}$ denotes the predicted or expected reward (KPI) of that auction.
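
As a minimal sketch of the bidding rule above (not RTBGym's internal code), the bid for each auction is the adjust rate multiplied by the predicted reward. The helper name `compute_bids` and the example values are hypothetical and only serve as illustration.

```python
import numpy as np


def compute_bids(alpha: float, predicted_rewards: np.ndarray) -> np.ndarray:
    """Return bid prices alpha * r* for a batch of auctions (hypothetical helper)."""
    return alpha * np.asarray(predicted_rewards, dtype=float)


# predicted reward r* (e.g., predicted click probability) for three auctions
predicted_rewards = np.array([0.02, 0.10, 0.05])

# a larger adjust rate alpha bids more aggressively on every auction
bids = compute_bids(alpha=100.0, predicted_rewards=predicted_rewards)
print(bids)
```

A larger $\alpha$ raises all bids proportionally, which wins more auctions but consumes the budget faster. This trade-off is what the RL agent controls.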

We often formulate this RTB problem as the following Constrained Markov Decision Process (CMDP):

  • timestep: One episode (a day or a week) consists of several timesteps (24 hours or seven days, for instance).
  • state: We observe some feedback from the environment at each timestep, which includes the following.
    • timestep
    • remaining budget
    • impression-level features (budget consumption rate, cost per mille of impressions, auction winning rate, reward) at the previous timestep
    • adjust rate (the RL agent's decision) at the previous timestep
  • action: The agent chooses the adjust rate parameter $\alpha$ to maximize KPIs.
  • reward: Total number of clicks or conversions obtained during the timestep.
  • constraints: The pre-determined episodic budget should not be exceeded.

The goal of RTB is to maximize the expected trajectory-wise reward under the budget constraint.
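
One common way to write this objective is sketched below. The notation is an assumption introduced for illustration and is not defined in the text above: $\pi$ is the agent's policy, $r_t$ and $c_t$ are the reward and the cost incurred at timestep $t$, $T$ is the number of timesteps in an episode, and $C$ is the episodic budget.

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=1}^{T} r_t \right] \quad \text{s.t.} \quad \sum_{t=1}^{T} c_t \le C$$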


RTBGym provides two standardized RTB environments.

  • "RTBEnv-discrete-v0": Standard RTB environment with discrete action space.
  • "RTBEnv-continuous-v0": Standard RTB environment with continuous action space.

RTBGym implements the following two environment classes.

  • RTBEnv: The basic configurative environment with continuous action space.
  • CustomizedRTBEnv: The customized environment given action space and reward predictor.

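RTBGym is configurative with respect to the following three modules.

  • WinningPriceDistribution: Winning price distribution of auctions.
  • ClickThroughRate: Click through rate (i.e., click / impression).
  • ConversionRate: Conversion rate (i.e., conversion / click).

Note that users can customize these modules by following the corresponding abstract classes.
The bidding function is defined in the Bidder class and the auction simulation in the Simulator class.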
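
To make the roles of the bid price, winning price, and budget concrete, here is a toy sketch of one timestep of auctions. It is not RTBGym's Simulator: the function `simulate_auctions`, the per-impression cost (corresponding to a cost indicator of "impression"), and the rule of stopping once the next payment would exceed the budget are all assumptions made for illustration.

```python
import numpy as np


def simulate_auctions(bid_prices, winning_prices, click_probs, budget, rng):
    """Toy second-price auction step (hypothetical helper, not RTBGym's Simulator).

    An auction is won when the bid exceeds the winning price, and the winner
    pays the winning price per impression. Auctions are processed in order and
    bidding stops once the next payment would exceed the remaining budget.
    """
    n_clicks, cost = 0, 0.0
    for bid, price, ctr in zip(bid_prices, winning_prices, click_probs):
        if bid <= price:
            continue  # auction lost: no impression, no cost
        if cost + price > budget:
            break  # budget constraint: stop before overspending
        cost += price
        n_clicks += int(rng.random() < ctr)  # click occurs with probability ctr
    return n_clicks, cost


rng = np.random.default_rng(0)
bid_prices = np.array([60.0, 40.0, 80.0, 55.0])
winning_prices = np.array([50.0, 45.0, 70.0, 50.0])
click_probs = np.array([0.1, 0.2, 0.3, 0.1])

n_clicks, cost = simulate_auctions(
    bid_prices, winning_prices, click_probs, budget=130.0, rng=rng
)
print(n_clicks, cost)
```

In this example the first and third auctions are won (total cost 120), the second is lost, and the fourth is skipped because paying for it would exceed the budget of 130.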


RTBGym can be installed as a part of SCOPE-RL using Python's package manager pip.

pip install scope-rl

You can also install it from the source.

git clone
cd scope-rl
python install


We provide an example usage of the standard and customized environment.
The online/offline RL and Off-Policy Evaluation examples are provided in SCOPE-RL's README.

Standard RTBEnv

Our standard RTBEnv is available from gym.make(), following the OpenAI Gym and Gymnasium-like interface.

# import rtbgym and gym
import rtbgym
import gym

# (1) standard environment for discrete action space
env = gym.make('RTBEnv-discrete-v0')

# (2) standard environment for continuous action space
env_ = gym.make('RTBEnv-continuous-v0')

The basic interaction takes only a few lines of code, as follows.

obs, info = env.reset()
done = False
while not done:
    action = agent.act(obs)
    obs, reward, done, truncated, info = env.step(action)
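
The loop above relies only on the reset/step contract, so it can be tried without installing anything. The sketch below uses a hypothetical stand-in environment (`ToyBudgetEnv`) and agent (`RandomAgent`); neither is part of RTBGym, and the toy dynamics are made up purely to show the shape of the returned values.

```python
import numpy as np


class ToyBudgetEnv:
    """Hypothetical stand-in with a Gymnasium-style reset/step signature."""

    def __init__(self, n_steps=7, budget=100.0, seed=0):
        self.n_steps, self.initial_budget = n_steps, budget
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.budget = 0, self.initial_budget
        return np.array([self.t, self.budget]), {}

    def step(self, action):
        # spending grows with the adjust rate but never exceeds the budget
        cost = min(self.budget, 5.0 * float(action))
        reward = int(self.rng.poisson(cost / 5.0))
        self.t += 1
        self.budget -= cost
        done = self.t >= self.n_steps
        return np.array([self.t, self.budget]), reward, done, False, {}


class RandomAgent:
    """Hypothetical agent that samples the adjust rate uniformly."""

    def __init__(self, low=0.1, high=10.0, seed=0):
        self.low, self.high = low, high
        self.rng = np.random.default_rng(seed)

    def act(self, obs):
        return self.rng.uniform(self.low, self.high)


toy_env, toy_agent = ToyBudgetEnv(), RandomAgent()
obs, info = toy_env.reset()
done, total_reward = False, 0
while not done:
    action = toy_agent.act(obs)
    obs, reward, done, truncated, info = toy_env.step(action)
    total_reward += reward
print(total_reward, obs)
```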

Let's visualize the case with the uniform random policy (in a continuous action case). The discrete case also works in a similar manner.

# import from other libraries
from scope_rl.policy import OnlineHead
from d3rlpy.algos import RandomPolicy as ContinuousRandomPolicy
from d3rlpy.preprocessing import MinMaxActionScaler
import matplotlib.pyplot as plt

# define a random agent (for continuous action)
agent = OnlineHead(
    ContinuousRandomPolicy(
        action_scaler=MinMaxActionScaler(
            minimum=0.1,  # minimum value that policy can take
            maximum=10,  # maximum value that policy can take
        )
    ),
    name="random",
)

# (3) basic interaction for continuous action case
# (env_ is the continuous-action environment created in (2))
obs, info = env_.reset()
done = False
# logs
remaining_budget = [obs[1]]
cumulative_reward = [0]

while not done:
    action = agent.predict_online(obs)
    obs, reward, done, truncated, info = env_.step(action)
    # logs
    remaining_budget.append(obs[1])
    cumulative_reward.append(cumulative_reward[-1] + reward)

# visualize the result
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(remaining_budget[:-1], label='remaining budget')
ax2 = ax1.twinx()
ax2.plot(cumulative_reward[:-1], label='cumulative reward', color='tab:orange')
ax1.set_ylabel('remaining budget')
ax1.set_ylim(0, env.initial_budget + 100)
ax2.set_ylabel('reward (conversion)')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()

The Transition of the Remaining Budget and Cumulative Reward during a Single Episode

Note that while we use SCOPE-RL and d3rlpy here, RTBGym is compatible with any other library that works with an OpenAI Gym and Gymnasium-like interface.

Customized RTBEnv

Next, we describe how to customize the environment by instantiating RTBEnv with your own configuration.

List of environmental configurations:
  • objective: Objective KPI of RTB, which is either "click" or "conversion".
  • cost_indicator: Event at which costs are incurred, which is one of "impression", "click", and "conversion".
  • step_per_episode: Number of timesteps in an episode.
  • initial_budget: Initial budget (i.e., constraint) for an episode.
  • n_ads: Number of ads used for auction bidding.
  • n_users: Number of users used for auction bidding.
  • ad_feature_dim: Dimension of the ad feature vectors.
  • user_feature_dim: Dimension of the user feature vectors.
  • ad_feature_vector: Feature vectors that characterize each ad.
  • user_feature_vector: Feature vectors that characterize each user.
  • ad_sampling_rate: Sampling probabilities to determine which ad (id) is used in each auction.
  • user_sampling_rate: Sampling probabilities to determine which user (id) is used in each auction.
  • WinningPriceDistribution: Winning price distribution of auctions.
  • ClickThroughRate: Click through rate (i.e., click / impression).
  • ConversionRate: Conversion rate (i.e., conversion / click).
  • standard_bid_price_distribution: Distribution of the bid price whose average impression probability is expected to be 0.5.
  • minimum_standard_bid_price: Minimum value for standard bid price.
  • search_volume_distribution: Search volume distribution for each timestep.
  • minimum_search_volume: Minimum search volume at each timestep.
  • random_state: Random state.

from rtbgym import RTBEnv
env = RTBEnv(
    objective="click",  # maximize the total number of clicks
    cost_indicator="click",  # cost arises every time a click occurs
    step_per_episode=14,  # 14 days as an episode
    initial_budget=5000,  # budget available for 14 days is 5000
)
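
The sketch below illustrates how the feature vectors and sampling rates in the list above relate to each other. It is an assumption-laden illustration in plain NumPy rather than RTBGym's internal logic: it assumes one feature vector per ad or user, one sampling probability per id, and a fixed search volume of 1000 auctions.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_ads, n_users, ad_feature_dim, user_feature_dim = 4, 3, 5, 2

# one feature vector per ad / user
ad_feature_vector = rng.normal(size=(n_ads, ad_feature_dim))
user_feature_vector = rng.normal(size=(n_users, user_feature_dim))

# sampling probabilities over ids; each must sum to one
ad_sampling_rate = np.array([0.4, 0.3, 0.2, 0.1])
user_sampling_rate = np.full(n_users, 1 / n_users)

# draw the (ad, user) pair for each auction in a timestep
search_volume = 1000
ad_ids = rng.choice(n_ads, size=search_volume, p=ad_sampling_rate)
user_ids = rng.choice(n_users, size=search_volume, p=user_sampling_rate)

# look up the features that would enter the CTR/CVR and winning-price models
ad_features = ad_feature_vector[ad_ids]
user_features = user_feature_vector[user_ids]
print(ad_features.shape, user_features.shape)
```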

Specifically, users can define their own WinningPriceDistribution, ClickThroughRate, and ConversionRate as follows.

Example of Custom Winning Price Distribution

# import RTBGym modules
from rtbgym import BaseWinningPriceDistribution
from rtbgym.utils import NormalDistribution
# import other necessary modules
from dataclasses import dataclass
from typing import Optional, Union, Tuple
import numpy as np
from sklearn.utils import check_random_state

@dataclass
class CustomizedWinningPriceDistribution(BaseWinningPriceDistribution):
    n_ads: int
    n_users: int
    ad_feature_dim: int
    user_feature_dim: int
    step_per_episode: int
    standard_bid_price_distribution: NormalDistribution = NormalDistribution(
        mean=50,  # example value
        std=5,  # example value
    )
    minimum_standard_bid_price: Optional[Union[int, float]] = None
    random_state: Optional[int] = None

    def __post_init__(self):
        self.random_ = check_random_state(self.random_state)

    def sample_outcome(
        self,
        bid_prices: np.ndarray,
        **kwargs,  # other auction information is not used in this example
    ) -> Tuple[np.ndarray]:
        """Stochastically determine impression and second price for each auction."""
        # sample winning price from simple normal distribution
        winning_prices = self.random_.normal(
            loc=self.standard_bid_price,
            scale=self.standard_bid_price / 5,
            size=bid_prices.shape,
        )
        # an impression occurs when the bid exceeds the winning price
        impressions = winning_prices < bid_prices
        return impressions.astype(int), winning_prices.astype(int)

    @property
    def standard_bid_price(self):
        return self.standard_bid_price_distribution.mean

Example of Custom ClickThroughRate (and Conversion Rate)

from dataclasses import dataclass
from typing import Optional, Union
import numpy as np
from sklearn.utils import check_random_state
from rtbgym import BaseClickAndConversionRate
from rtbgym.utils import sigmoid

@dataclass
class CustomizedClickThroughRate(BaseClickAndConversionRate):
    n_ads: int
    n_users: int
    ad_feature_dim: int
    user_feature_dim: int
    step_per_episode: int
    random_state: Optional[int] = None

    def __post_init__(self):
        self.random_ = check_random_state(self.random_state)
        # random projections of ad and user features to a 10-dim latent space
        self.ad_coef = self.random_.normal(
            size=(self.ad_feature_dim, 10),
        )
        self.user_coef = self.random_.normal(
            size=(self.user_feature_dim, 10),
        )

    def calc_prob(
        self,
        ad_ids: np.ndarray,
        user_ids: np.ndarray,
        ad_feature_vector: np.ndarray,
        user_feature_vector: np.ndarray,
        timestep: Union[int, np.ndarray],
    ) -> np.ndarray:
        """Calculate CTR (i.e., click per impression)."""
        ad_latent = ad_feature_vector @ self.ad_coef
        user_latent = user_feature_vector @ self.user_coef
        ctrs = sigmoid((ad_latent * user_latent).mean(axis=1))
        return ctrs

    def sample_outcome(
        self,
        ad_ids: np.ndarray,
        user_ids: np.ndarray,
        ad_feature_vector: np.ndarray,
        user_feature_vector: np.ndarray,
        timestep: Union[int, np.ndarray],
    ) -> np.ndarray:
        """Stochastically determine whether click occurs in impression=True case."""
        ctrs = self.calc_prob(
            ad_ids=ad_ids,
            user_ids=user_ids,
            ad_feature_vector=ad_feature_vector,
            user_feature_vector=user_feature_vector,
            timestep=timestep,
        )
        clicks = self.random_.rand(len(ad_ids)) < ctrs
        return clicks.astype(int)

Note that the custom conversion rate can be defined in a similar manner.
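
To show the same latent-factor computation for conversions without depending on RTBGym, here is a standalone NumPy sketch. The variable names, the dimensions, and the masking of conversions by clicks (following the definition conversion / click) are assumptions for illustration, not the library's implementation.

```python
import numpy as np


def sigmoid(x):
    """Logistic function mapping scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
n_auctions, ad_feature_dim, user_feature_dim, latent_dim = 6, 5, 3, 10

ad_coef = rng.normal(size=(ad_feature_dim, latent_dim))
user_coef = rng.normal(size=(user_feature_dim, latent_dim))
ad_feature_vector = rng.normal(size=(n_auctions, ad_feature_dim))
user_feature_vector = rng.normal(size=(n_auctions, user_feature_dim))

# project features to a shared latent space and score their interaction
ad_latent = ad_feature_vector @ ad_coef
user_latent = user_feature_vector @ user_coef
cvrs = sigmoid((ad_latent * user_latent).mean(axis=1))

# conversions can only occur for auctions that were clicked
clicks = np.array([1, 0, 1, 1, 0, 0])
conversions = (rng.random(n_auctions) < cvrs).astype(int) * clicks
print(cvrs.shape, conversions)
```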

Wrapper class for custom bidding setup

To customize the bidding setup, we also provide CustomizedRTBEnv.

CustomizedRTBEnv enables discretization or re-definition of the action space. In addition, users can set their own reward_predictor.

List of arguments:
  • original_env: Original RTB Environment.
  • reward_predictor: A machine learning model to predict the reward to determine the bidding price.
  • scaler: Scaling factor (constant value) used for bid price determination. (None for the auto-fitting)
  • action_min: Minimum value of adjust rate.
  • action_max: Maximum value of adjust rate.
  • action_type: Action type of the RL agent, which is either "discrete" or "continuous".
  • n_actions: Number of "discrete" actions.
  • action_meaning: Mapping function of agent action index to the actual "discrete" action to take.

from rtbgym import CustomizedRTBEnv
custom_env = CustomizedRTBEnv(
    original_env=env,  # the RTBEnv instance to wrap
    reward_predictor=None,  # use ground-truth (expected) reward as a reward predictor (oracle)
)
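
As an illustration of what an action_meaning mapping could look like, the sketch below discretizes the adjust rate range into a log-spaced grid. The log spacing and the helper `to_adjust_rate` are assumptions chosen for this example; the mapping RTBGym uses by default is not stated here.

```python
import numpy as np

action_min, action_max, n_actions = 0.1, 10.0, 10

# log-spaced grid: equal multiplicative steps between neighboring adjust rates
action_meaning = np.logspace(np.log10(action_min), np.log10(action_max), n_actions)


def to_adjust_rate(action_index: int) -> float:
    """Map a discrete action index to an adjust rate (hypothetical helper)."""
    return float(action_meaning[action_index])


print(to_adjust_rate(0), to_adjust_rate(n_actions - 1))
```

A log-spaced grid is one reasonable choice because the adjust rate acts multiplicatively on the bid price, so equal ratios between neighboring actions give a comparable change in bidding aggressiveness.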

More examples are available at quickstart/rtb/rtb_synthetic_customize_env.ipynb.
The statistics of the environment are also visualized at quickstart/rtb/rtb_synthetic_data_collection.ipynb.

Finally, example usages for online/offline RL and OPE/OPS studies are available at quickstart/rtb/rtb_synthetic_discrete_basic.ipynb (discrete action space) and quickstart/rtb/rtb_synthetic_continuous_basic.ipynb (continuous action space).


If you use our software in your work, please cite our paper:

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation


@article{kiyohara2023scope,
  author = {Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
  title = {SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
  journal = {arXiv preprint arXiv:2311.18206},
  year = {2023},
}

Any contributions to RTBGym are more than welcome! Please refer to the project's contribution guidelines for how to contribute to the project.


This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Project Team

  • Haruka Kiyohara (Main Contributor; Cornell University)
  • Ren Kishimoto (Tokyo Institute of Technology)
  • Kosuke Kawakami (HAKUHODO Technologies Inc.)
  • Ken Kobayashi (Tokyo Institute of Technology)
  • Kazuhide Nakata (Tokyo Institute of Technology)
  • Yuta Saito (Cornell University)


For any questions about the paper and software, feel free to contact:


Papers
  1. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

  2. Takuma Seno and Michita Imai. d3rlpy: An Offline Deep Reinforcement Learning Library. arXiv preprint arXiv:2111.03788, 2021.

  3. Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 1443-1451, 2018.

  4. Jun Zhao, Guang Qiu, Ziyu Guan, Wei Zhao, and Xiaofei He. Deep Reinforcement Learning for Sponsored Search Real-time Bidding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1021-1030, 2018.

  5. Wen-Yuan Zhu, Wen-Yueh Shih, Ying-Hsuan Lee, Wen-Chih Peng, and Jiun-Long Huang. A Gamma-based Regression for Winning Price Estimation in Real-Time Bidding Advertising. In IEEE International Conference on Big Data, 1610-1619, 2017.

Projects

This project is inspired by the following four packages.

  • AuctionGym -- an RL environment for online advertising auctions
  • RecoGym -- an RL environment for recommender systems
  • RecSim -- a configurative RL environment for recommender systems
  • FinRL -- an RL environment for finance