I am struggling to fix an issue with the `ReverbAddTrajectoryObserver`. I am constantly getting this warning:

```
[reverb/cc/trajectory_writer.cc:655] The number of pending items is alarmingly high, did you forget to call Flush? 130 items are waiting to be sent and 8 items have been sent to the server but haven't been confirmed yet. It is important to call Flush regularly as large numbers of pending items can result in OOM crashes on both client and server.
```
The code I am using is very similar to the schulman17 PPO example in `tf_agents/examples/ppo/schulman17/train_eval_lib.py`:
```python
import reverb
from tf_agents.replay_buffers import reverb_replay_buffer, reverb_utils
from tf_agents.train import actor

replay_buffer_capacity = 1000000
replay_seq_len = 100
batch_size = 128

reverb_server = reverb.Server(
    [
        reverb.Table(  # Replay buffer for training experience.
            name='training_table',
            sampler=reverb.selectors.Uniform(),
            remover=reverb.selectors.Fifo(),
            rate_limiter=reverb.rate_limiters.MinSize(1),
            max_times_sampled=1,
            max_size=replay_buffer_capacity,
        ),
        reverb.Table(  # Replay buffer for observation normalization.
            name='normalization_table',
            sampler=reverb.selectors.Uniform(),
            remover=reverb.selectors.Fifo(),
            rate_limiter=reverb.rate_limiters.MinSize(1),
            max_times_sampled=1,
            max_size=replay_buffer_capacity,
        ),
    ]
)

reverb_replay_train = reverb_replay_buffer.ReverbReplayBuffer(
    tf_agent.collect_data_spec,
    sequence_length=replay_seq_len,
    table_name='training_table',
    server_address='localhost:{}'.format(reverb_server.port),
    # The only collected sequence is used to populate the batches.
    dataset_buffer_size=10 * batch_size,
    max_cycle_length=1,
    rate_limiter_timeout_ms=1000,
    num_workers_per_iterator=6,
)

reverb_replay_normalization = reverb_replay_buffer.ReverbReplayBuffer(
    tf_agent.collect_data_spec,
    sequence_length=replay_seq_len,
    table_name='normalization_table',
    server_address='localhost:{}'.format(reverb_server.port),
    # The only collected sequence is used to populate the batches.
    dataset_buffer_size=10 * batch_size,
    max_cycle_length=1,
    rate_limiter_timeout_ms=1000,
    num_workers_per_iterator=6,
)

rb_observer = reverb_utils.ReverbAddTrajectoryObserver(
    reverb_replay_train.py_client,
    ['training_table', 'normalization_table'],
    sequence_length=replay_seq_len,
)

# [Omitted code]

collect_actor = actor.Actor(
    train_py_env,
    collect_policy,
    train_step,
    steps_per_run=2 * batch_size * replay_seq_len,
    observers=[rb_observer],
    metrics=actor.collect_metrics(buffer_size=10),
)
```
From my understanding, the problem is that the Reverb server doesn't keep up with the rate at which items are added to it, so I need to flush `rb_observer` regularly. I added `self.flush()` in the `__call__` method of `ReverbAddTrajectoryObserver`, but that makes the code 3x slower. Is there another way of fixing this? I am running all the code on a large server, so memory and CPU cores are not an issue; I just hate that my error files grow to gigabytes because of the above message. Also, the number of waiting items is always below 250, so I tried bumping the warning threshold up, but it looks like I can't, because `trajectory_writer.cc` is a C++ source file that gets compiled when the library is installed.
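As a middle ground between flushing on every call (3x slower) and never flushing (the warning spam), one thing I'm considering is a wrapper that flushes only every N trajectories. This is a hypothetical sketch, not part of the TF-Agents API: the `PeriodicFlushObserver` class and the `flush_every` value are mine, and it only assumes the wrapped observer is callable and has a `flush()` method, which `ReverbAddTrajectoryObserver` does.

```python
class PeriodicFlushObserver:
    """Wraps a trajectory observer and flushes it every `flush_every` calls.

    Hypothetical helper: amortizes the cost of flush() so pending items
    stay bounded without paying the flush latency on every trajectory.
    """

    def __init__(self, observer, flush_every=64):
        self._observer = observer
        self._flush_every = flush_every
        self._num_calls = 0

    def __call__(self, trajectory):
        # Forward the trajectory to the wrapped observer as usual.
        self._observer(trajectory)
        self._num_calls += 1
        # Flush only periodically, keeping pending items below ~flush_every.
        if self._num_calls % self._flush_every == 0:
            self._observer.flush()

    def flush(self):
        # Allow callers to force a flush, e.g. at the end of a collect run.
        self._observer.flush()
```

Then `observers=[PeriodicFlushObserver(rb_observer, flush_every=64)]` would replace `observers=[rb_observer]` in the `Actor`, with `flush_every` tuned so the pending-item count stays under the warning threshold.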
Thanks!