when I use nvme to offload param and optimizer I meet a bug[BUG] #3376

Closed
etoilestar opened this issue Apr 25, 2023 · 21 comments

@etoilestar

python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank6/139649992513552.tensor.swp: buffer nbytes != file bytes 4001366016 != 3426746368
python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank2/139929715382368.tensor.swp: buffer nbytes != file bytes 4001366016 != 3599761408
python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank1/140296723433568.tensor.swp: buffer nbytes != file bytes 4001366016 != 3539992576
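
To see how far each swap file falls short of the buffer that tries to read it, the mismatch lines can be pulled out of the log with a small script. This is only a sketch for triage, not part of DeepSpeed: the regular expression and the helper name `parse_mismatches` are my own, and it assumes each mismatch message sits on a single line.

```python
import re

# Matches the "buffer nbytes != file bytes" lines that precede the assertion.
# The pattern is an assumption based on the messages quoted in this issue.
MISMATCH_RE = re.compile(
    r"(?P<path>\S+\.tensor\.swp): buffer nbytes != file bytes "
    r"(?P<buffer>\d+) != (?P<file>\d+)"
)


def parse_mismatches(log_text):
    """Return (path, buffer_bytes, file_bytes, shortfall_bytes) per mismatch."""
    rows = []
    for match in MISMATCH_RE.finditer(log_text):
        buffer_bytes = int(match["buffer"])
        file_bytes = int(match["file"])
        rows.append((match["path"], buffer_bytes, file_bytes, buffer_bytes - file_bytes))
    return rows


log = """
/nvme/zero_stage_3/optimizer/rank6/139649992513552.tensor.swp: buffer nbytes != file bytes 4001366016 != 3426746368
/nvme/zero_stage_3/optimizer/rank2/139929715382368.tensor.swp: buffer nbytes != file bytes 4001366016 != 3599761408
/nvme/zero_stage_3/optimizer/rank1/140296723433568.tensor.swp: buffer nbytes != file bytes 4001366016 != 3539992576
"""

mismatches = parse_mismatches(log)
for path, buffer_bytes, file_bytes, shortfall in mismatches:
    print(f"{path}: file is {shortfall} bytes smaller than the buffer")
```

In all three messages the buffer size is the same (4001366016 bytes) while the file sizes differ per rank, so the files on disk are smaller than the buffers used to read them.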

@etoilestar etoilestar added bug Something isn't working training labels Apr 25, 2023
@tjruwase
Contributor

@etoilestar, can you share more details to repro this issue? In the meantime, can you confirm that /nvme/zero_stage_3/ is empty before running?
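
One way to check this before each run is to list any leftover `.swp` files under the swap folder. The following is a minimal sketch, assuming the swap files keep the `.tensor.swp` suffix seen in the error messages; `leftover_swap_files` is a hypothetical helper, not a DeepSpeed function, and the demo runs on a temporary directory rather than the real /nvme/zero_stage_3/.

```python
import tempfile
from pathlib import Path


def leftover_swap_files(swap_root):
    """Return all *.swp files found under swap_root (empty list if none)."""
    root = Path(swap_root)
    if not root.exists():
        return []
    return sorted(p for p in root.rglob("*.swp") if p.is_file())


# Demo on a throwaway directory; in practice pass the real swap folder.
with tempfile.TemporaryDirectory() as tmp:
    before = leftover_swap_files(tmp)

    # Simulate a stale swap file left behind by an earlier run.
    stale = Path(tmp) / "optimizer" / "rank0" / "0.tensor.swp"
    stale.parent.mkdir(parents=True)
    stale.write_bytes(b"\0" * 16)

    after = leftover_swap_files(tmp)

print(f"before: {len(before)} file(s), after: {len(after)} file(s)")
```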

@tjruwase tjruwase self-assigned this Apr 25, 2023
@etoilestar
Author

Yes, I emptied this folder before I ran the code. What kind of information should I provide? Can you give me a hint?

@etoilestar
Author

I am using eight 3090 GPUs, and the code I am running is deepspeed_megatron to train GPT-3. When I increase buffer_count, this error disappears, but the run then freezes during preprocessing.
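
For context, `buffer_count` is set in the NVMe offload sections of the DeepSpeed config. The sketch below builds such a fragment in Python and prints it as JSON. It is not the ds_config.json used in this issue, which was not posted: the `buffer_count` values are placeholders, the `/nvme` path is inferred from the swap file names in the error, and the `aio` values mirror the aio_config line that DeepSpeed prints in the log later in this thread. Other required settings (batch size, fp16, and so on) are left out.

```python
import json

# Illustrative fragment only; values marked "placeholder" are assumptions.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/nvme",  # inferred from /nvme/zero_stage_3/... in the error
            "buffer_count": 4,     # placeholder; the setting that was increased
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/nvme",  # inferred, as above
            "buffer_count": 5,     # placeholder
        },
    },
    # Mirrors the aio_config printed by AsyncPartitionedParameterSwapper in the log.
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```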

@ReyRen

ReyRen commented Apr 26, 2023

Describe the bug

Hi @tjruwase, thanks a lot for joining us. I have exactly the same problem as @etoilestar. Let me give some more details.
I noticed that the "buffer nbytes != file xxx" error was already patched in #2002, and I am using the latest version of DeepSpeed, but the problem still occurs.

To Reproduce

git clone https://github.com/microsoft/Megatron-DeepSpeed/
cd Megatron-DeepSpeed /examples/run_deepspeed_exam
# I have already attached the modified run_deepspeed_example.sh
/bin/bash modifed run_deepspeed_example.sh

Then the "buffer nbytes != file xxx" error occurs.

System info (please complete the following information):

  • python 3.8.10
  • system: ubuntu:18.04
  • images : nvcr.io/nvidia/pytorch:22.12-py3
  • deepspeed: pip install deepspeed (0.9.1)
  • NVME: already formatted with ext4(not empty) on host and mounted into container

Thanks!
script.zip

@tjruwase
Contributor

tjruwase commented May 2, 2023

@ReyRen, could you please share your log as well?

@tjruwase
Contributor

tjruwase commented May 3, 2023

@etoilestar and @ReyRen, I am trying to repro this issue. I am using a 4xV100-16GB machine, which is probably different from your setups.
Can you please share your stack trace as well? Thanks!

@etoilestar
Author

Hello, thanks for your reply. He gets the same log as I do. Here is my log:

[2023-05-04 01:51:44,732] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-04 01:51:45,824] [INFO] [runner.py:540:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 96 --hidden-size 3072 --num-attention-heads 96 --seq-length 2048 --loss-scale 12 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 8 --train-iters 1000 --lr 6.0e-5 --min-lr 6.0e-6 --lr-decay-style cosine --log-interval 1 --eval-iters 40 --eval-interval 1000 --data-path ../dataset/my-gpt2_text_document --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt --save-interval 1000 --split 98,2,0 --clip-grad 1.0 --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.95 --init-method-std 0.006 --fp16 --checkpoint-activations --tensorboard-dir ds_z3_nl96_hs3072_gb8_mb1 --cpu-optimizer --deepspeed-activation-checkpointing --zero-stage=3 --deepspeed_config=ds_config.json --no-pipeline-parallel --deepspeed --exit-interval 5000
[2023-05-04 01:51:48,924] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.15.5
[2023-05-04 01:51:48,924] [INFO] [launch.py:222:main] 0 NCCL_DEBUG=warn
[2023-05-04 01:51:48,924] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-04 01:51:48,924] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-04 01:51:48,924] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-04 01:51:48,924] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-04 01:51:48,924] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.14.0a0+410ce96
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
aml_data_download_path .......................... None
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
checkpoint_activations .......................... True
checkpoint_in_cpu ............................... False
checkpoint_num_layers ........................... 1
clip_grad ....................................... 1.0
compression_training ............................ False
consumed_train_samples .......................... 0
consumed_train_tokens ........................... 0
consumed_valid_samples .......................... 0
contigious_checkpointing ........................ False
cpu_optimizer ................................... True
cpu_torch_adam .................................. False
create_moe_param_group .......................... False
curriculum_learning_legacy ...................... False
custom_token_counting ........................... False
data_efficiency_curriculum_learning ............. False
data_impl ....................................... infer
data_parallel_size .............................. 8
data_path ....................................... ['../dataset/my-gpt2_text_document']
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
deepscale ....................................... False
deepscale_config ................................ None
deepspeed ....................................... True
deepspeed_activation_checkpointing .............. True
deepspeed_config ................................ ds_config.json
deepspeed_mpi ................................... False
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
ds_inference .................................... False
ds_pipeline_enabled ............................. False
embedding_path .................................. None
enable_expert_tensor_parallelism ................ False
encoder_seq_length .............................. 2048
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 40
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... 5000
expert_interval ................................. 2
ffn_hidden_size ................................. 12288
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 8
hidden_dropout .................................. 0.1
hidden_size ..................................... 3072
hidden_size_teacher ............................. None
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_dim ......................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference ....................................... False
init_method_std ................................. 0.006
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
kd .............................................. False
kd_alpha_ce ..................................... 1
kd_beta_ce ...................................... 1
kd_temp ......................................... 1.0
kv_channels ..................................... 32
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ None
load_teacher .................................... None
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_num_zeros_in_grad ........................... False
log_optimizer_states_to_tensorboard ............. False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
loss_scale ...................................... 12.0
loss_scale_window ............................... 1000
lr .............................................. 6e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_decay_tokens ................................. None
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
lr_warmup_tokens ................................ None
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 2048
memory_centric_tiled_linear ..................... False
merge_file ...................................... gpt2-merges.txt
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 6e-06
mlp_type ........................................ standard
mmap_warmup ..................................... False
moe_eval_capacity_factor ........................ 1.0
moe_expert_parallel_size ........................ 1
moe_loss_coeff .................................. 0.1
moe_min_capacity ................................ 4
moe_token_dropping .............................. True
moe_train_capacity_factor ....................... 1.0
mos ............................................. False
no_load_lr_state ................................ False
no_load_optim ................................... None
no_load_rng ..................................... None
no_pipeline_parallel ............................ True
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 96
num_attention_heads_teacher ..................... None
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... [1]
num_experts_teacher ............................. [1]
num_layers ...................................... 96
num_layers_per_virtual_pipeline_stage ........... None
num_layers_teacher .............................. None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_lr_scheduler ........................... False
params_dtype .................................... torch.float16
partition_activations ........................... False
patch_dim ....................................... 16
pipeline_model_parallel_size .................... 1
profile_backward ................................ False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
random_ltd ...................................... False
rank ............................................ 0
remote_device ................................... none
reset_attention_mask ............................ False
reset_iteration ................................. False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
return_data_index ............................... False
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... 1000
scatter_gather_tensors_in_pipeline .............. True
scattered_embeddings ............................ False
seed ............................................ 1234
seq_length ...................................... 2048
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 98,2,0
split_transformers .............................. False
synchronize_each_layer .......................... False
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. ds_z3_nl96_hs3072_gb8_mb1
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
tile_factor ..................................... 1
titles_data_path ................................ None
tokenizer_type .................................. GPT2BPETokenizer
topk ............................................ 1
train_data_exact_num_epochs ..................... None
train_doc_idx_path .............................. None
train_idx_path .................................. None
train_iters ..................................... 1000
train_sample_idx_path ........................... None
train_samples ................................... None
train_shuffle_idx_path .......................... None
train_tokens .................................... None
use_checkpoint_lr_scheduler ..................... False
use_contiguous_buffers_in_ddp ................... False
use_cpu_initialization .......................... None
use_one_sent_docs ............................... False
use_pin_memory .................................. False
use_tutel ....................................... False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_file ...................................... gpt2-vocab.json
weight_decay .................................... 0.1
world_size ...................................... 8
zero_allgather_bucket_size ...................... 0.0
zero_contigious_gradients ....................... False
zero_reduce_bucket_size ......................... 0.0
zero_reduce_scatter ............................. False
zero_stage ...................................... 3
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1

building GPT2BPETokenizer tokenizer ...
padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
initializing torch distributed ...
[2023-05-04 01:51:54,265] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

setting tensorboard ...
initializing tensor model parallel with size 1
initializing pipeline model parallel with size 1
setting random seeds to 1234 ...
[2023-05-04 01:51:55,565] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
compiling dataset index builder ...
make: Entering directory '/workspace/megatron/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron/megatron/data'

done with dataset index builder. Compilation time: 0.110 seconds
compiling and loading fused kernels ...
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
NCCL version 2.15.5+cuda11.8
done with compiling and loading fused kernels. Compilation time: 6.573 seconds
time to initialize megatron (seconds): 66.251
[after megatron is initialized] datetime: 2023-05-04 01:52:02
building GPT model ...
[2023-05-04 01:52:02,435] [INFO] [utils.py:785:see_memory_usage] Before Building Model
[2023-05-04 01:52:02,436] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2023-05-04 01:52:02,437] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 23.49 GB, percent = 2.3%
[2023-05-04 01:52:06,274] [INFO] [utils.py:30:print_object] AsyncPartitionedParameterSwapper:
[2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] aio_handle ................... <class 'async_io.aio_handle'>
[2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024
[2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] aligned_elements_per_buffer .. 100000256
[2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] available_buffer_ids ......... [0, 1, 2, 3, 4]
[2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] available_numel .............. 0
[2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] available_params ............. set()
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] dtype ........................ torch.float16
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] elements_per_buffer .......... 100,000,000
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] id_to_path ................... {}
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] inflight_numel ............... 0
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] inflight_params .............. []
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] inflight_swap_in_buffers ..... []
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] invalid_buffer ............... 1.0
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] numel_alignment .............. 512
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] param_buffer_count ........... 5
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] param_id_to_buffer_id ........ {}
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] param_id_to_numel ............ {}
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] param_id_to_swap_buffer ...... {}
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] partitioned_swap_buffer ...... None
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] partitioned_swap_pool ........ None
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] pending_reads ................ 0
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] pending_writes ............... 0
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] reserved_buffer_ids .......... []
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] swap_config .................. device='nvme' nvme_path=PosixPath('/nvme') buffer_count=5 buffer_size=100,000,000 max_in_cpu=1,000,000,000 pin_memory=True
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] swap_element_size ............ 2
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] swap_folder .................. /nvme/zero_stage_3/float16params/rank0
[2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] swap_out_params .............. []
[2023-05-04 01:52:23,724] [INFO] [partition_parameters.py:454:exit] finished initializing model with 11.04B parameters
[2023-05-04 01:52:23,877] [INFO] [utils.py:785:see_memory_usage] After Building Model
[2023-05-04 01:52:23,878] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.29 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:52:23,878] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 32.4 GB, percent = 3.2%
number of parameters on (tensor, pipeline) model parallel rank (0, 0): 11036301312
ninja: no work to do.
Time to load cpu_adam op: 2.895385980606079 seconds
Time to load cpu_adam op: 2.773625612258911 seconds
Time to load cpu_adam op: 2.7883458137512207 seconds
ninja: no work to do.
Time to load cpu_adam op: 3.0697195529937744 seconds
Time to load cpu_adam op: 3.0779221057891846 seconds
Time to load cpu_adam op: 3.117755889892578 seconds
Time to load cpu_adam op: 3.1241681575775146 seconds
Time to load cpu_adam op: 3.1538240909576416 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1
learning rate decay style: cosine
DeepSpeed is enabled.
[2023-05-04 01:52:30,247] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-05-04 01:52:30,292] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-05-04 01:52:30,295] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-05-04 01:52:30,295] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1
[2023-05-04 01:52:30,454] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-05-04 01:52:30,454] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-05-04 01:52:30,454] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1
[2023-05-04 01:52:30,568] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
[2023-05-04 01:52:30,569] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:52:30,569] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 47.41 GB, percent = 4.7%
[2023-05-04 01:52:30,573] [INFO] [stage3.py:113:init] Reduce bucket size 90000000
[2023-05-04 01:52:30,573] [INFO] [stage3.py:114:init] Prefetch bucket size 50000000
ninja: no work to do.
Time to load utils op: 0.2917904853820801 seconds
Time to load utils op: 0.1040353775024414 seconds
[2023-05-04 01:52:30,774] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-05-04 01:52:30,775] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:52:30,775] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 47.41 GB, percent = 4.7%
Parameter Offload: Total persistent parameters: 3840000 in 770 params
[2023-05-04 01:52:30,912] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-05-04 01:52:30,913] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:52:30,913] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 47.42 GB, percent = 4.7%
ninja: no work to do.
Time to load utils op: 0.2980797290802002 seconds
Time to load utils op: 0.6072814464569092 seconds
Time to load utils op: 0.3056302070617676 seconds
[2023-05-04 01:52:31,009] [INFO] [utils.py:785:see_memory_usage] Before creating fp16 partitions
[2023-05-04 01:52:31,010] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:52:31,010] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 47.41 GB, percent = 4.7%
Time to load utils op: 0.6070020198822021 seconds
Time to load utils op: 0.4060196876525879 seconds
Time to load utils op: 0.40641093254089355 seconds
[2023-05-04 01:52:40,180] [INFO] [utils.py:785:see_memory_usage] After creating fp16 partitions: 15
[2023-05-04 01:52:40,181] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:52:40,181] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 86.18 GB, percent = 8.6%
[2023-05-04 01:52:40,181] [INFO] [stage3.py:467:_configure_tensor_swapping] Tensor Swapping: Adding optimizer tensors
[2023-05-04 01:52:42,207] [INFO] [utils.py:30:print_object] SwapBufferManager:
[2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] count ........................ 4
[2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32
[2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] free_buffer_index ............ [0, 1, 2, 3]
[2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] gigabytes .................... 1.546875
[2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] num_elems .................... 103809024
[2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] used_buffer_index ............ {}
Time to load async_io op: 2.539290189743042 seconds
Time to load async_io op: 2.6841211318969727 seconds
Time to load async_io op: 2.691549777984619 seconds
Time to load async_io op: 2.706519365310669 seconds
Time to load async_io op: 2.7506959438323975 seconds
Time to load async_io op: 2.815930128097534 seconds
Time to load async_io op: 2.7707629203796387 seconds
Time to load async_io op: 2.797130584716797 seconds
[2023-05-04 01:52:45,262] [INFO] [utils.py:30:print_object] PartitionedOptimizerSwapper:
[2023-05-04 01:52:45,262] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024
[2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32
[2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] largest_numel ................ 103809024
[2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576
[2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] numel_alignment .............. 256
[2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] swap_config .................. device='nvme' nvme_path=PosixPath('/nvme') buffer_count=4 pin_memory=True pipeline=False pipeline_read=False pipeline_write=False fast_init=False
[2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] swap_element_size ............ 4
[2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] swap_folder .................. /nvme/zero_stage_3/optimizer/rank0
[2023-05-04 01:52:45,528] [INFO] [utils.py:785:see_memory_usage] Before creating fp32 partitions
[2023-05-04 01:52:45,529] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:52:45,529] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 97.61 GB, percent = 9.7%
[2023-05-04 01:53:02,744] [INFO] [utils.py:785:see_memory_usage] After creating fp32 partitions
[2023-05-04 01:53:02,745] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:53:02,745] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 96.41 GB, percent = 9.6%
[2023-05-04 01:53:02,933] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-05-04 01:53:02,934] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:53:02,934] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 99.09 GB, percent = 9.8%
[2023-05-04 01:54:23,873] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states
[2023-05-04 01:54:23,874] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB
[2023-05-04 01:54:23,874] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 99.36 GB, percent = 9.9%
[2023-05-04 01:54:24,930] [INFO] [stage3.py:366:_setup_for_real_optimizer] optimizer state initialized
Time to load utils op: 0.0007295608520507812 seconds
Time to load utils op: 0.0007138252258300781 seconds
Time to load utils op: 0.0008652210235595703 seconds
Time to load utils op: 0.0006787776947021484 seconds
Time to load utils op: 0.0007004737854003906 seconds
Time to load utils op: 0.0007026195526123047 seconds
Time to load utils op: 0.0013163089752197266 seconds
[2023-05-04 01:54:39,709] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2023-05-04 01:54:39,710] [INFO] [utils.py:786:see_memory_usage] MA 0.17 GB Max_MA 0.74 GB CA 1.16 GB Max_CA 1 GB
[2023-05-04 01:54:39,710] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 129.04 GB, percent = 12.8%
[2023-05-04 01:54:39,711] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
[2023-05-04 01:54:39,711] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-05-04 01:54:39,711] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7f6b2a433e20>
[2023-05-04 01:54:39,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5.9999999999999995e-05, 5.9999999999999995e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-05-04 01:54:39,713] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-05-04 01:54:39,713] [INFO] [config.py:957:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-05-04 01:54:39,713] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-04 01:54:39,713] [INFO] [config.py:957:print] amp_enabled .................. False
[2023-05-04 01:54:39,713] [INFO] [config.py:957:print] amp_params ................... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] bfloat16_enabled ............. False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f6b1af9df70>
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] communication_data_type ...... None
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] curriculum_params_legacy ..... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] data_efficiency_enabled ...... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] dataloader_drop_last ......... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] disable_allgather ............ False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] dump_state ................... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... {'init_scale': 4096, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_enabled ........... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_gas_boundary_resolution 1
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_layer_num ......... 0
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_max_iter .......... 100
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_stability ......... 1e-06
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_tol ............... 0.01
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_verbose ........... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] elasticity_enabled ........... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] fp16_auto_cast ............... False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] fp16_enabled ................. True
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] global_rank .................. 0
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] grad_accum_dtype ............. None
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] gradient_accumulation_steps .. 1
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] gradient_clipping ............ 0.0
[2023-05-04 01:54:39,714] [INFO] [config.py:957:print] gradient_predivide_factor .... 1.0
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] initial_dynamic_scale ........ 4096
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] load_universal_checkpoint .... False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] loss_scale ................... 0
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] memory_breakdown ............. False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] optimizer_name ............... None
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] optimizer_params ............. None
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] pld_enabled .................. False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] pld_params ................... False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] prescale_gradients ........... False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] scheduler_name ............... None
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] scheduler_params ............. None
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] sparse_attention ............. None
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] sparse_gradients_enabled ..... False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] steps_per_print .............. 1
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] train_batch_size ............. 8
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 1
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] use_node_local_storage ....... False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] wall_clock_breakdown ......... False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] world_size ................... 8
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_allow_untested_optimizer False
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=90000000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='nvme', nvme_path=PosixPath('/nvme'), buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='nvme', nvme_path=PosixPath('/nvme'), buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=sys.maxsize max_live_parameters=3000000000 max_reuse_distance=3000000000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=True
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_enabled ................. True
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True
[2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_optimization_stage ...... 3
[2023-05-04 01:54:39,716] [INFO] [config.py:943:print_user_config] json = {
"train_batch_size": 8,
"train_micro_batch_size_per_gpu": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 3.000000e+09,
"stage3_max_reuse_distance": 3.000000e+09,
"stage3_param_persistence_threshold": 1.000000e+05,
"stage3_prefetch_bucket_size": 5.000000e+07,
"contiguous_gradients": true,
"overlap_comm": true,
"reduce_bucket_size": 9.000000e+07,
"sub_group_size": 1.000000e+08,
"offload_param": {
"device": "nvme",
"nvme_path": "/nvme",
"pin_memory": true
},
"offload_optimizer": {
"device": "nvme",
"pipeline_read": false,
"pipeline_write": false,
"nvme_path": "/nvme",
"pin_memory": true
}
},
"fp16": {
"enabled": true,
"initial_scale_power": 12
}
}
Time to load utils op: 0.00044536590576171875 seconds
[after model, optimizer, and learning rate scheduler are built] datetime: 2023-05-04 01:54:39
building train, validation, and test datasets ...
datasets target sizes (minimum size):
train: 8000
validation: 640
test: 320
building train, validation, and test datasets for GPT ...
building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
finished creating indexed dataset in 0.000360 seconds
number of documents: 250000
dataset split:
train:
document indices in [0, 245000) total of 245000 documents
validation:
document indices in [245000, 250000) total of 5000 documents
test:
document indices in [250000, 250000) total of 0 documents
NCCL version 2.15.5+cuda11.8
NCCL version 2.15.5+cuda11.8
NCCL version 2.15.5+cuda11.8
NCCL version 2.15.5+cuda11.8
NCCL version 2.15.5+cuda11.8
NCCL version 2.15.5+cuda11.8
NCCL version 2.15.5+cuda11.8
loading doc-idx mapping from ../dataset/my-gpt2_text_document_train_indexmap_8000ns_2048sl_1234s_doc_idx.npy
loading sample-idx mapping from ../dataset/my-gpt2_text_document_train_indexmap_8000ns_2048sl_1234s_sample_idx.npy
loading shuffle-idx mapping from ../dataset/my-gpt2_text_document_train_indexmap_8000ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.002 seconds
total number of samples: 70128
total number of epochs: 1
loading doc-idx mapping from ../dataset/my-gpt2_text_document_valid_indexmap_640ns_2048sl_1234s_doc_idx.npy
loading sample-idx mapping from ../dataset/my-gpt2_text_document_valid_indexmap_640ns_2048sl_1234s_sample_idx.npy
loading shuffle-idx mapping from ../dataset/my-gpt2_text_document_valid_indexmap_640ns_2048sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 1443
total number of epochs: 1
finished creating GPT datasets ...
[after dataloaders are built] datetime: 2023-05-04 01:54:40
time (ms) | model-and-optimizer-setup: 157334.23 | train/valid/test-data-iterators-setup: 682.76
done with setup ...
training ...
[before the start of training step] datetime: 2023-05-04 01:54:40
[2023-05-04 01:54:40,562] [INFO] [checkpointing.py:529:forward] Activation Checkpointing Information
[2023-05-04 01:54:40,562] [INFO] [checkpointing.py:530:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2023-05-04 01:54:40,562] [INFO] [checkpointing.py:531:forward] ----contiguous Memory Checkpointing False with 96 total layers
[2023-05-04 01:54:40,562] [INFO] [checkpointing.py:533:forward] ----Synchronization False
[2023-05-04 01:54:40,562] [INFO] [checkpointing.py:534:forward] ----Profiling time in checkpointing False
/nvme/zero_stage_3/optimizer/rank4/140370293727776_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0
/nvme/zero_stage_3/optimizer/rank6/139716763042496_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0
/nvme/zero_stage_3/optimizer/rank5/140154453562048_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0
/nvme/zero_stage_3/optimizer/rank3/139926213575360_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0
/nvme/zero_stage_3/optimizer/rank0/140093515555200_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0
/nvme/zero_stage_3/optimizer/rank1/139730926629568_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0
/nvme/zero_stage_3/optimizer/rank7/139738534742752_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0
/nvme/zero_stage_3/optimizer/rank2/140168675173056_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0
**

@etoilestar
Author

Also, the process sometimes freezes when running the same script.

@tjruwase
Contributor

tjruwase commented May 4, 2023

@etoilestar, thanks for sharing your log. Can you please do the following:

  1. Share the stack trace of the failure if possible.
  2. Share the size of /nvme/zero_stage_3/optimizer/rank4/140370293727776_gradient_94420992_1179648.tensor.swp.
  3. Try a smaller model by reducing the number of layers from 96 to 8.

@etoilestar
Author

Hello, could you tell me how to get the stack trace? The size of this file is 0. I just want to use disk offload to train a larger model. Thanks.

@tjruwase
Contributor

tjruwase commented May 5, 2023

The stack trace should be printed alongside the error message; it shows the code path leading to the failure.

A file size of 0 means the previous file write (creation) failed. Can you try running with a smaller model as suggested?
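One way to check for failed writes across all ranks is to compare each swap file's size on disk with the size its name implies. The sketch below is illustrative and not part of DeepSpeed. It assumes the file-name pattern seen in the error lines (`<id>_gradient_<offset>_<numel>.tensor.swp`) and 4-byte fp32 elements, which matches the logged `4718592 = 1179648 * 4`. The helper name `find_short_swap_files` is hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from the error lines in the log:
#   <id>_gradient_<offset>_<numel>.tensor.swp
SWAP_NAME = re.compile(r"_(\d+)\.tensor\.swp$")


def find_short_swap_files(swap_root, element_size=4):
    """Return (path, expected_bytes, actual_bytes) for each mismatched swap file."""
    mismatches = []
    for path in sorted(Path(swap_root).rglob("*.tensor.swp")):
        match = SWAP_NAME.search(path.name)
        if match is None:
            continue  # name carries no element count, so there is nothing to compare
        expected = int(match.group(1)) * element_size
        actual = path.stat().st_size
        if actual != expected:
            mismatches.append((str(path), expected, actual))
    return mismatches
```

Calling `find_short_swap_files("/nvme/zero_stage_3/optimizer")` after a failed run would list every file whose size differs from the expected byte count, including the zero-byte ones.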

@etoilestar
Author

Yes, when I reduce the number of layers to 8, the program runs normally.

@tjruwase
Contributor

tjruwase commented May 8, 2023

In that case, I am curious whether the failure is a filesystem problem, such as running out of disk space. How large is the offload folder?
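A rough way to rule out disk space is to estimate the offload footprint and compare it with the free space under the offload path. The sketch below is a back-of-the-envelope estimate, not DeepSpeed's actual accounting. It assumes 2 bytes per fp16 parameter plus four fp32 tensors per parameter (master weights, gradients, and the two Adam moments). The function names are hypothetical; the parameter count is the one reported in the log.

```python
import shutil


def offload_bytes_estimate(num_params, fp16_bytes=2, fp32_bytes=4, fp32_tensors=4):
    """Rough total: fp16 parameters plus `fp32_tensors` fp32 tensors per parameter."""
    return num_params * (fp16_bytes + fp32_bytes * fp32_tensors)


def has_enough_space(path, needed_bytes):
    """Check the free space of the filesystem holding `path` against `needed_bytes`."""
    return shutil.disk_usage(path).free >= needed_bytes


# 11036301312 is the parameter count printed in the log.
needed = offload_bytes_estimate(11_036_301_312)
print(f"estimated offload footprint: {needed / 1e9:.1f} GB")
```

Under these assumptions the estimate is about 200 GB across all ranks, so `has_enough_space("/nvme", needed)` gives a quick yes/no answer before launching a run.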

@etoilestar
Author

It is around 10T. My guess is that this bug is caused by the NVMe not being as fast as expected.

@tjruwase
Contributor

Ideally, NVMe speed should affect throughput but not cause failures.

If you would like to continue this investigation, can you please do the following?

  1. You can use the following to measure the nvme performance: [doc] profiling NVMe and configuring aio param section #998 (comment)
  2. Can you use binary search to increase the number of layers from 8 until you hit the failure? Then we can debug from there.
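The binary search in step 2 can be sketched as follows. This is a minimal illustration: `runs_ok` is a hypothetical callback that launches training with a given layer count and reports whether it completed. The search assumes 8 layers is known to pass, 96 is known to fail, and that any size larger than a failing size also fails.

```python
def first_failing_layer_count(runs_ok, low=8, high=96):
    """Smallest layer count in (low, high] for which `runs_ok` returns False.

    Assumes `low` is known to pass, `high` is known to fail, and that
    failures are monotonic: once a size fails, every larger size fails too.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if runs_ok(mid):
            low = mid   # mid passes, so the boundary lies above it
        else:
            high = mid  # mid fails, so the boundary is at or below it
    return high


# Stand-in for a real training launch; the threshold of 40 is made up.
print(first_failing_layer_count(lambda n: n < 40))
```

With 8 and 96 as the bounds, this needs at most seven training runs to isolate the first failing layer count.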

@etoilestar
Author

Okay, I will try again later.

@etoilestar
Author

There is another situation: when I increase buffer_count from 4 to 96, the size of the .swp file is no longer zero, yet the process freezes.
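For context on what that change means for host memory, the arithmetic below scales the `SwapBufferManager` figures from the log. It is only a sketch: it assumes `num_elems` per buffer stays at the logged value when `buffer_count` changes, and that all 8 ranks run on one node. Whether the extra pinned memory is related to the freeze is not established in this thread.

```python
def swap_buffer_gib(num_elems, element_size, buffer_count):
    """Size in GiB of `buffer_count` swap buffers of `num_elems` elements each."""
    return num_elems * element_size * buffer_count / 2**30


NUM_ELEMS = 103_809_024  # SwapBufferManager num_elems from the log
ELEMENT_SIZE = 4         # swap_element_size for torch.float32
RANKS = 8                # world_size from the log

for count in (4, 96):
    per_rank = swap_buffer_gib(NUM_ELEMS, ELEMENT_SIZE, count)
    total = per_rank * RANKS
    print(f"buffer_count={count}: {per_rank:.3f} GiB per rank, {total:.1f} GiB for {RANKS} ranks")
```

For `buffer_count=4` the function reproduces the `gigabytes .... 1.546875` value printed by `SwapBufferManager`, which suggests the formula matches how that figure is derived.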

@etoilestar
Author

Maybe you can take it into consideration.

@etoilestar
Author

Hello, it seems that the ViT model with PP/TP is not finished in https://github.com/microsoft/Megatron-DeepSpeed. I recently tried to write this code myself. Can you give me some advice?

@tjruwase
Contributor

@etoilestar, apologies for the silence. Are you still interested in this issue? Thanks!

@etoilestar
Author

Thank you. I am now focusing on another part of your project, so I will close this issue.
