I have only seen the …
-
Sorry, that's not supported at the moment. Do you have a particular use case for it? If so, I'm happy to look into adding it.
-
I'm currently trying to train DBRX. Because the model is so large, we must use context parallelism to avoid OOM.
-
I passed:

```
model.tensor_model_parallel_size=8 \
model.pipeline_model_parallel_size=1 \
model.sequence_parallel=True \
+model.context_parallel_size=8 \
```

However, it appears that the context parallel size is not taking effect. In all the …
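One way to confirm whether such an override actually reaches Megatron-Core is to query its parallel state directly. A minimal sketch, assuming Megatron-Core is installed and the script is launched with `torchrun` (the `initialize_model_parallel` keyword `context_parallel_size` matches recent `megatron.core` versions; verify against yours):

```python
# check_cp.py -- sketch to confirm the context-parallel size took effect.
# Launch with e.g.: torchrun --nproc_per_node=<gpus> check_cp.py
import os
import torch
from megatron.core import parallel_state

torch.distributed.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Mirror the overrides above: TP=8, PP=1, CP=8 (needs 64 GPUs total;
# shrink these sizes to match your actual world size).
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=8,
    pipeline_model_parallel_size=1,
    context_parallel_size=8,
)

# If the override took effect, this prints 8, not 1.
print("CP world size:", parallel_state.get_context_parallel_world_size())
```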
-
Ah okay, I think a flag is probably not being passed. I'm currently working on a version bump of Aligner; I will try to dig into this.
-
Hey @alex-ht, I checked, and it turns out that unfortunately MoE + CP is not a combination that is supported at the moment in the NeMo ecosystem. CP is only supported when there is no MoE (I will try to add that part ASAP).
-
I am trying to train a Llama 3 model, but I have encountered another problem. First, I set:

```
CMD="python -u examples/nlp/gpt/train_gpt_sft.py \
    trainer.precision=16-mixed \
    trainer.num_nodes=${SLURM_JOB_NUM_NODES} \
    trainer.devices=-1 \
    trainer.sft.limit_val_batches=1 \
    trainer.sft.val_check_interval=12 \
    trainer.sft.save_interval=12 \
    model.megatron_amp_O2=True \
    model.restore_from_path=${PRETRAINED_ACTOR_NEMO_FILE} \
    model.optim.lr=5e-6 \
    model.optim.name=distributed_fused_adam \
    model.answer_only_loss=True \
    model.data.num_workers=0 \
    model.data.train_ds.micro_batch_size=1 \
    model.data.train_ds.global_batch_size=8 \
    model.data.train_ds.file_path=${TRAIN_DATA_PATH} \
    model.data.train_ds.max_seq_length=8192 \
    model.data.train_ds.add_eos=True \
    model.data.train_ds.add_bos=True \
    +model.data.train_ds.packed_sequence=True \
    model.data.validation_ds.max_seq_length=8192 \
    model.data.validation_ds.micro_batch_size=1 \
    model.data.validation_ds.global_batch_size=128 \
    model.data.validation_ds.file_path=${VALID_DATA_PATH} \
    model.data.validation_ds.add_bos=True \
    model.data.validation_ds.add_eos=True \
    +model.data.validation_ds.packed_sequence=True \
    model.tensor_model_parallel_size=8 \
    model.pipeline_model_parallel_size=4 \
    model.sequence_parallel=False \
    model.activations_checkpoint_granularity=selective \
    model.activations_checkpoint_method=uniform \
    +model.context_parallel_size=2 \
    exp_manager.create_wandb_logger=True \
    exp_manager.explicit_log_dir=${RESULTS_DIR} \
    exp_manager.wandb_logger_kwargs.project=${PROJECT} \
    exp_manager.wandb_logger_kwargs.name=dolly_sft_run_tp8 \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
"
```

I ran into the following error when starting the training: … Besides, the TransformerEngine I am using has been modified to incorporate xformers to support V100 GPUs; that build has passed the tests at … I can't find examples of using context parallelism, and any would be very helpful.
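One thing that may be worth checking here, as a quick sanity pass: my understanding (an assumption to verify against the Megatron-Core CP docs for your version) is that load-balanced context parallelism splits each sequence into 2 × cp_size chunks, so the sequence length must divide evenly into those chunks. A small check in Python:

```python
# Sketch of a divisibility sanity check for context parallelism.
# Assumption: Megatron-Core's load-balanced CP splits every sequence
# into 2 * cp_size chunks, so seq_len % (2 * cp_size) must be 0.
def check_cp_divisibility(seq_len: int, cp_size: int) -> None:
    chunks = 2 * cp_size
    if seq_len % chunks != 0:
        raise ValueError(
            f"seq_len={seq_len} is not divisible by 2*cp_size={chunks}"
        )
    print(f"OK: {seq_len} -> {chunks} chunks of {seq_len // chunks} tokens")

# Values taken from the command above: max_seq_length=8192, CP=2.
check_cp_divisibility(seq_len=8192, cp_size=2)
```

For 8192 and CP=2 this passes, but with packed sequences the effective per-sample lengths can vary, which may be worth ruling out too.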
-
Hello, I talked to the CP expert here, and they told me this seems more like a torch compile issue. To help rule it out, can you let me know whether Llama 3 works if you don't use CP at all? Unfortunately, the only docs we have for CP are here: https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html
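If torch compile inside TransformerEngine is the suspect, one more way to narrow it down (an assumption on my part: TE has historically gated its fused JIT ops behind the `NVTE_TORCH_COMPILE` environment variable, so check that your modified TE build still honors it) is to disable those fusions before TE is imported:

```python
# Sketch: disable TransformerEngine's torch.compile/JIT fusions to rule
# them out. The variable must be set before transformer_engine is imported;
# whether your (xformers-modified) TE build reads it is an assumption to
# verify against its source.
import os
os.environ["NVTE_TORCH_COMPILE"] = "0"

import transformer_engine.pytorch  # noqa: E402  (import after the env var)
```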
-
Yes, it can run normally without CP.
-
I have placed the log messages and scripts here. The first experiment (without CP) encountered an error while saving the checkpoint, and the second experiment (with CP) had an error right at the start of training.