How to enable context parallelism? #150

alex-ht · 2024-04-11T02:02:26Z

alex-ht
Apr 11, 2024

I have only seen the --context-parallel-size arg in Megatron-LM. How can it be used in NeMo or NeMo Aligner?

Apr 17, 2024

hey @alex-ht I checked and it turns out that unfortunately MoE + CP is not a combination that is supported at the moment in the nemo ecosystem. CP is only supported when we have no MoE(that part I will try to add asap).

View full answer

gshennvm · 2024-04-14T07:13:26Z

gshennvm
Apr 14, 2024
Maintainer

sorry that's not supported at this moment. do you have a particular use case for it? if yes I'm happy to look into adding it.

8 replies

alex-ht Apr 14, 2024
Author

I'm currently trying to train DBRX. Because this model is so large, we must use context parallel to avoid OOM.
To my knowledge, in the file examples/nlp/gpt/conf/gpt_sft.yaml, the model section can have the parameter context_parallel_size: CP_SIZE added. This value is passed to here. Is this correct?

alex-ht Apr 17, 2024
Author

I called examples/nlp/gpt/train_gpt_sft.py with the following parameters:

   model.tensor_model_parallel_size=8 \
   model.pipeline_model_parallel_size=1 \
   model.sequence_parallel=True \
  +model.context_parallel_size=8 \

However, it appears that the context parallel size is not taking effect. In all the nemo_log_globalrank-XX_localrank-0.txt files, the context parallel rank is always 0, as shown below:

[NeMo I 2024-04-16 23:10:05 megatron_init:251] Rank 27 has data parallel group : [3, 11, 19, 27, 35, 43, 51, 59]
[NeMo I 2024-04-16 23:10:05 megatron_init:257] Rank 27 has combined group of data parallel and context parallel : [3, 11, 19, 27, 35, 43, 51, 59]
[NeMo I 2024-04-16 23:10:05 megatron_init:262] All data parallel group ranks with context parallel combined: [[0, 8, 16, 24, 32, 40, 48, 56], [1, 9, 17, 25, 33, 41, 49, 57], [2, 10, 18, 26, 34, 42, 50, 58], [3, 11, 19, 27, 35, 43, 51, 59], [4, 12, 20, 28, 36, 44, 52, 60], [5, 13, 21, 29, 37, 45, 53, 61], [6, 14, 22, 30, 38, 46, 54, 62], [7, 15, 23, 31, 39, 47, 55, 63]]
[NeMo I 2024-04-16 23:10:05 megatron_init:265] Ranks 27 has data parallel rank: 3
[NeMo I 2024-04-16 23:10:05 megatron_init:282] Rank 27 has context parallel group: [27]
[NeMo I 2024-04-16 23:10:05 megatron_init:285] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63]]
[NeMo I 2024-04-16 23:10:05 megatron_init:286] Ranks 27 has context parallel rank: 0
[NeMo I 2024-04-16 23:10:05 megatron_init:297] Rank 27 has model parallel group: [24, 25, 26, 27, 28, 29, 30, 31]
[NeMo I 2024-04-16 23:10:05 megatron_init:298] All model parallel group ranks: [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15], [16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31], [32, 33, 34, 35, 36, 37, 38, 39], [40, 41, 42, 43, 44, 45, 46, 47], [48, 49, 50, 51, 52,

 53, 54, 55], [56, 57, 58, 59, 60, 61, 62, 63]]
[NeMo I 2024-04-16 23:10:05 megatron_init:313] Rank 27 has tensor model parallel rank: 3
[NeMo I 2024-04-16 23:10:05 megatron_init:342] Rank 27 has pipeline model parallel group: [27]
[NeMo I 2024-04-16 23:10:05 megatron_init:354] Rank 27 has embedding group: [27]
[NeMo I 2024-04-16 23:10:05 megatron_init:360] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63]]
[NeMo I 2024-04-16 23:10:05 megatron_init:361] Rank 27 has pipeline model parallel rank 0
[NeMo I 2024-04-16 23:10:05 megatron_init:362] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63]]
[NeMo I 2024-04-16 23:10:05 megatron_init:363] Rank 27 has embedding rank: 0

gshennvm Apr 17, 2024
Maintainer

ah okay, i think probably a flag is not being passed. I'm currently working on a version bump of aligner. I will try to dig into this

gshennvm Apr 17, 2024
Maintainer

hey @alex-ht I checked and it turns out that unfortunately MoE + CP is not a combination that is supported at the moment in the nemo ecosystem. CP is only supported when we have no MoE(that part I will try to add asap).

Answer selected by alex-ht

alex-ht Apr 22, 2024
Author

I am trying to train Llama3 model, but I have encountered another problem. First, I added gpt_cfg.context_parallel_size = cfg.model.get("context_parallel_size", 1) to train_gpt_sft.py to ensure that parameters can be passed. Next, I used the following parameters to call the training program:

CMD="python -u examples/nlp/gpt/train_gpt_sft.py \
   trainer.precision=16-mixed \
   trainer.num_nodes=${SLURM_JOB_NUM_NODES} \
   trainer.devices=-1 \
   trainer.sft.limit_val_batches=1 \
   trainer.sft.val_check_interval=12 \
   trainer.sft.save_interval=12 \
   model.megatron_amp_O2=True \
   model.restore_from_path=${PRETRAINED_ACTOR_NEMO_FILE} \
   model.optim.lr=5e-6 \
   model.optim.name=distributed_fused_adam \
   model.answer_only_loss=True \
   model.data.num_workers=0 \
   model.data.train_ds.micro_batch_size=1 \
   model.data.train_ds.global_batch_size=8 \
   model.data.train_ds.file_path=${TRAIN_DATA_PATH} \
   model.data.train_ds.max_seq_length=8192 \
   model.data.train_ds.add_eos=True \
   model.data.train_ds.add_bos=True \
   +model.data.train_ds.packed_sequence=True \
   model.data.validation_ds.max_seq_length=8192 \
   model.data.validation_ds.micro_batch_size=1 \
   model.data.validation_ds.global_batch_size=128 \
   model.data.validation_ds.file_path=${VALID_DATA_PATH} \
   model.data.validation_ds.add_bos=True \
   model.data.validation_ds.add_eos=True \
   +model.data.validation_ds.packed_sequence=True \
   model.tensor_model_parallel_size=8 \
   model.pipeline_model_parallel_size=4 \
   model.sequence_parallel=False \
   model.activations_checkpoint_granularity=selective \
   model.activations_checkpoint_method=uniform \
   +model.context_parallel_size=2 \
   exp_manager.create_wandb_logger=True \
   exp_manager.explicit_log_dir=${RESULTS_DIR} \
   exp_manager.wandb_logger_kwargs.project=${PROJECT} \
   exp_manager.wandb_logger_kwargs.name=dolly_sft_run_tp8 \
   exp_manager.resume_if_exists=True \
   exp_manager.resume_ignore_no_checkpoint=True \
   exp_manager.create_checkpoint_callback=True \
   exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
   exp_manager.checkpoint_callback_params.monitor=validation_loss \
   "

I ran into the following error when starting the training

Error executing job with overrides: ['trainer.precision=16-mixed', 'trainer.num_nodes=16', 'trainer.devices=-1', 'trainer.sft.limit_val_batches=1', 'trainer.sft.val_check_interval=12', 'trainer.sft.save_interval=12', 'model.megatron_amp_O2=True', 'model.restore_from_path=/home/u9824269/LLM/llama3/train/llama3-70b-chat-16-mixed.nemo', 'model.optim.lr=5e-6', 'model.optim.name=distributed_fused_adam', 'model.answer_only_loss=True', 'model.data.num_workers=0', 'model.data.train_ds.micro_batch_size=1', 'model.data.train_ds.global_batch_size=8', 'model.data.train_ds.file_path=/home/u9824269/LLM/nemo/databricks-dolly-15k-output.jsonl', 'model.data.train_ds.max_seq_length=8192', 'model.data.train_ds.add_eos=True', 'model.data.train_ds.add_bos=True', '+model.data.train_ds.packed_sequence=True', 'model.data.validation_ds.max_seq_length=8192', 'model.data.validation_ds.micro_batch_size=1', 'model.data.validation_ds.global_batch_size=128', 'model.data.validation_ds.file_path=/home/u9824269/LLM/nemo/databricks-dolly-15k-output.jsonl', 'model.data.validation_ds.add_bos=True', 'model.data.validation_ds.add_eos=True', '+model.data.validation_ds.packed_sequence=True', 'model.tensor_model_parallel_size=8', 'model.pipeline_model_parallel_size=4', 'model.sequence_parallel=False', 'model.activations_checkpoint_granularity=selective', 'model.activations_checkpoint_method=uniform', '+model.context_parallel_size=2', 'exp_manager.create_wandb_logger=True', 'exp_manager.explicit_log_dir=/home/u9824269/LLM/llama3/train/result_dir', 'exp_manager.wandb_logger_kwargs.project=llama3-test', 'exp_manager.wandb_logger_kwargs.name=dolly_sft_run_tp8', 'exp_manager.resume_if_exists=True', 'exp_manager.resume_ignore_no_checkpoint=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', 'exp_manager.checkpoint_callback_params.monitor=validation_loss']
Traceback (most recent call last):
  File "/home/u9824269/LLM/nemo/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 244, in <module>
    main()
  File "/opt/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/u9824269/LLM/nemo/NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py", line 240, in main
    sft_trainer.fit()
  File "/home/u9824269/LLM/nemo/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 186, in fit
    loss, metrics = self.train_single_step(batch)
  File "/home/u9824269/LLM/nemo/NeMo-Aligner/nemo_aligner/algorithms/supervised.py", line 129, in train_single_step
    loss_mean, metrics = self.model.get_loss_and_metrics(batch=batch, forward_only=False)
  File "/home/u9824269/LLM/nemo/NeMo-Aligner/nemo_aligner/models/nlp/gpt/gpt_sft_model.py", line 92, in get_loss_and_metrics
    losses_reduced = fwd_bwd_function(
  File "/opt/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1211, in forward_backward_pipelining_without_interleaving
    output_tensor = forward_step(
  File "/opt/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 192, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/opt/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 1015, in fwd_output_and_loss_func
    output_tensor = model(**forward_args)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/Megatron-LM/megatron/core/transformer/module.py", line 168, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/Megatron-LM/megatron/core/models/gpt/gpt_model.py", line 170, in forward
    hidden_states = self.decoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/Megatron-LM/megatron/core/transformer/transformer_block.py", line 368, in forward
    hidden_states, context = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 159, in forward
    attention_output_with_bias = self.self_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/Megatron-LM/megatron/core/transformer/attention.py", line 296, in forward
    core_attn_out = self._checkpointed_attention_forward(
  File "/opt/Megatron-LM/megatron/core/transformer/attention.py", line 130, in _checkpointed_attention_forward
    hidden_states = tensor_parallel.checkpoint(
  File "/opt/Megatron-LM/megatron/core/tensor_parallel/random.py", line 266, in checkpoint
    return CheckpointFunction.apply(function, distribute_saved_activations, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/Megatron-LM/megatron/core/tensor_parallel/random.py", line 205, in forward
    outputs = run_function(*args)
  File "/opt/Megatron-LM/megatron/core/transformer/attention.py", line 117, in custom_forward
    output_ = self.core_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/Megatron-LM/megatron/core/transformer/custom_layers/transformer_engine.py", line 462, in forward
    core_attn_out = super().forward(
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 3339, in forward
    return self.mem_eff_attention(query_layer,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 2551, in forward
    output = attn_forward_func_with_cp(
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 1183, in attn_forward_func_with_cp
    out = AttnFuncWithCP.apply(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 773, in forward
    flash_attn_fwd_softmax_lse_correction(softmax_lse_[..., 1, :],
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 417, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 580, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 741, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 384, in _convert_frame_assert
    return _compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 643, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 246, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 524, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 151, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 489, in transform
    tracer.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 2110, in run
    super().run()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 780, in run
    and self.step()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 743, in step
    getattr(self, inst.opname)(inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 462, in wrapper
    return inner_fn(self, inst)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 1190, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/symbolic_convert.py", line 644, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/torch.py", line 559, in call_function
    tensor_variable = wrap_fx_proxy(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builder.py", line 1302, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/builder.py", line 1387, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1590, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1545, in get_fake_value
    ret_val = wrap_fake_exception(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1086, in wrap_fake_exception
    return fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1546, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1657, in run_node
    raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 1636, in run_node
    return node.target(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1480, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_subclasses/fake_tensor.py", line 1738, in dispatch
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 571, in __call__
    return self_._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/wrappers.py", line 250, in _fn
    result = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_prims_common/wrappers.py", line 137, in _fn
    result = fn(**bound.arguments)
  File "/usr/local/lib/python3.10/dist-packages/torch/_refs/__init__.py", line 1033, in _ref
    a, b = _maybe_broadcast(a, b)
  File "/usr/local/lib/python3.10/dist-packages/torch/_refs/__init__.py", line 413, in _maybe_broadcast
    common_shape = _broadcast_shapes(
  File "/usr/local/lib/python3.10/dist-packages/torch/_refs/__init__.py", line 402, in _broadcast_shapes
    raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in method max of type object at 0x2ac549a5ab60>(*(FakeTensor(..., device='cuda:0', size=(1, 8, 144), dtype=torch.float64), FakeTensor(..., device='cuda:0', size=(1, 8, 160))), **{}):
Attempting to broadcast a dimension of length 160 at -1! Mismatching argument at index 1 had torch.Size([1, 8, 160]); but expected shape should be broadcastable to [1, 8, 144]

from user code:
   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 486, in flash_attn_fwd_softmax_lse_correction
    max_scale = torch.max(softmax_lse, softmax_lse_per_step)

besides, the TransformerEngine I am using has been modified to incorporate xformers to support the V100 GPU. This program has passed the test at https://github.com/NVIDIA/TransformerEngine/blob/main/tests/pytorch/fused_attn/run_fused_attn_with_cp.py, so there should be no issues.

I can't find examples of using context parallel, and any would be very helpful.

gshennvm Apr 22, 2024
Maintainer

hello, i talked to the CP expert here and they told me this seems more like a torch compile issue. To help rule it out can you let me know if llama3 works if you don't use CP at all?

Unfortunately the only docs we have for CP is here: https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/context_parallel.html

alex-ht Apr 23, 2024
Author

Yes, it can run normally without CP.
I will share my code later.

alex-ht Apr 23, 2024
Author

I have placed the log messages and scripts here.
https://github.com/alex-ht/nemo_cp_debug

The first experiment (without CP) encountered an error while saving the checkpoint, and the second experiment (with CP) had an error right at the start of training.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to enable context parallelism? #150

{{title}}

Replies: 1 comment 8 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

How to enable context parallelism? #150

alex-ht Apr 11, 2024

Replies: 1 comment · 8 replies

gshennvm Apr 14, 2024 Maintainer

alex-ht Apr 14, 2024 Author

alex-ht Apr 17, 2024 Author

gshennvm Apr 17, 2024 Maintainer

gshennvm Apr 17, 2024 Maintainer

alex-ht Apr 22, 2024 Author

gshennvm Apr 22, 2024 Maintainer

alex-ht Apr 23, 2024 Author

alex-ht Apr 23, 2024 Author

alex-ht
Apr 11, 2024

Replies: 1 comment 8 replies

gshennvm
Apr 14, 2024
Maintainer

alex-ht Apr 14, 2024
Author

alex-ht Apr 17, 2024
Author

gshennvm Apr 17, 2024
Maintainer

gshennvm Apr 17, 2024
Maintainer

alex-ht Apr 22, 2024
Author

gshennvm Apr 22, 2024
Maintainer

alex-ht Apr 23, 2024
Author

alex-ht Apr 23, 2024
Author