
Socket Timeout after 30 minutes running · Issue #809

Closed · 6 of 8 tasks
unknown-submitter-000 opened this issue Nov 1, 2023 · 10 comments

Labels
bug Something isn't working

@unknown-submitter-000
Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Situation: I am running a fine-tuning task on a llama-7b model with QLoRA. When the dataset size increases from 10k to 20k, the program always runs into a socket issue. I have already set ddp_timeout large enough.
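For context, the 30-minute mark matches torch.distributed's default process-group timeout of 1800 seconds. As a point of reference only (this is a plain PyTorch sketch, not the axolotl/accelerate code path, and whether ddp_timeout ever reaches this call is exactly what this issue is about), the timeout is normally raised like this when you create the process group yourself:

# Sketch: raising the c10d/NCCL timeout on a hand-rolled process group.
# Not the axolotl code path; shown only to illustrate the 1800 s default.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(hours=6))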

Current behaviour

Map (num_proc=64): 57%|█████▋ | 76555/135065 [29:59<25:28, 38.28 examples/s]
Map (num_proc=64): 57%|█████▋ | 76562/135065 [29:59<21:20, 45.68 examples/s]
Map (num_proc=64): 57%|█████▋ | 76572/135065 [29:59<17:38, 55.24 examples/s]
Map (num_proc=64): 57%|█████▋ | 76578/135065 [29:59<18:34, 52.50 examples/s]
Map (num_proc=64): 57%|█████▋ | 76584/135065 [29:59<25:07, 38.79 examples/s]
Map (num_proc=64): 57%|█████▋ | 76589/135065 [29:59<25:47, 37.78 examples/s]
Map (num_proc=64): 57%|█████▋ | 76595/135065 [29:59<23:49, 40.90 examples/s]
Map (num_proc=64): 57%|█████▋ | 76602/135065 [30:00<21:21, 45.61 examples/s]
Map (num_proc=64): 57%|█████▋ | 76608/135065 [30:00<20:04, 48.54 examples/s]
Map (num_proc=64): 57%|█████▋ | 76614/135065 [30:00<24:54, 39.10 examples/s]
Map (num_proc=64): 57%|█████▋ | 76619/135065 [30:00<25:23, 38.37 examples/s]
Map (num_proc=64): 57%|█████▋ | 76626/135065 [30:00<22:17, 43.69 examples/s]
Map (num_proc=64): 57%|█████▋ | 76634/135065 [30:00<19:49, 49.14 examples/s]
Map (num_proc=64): 57%|█████▋ | 76640/135065 [30:00<21:09, 46.04 examples/s]
Map (num_proc=64): 57%|█████▋ | 76645/135065 [30:01<22:39, 42.98 examples/s]
Map (num_proc=64): 57%|█████▋ | 76650/135065 [30:01<24:14, 40.15 examples/s]
Map (num_proc=64): 57%|█████▋ | 76659/135065 [30:01<19:04, 51.05 examples/s]
Map (num_proc=64): 57%|█████▋ | 76665/135065 [30:01<19:50, 49.08 examples/s]
Map (num_proc=64): 57%|█████▋ | 76671/135065 [30:01<19:21, 50.29 examples/s]
Map (num_proc=64): 57%|█████▋ | 76677/135065 [30:01<25:21, 38.37 examples/s]
Map (num_proc=64): 57%|█████▋ | 76682/135065 [30:01<23:51, 40.77 examples/s]
Map (num_proc=64): 57%|█████▋ | 76687/135065 [30:02<24:25, 39.83 examples/s]
Map (num_proc=64): 57%|█████▋ | 76695/135065 [30:02<21:09, 46.00 examples/s]
Map (num_proc=64): 57%|█████▋ | 76701/135065 [30:02<21:05, 46.13 examples/s]
Map (num_proc=64): 57%|█████▋ | 76706/135065 [30:02<21:06, 46.07 examples/s]
Map (num_proc=64): 57%|█████▋ | 76711/135065 [30:02<21:36, 45.00 examples/s]
Map (num_proc=64): 57%|█████▋ | 76716/135065 [30:02<22:41, 42.85 examples/s]
Map (num_proc=64): 57%|█████▋ | 76722/135065 [30:02<22:22, 43.47 examples/s]
Map (num_proc=64): 57%|█████▋ | 76727/135065 [30:02<21:53, 44.40 examples/s]
Map (num_proc=64): 57%|█████▋ | 76733/135065 [30:03<21:06, 46.05 examples/s]
Map (num_proc=64): 57%|█████▋ | 76738/135065 [30:03<20:49, 46.68 examples/s]
Map (num_proc=64): 57%|█████▋ | 76745/135065 [30:03<23:24, 41.52 examples/s]
Map (num_proc=64): 57%|█████▋ | 76754/135065 [30:03<21:05, 46.09 examples/s]
Map (num_proc=64): 57%|█████▋ | 76761/135065 [30:03<19:50, 48.96 examples/s]
Map (num_proc=64): 57%|█████▋ | 76766/135065 [30:03<24:21, 39.88 examples/s]
Map (num_proc=64): 57%|█████▋ | 76773/135065 [30:03<21:44, 44.67 examples/s]
Map (num_proc=64): 57%|█████▋ | 76778/135065 [30:04<23:13, 41.81 examples/s]
Map (num_proc=64): 57%|█████▋ | 76783/135065 [30:04<24:39, 39.39 examples/s]
Map (num_proc=64): 57%|█████▋ | 76789/135065 [30:04<22:31, 43.13 examples/s]
Map (num_proc=64): 57%|█████▋ | 76794/135065 [30:04<22:16, 43.60 examples/s]
Map (num_proc=64): 57%|█████▋ | 76800/135065 [30:04<21:51, 44.42 examples/s]

Traceback (most recent call last):
  File "/home/ec2-user/proj/code/axolotl/scripts/finetune.py", line 52, in <module>
    fire.Fire(do_cli)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ec2-user/proj/code/axolotl/scripts/finetune.py", line 47, in do_cli
    dataset_meta = load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
  File "/home/ec2-user/proj/code/axolotl/src/axolotl/cli/__init__.py", line 225, in load_datasets
    train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
  File "/home/ec2-user/proj/code/axolotl/src/axolotl/utils/data.py", line 61, in prepare_dataset
    with zero_first(is_main_process()):
  File "/opt/conda/envs/pytorch/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/ec2-user/proj/code/axolotl/src/axolotl/utils/distributed.py", line 63, in zero_first
    barrier()
  File "/home/ec2-user/proj/code/axolotl/src/axolotl/utils/distributed.py", line 40, in barrier
    dist.barrier()
  File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1691816824431/work/torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f735df914d7 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f735df5b434 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRefstd::string, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7f7399840da8 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7f7399841a52 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7f7399841ad9 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7399800fa1 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7399800fa1 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7399800fa1 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7399800fa1 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7f735ef8b93f in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocatorc10::Device > const&, c10d::OpType, int, bool) + 0x201 (0x7f735ef8f5f1 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0xf3b71d (0x7f735ef9671d in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&) + 0x21 (0x7f735ef97b21 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocatorat::Tensor >&, c10d::AllreduceOptions const&) + 0x39d (0x7f735ef9a7dd in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #14: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0x851 (0x7f735efa9731 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #15: + 0x4def7b9 (0x7f73997f57b9 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: + 0x4df350a (0x7f73997f950a in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: + 0x4e02370 (0x7f7399808370 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: + 0xb6933e (0x7f73a11e433e in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #19: + 0x3b6f75 (0x7f73a0a31f75 in /opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #20: + 0x144516 (0x5606178b5516 in /opt/conda/envs/pytorch/bin/python)
frame #21: _PyObject_MakeTpCall + 0x26b (0x5606178aea6b in /opt/conda/envs/pytorch/bin/python)
frame #22: + 0x1507d6 (0x5606178c17d6 in /opt/conda/envs/pytorch/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x13ca (0x5606178a68fa in /opt/conda/envs/pytorch/bin/python)
frame #24: _PyFunction_Vectorcall + 0x6c (0x5606178b599c in /opt/conda/envs/pytorch/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x4c12 (0x5606178aa142 in /opt/conda/envs/pytorch/bin/python)
frame #26: _PyFunction_Vectorcall + 0x6c (0x5606178b599c in /opt/conda/envs/pytorch/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x320 (0x5606178a5850 in /opt/conda/envs/pytorch/bin/python)
frame #28: + 0x1b5370 (0x560617926370 in /opt/conda/envs/pytorch/bin/python)
frame #29: + 0x144b63 (0x5606178b5b63 in /opt/conda/envs/pytorch/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x320 (0x5606178a5850 in /opt/conda/envs/pytorch/bin/python)
frame #31: + 0x150774 (0x5606178c1774 in /opt/conda/envs/pytorch/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x28e7 (0x5606178a7e17 in /opt/conda/envs/pytorch/bin/python)
frame #33: _PyFunction_Vectorcall + 0x6c (0x5606178b599c in /opt/conda/envs/pytorch/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x320 (0x5606178a5850 in /opt/conda/envs/pytorch/bin/python)
frame #35: _PyFunction_Vectorcall + 0x6c (0x5606178b599c in /opt/conda/envs/pytorch/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x13ca (0x5606178a68fa in /opt/conda/envs/pytorch/bin/python)
frame #37: _PyFunction_Vectorcall + 0x6c (0x5606178b599c in /opt/conda/envs/pytorch/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x2d80 (0x5606178a82b0 in /opt/conda/envs/pytorch/bin/python)
frame #39: _PyFunction_Vectorcall + 0x6c (0x5606178b599c in /opt/conda/envs/pytorch/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x13ca (0x5606178a68fa in /opt/conda/envs/pytorch/bin/python)
frame #41: _PyFunction_Vectorcall + 0x6c (0x5606178b599c in /opt/conda/envs/pytorch/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x320 (0x5606178a5850 in /opt/conda/envs/pytorch/bin/python)
frame #43: _PyFunction_Vectorcall + 0x6c (0x5606178b599c in /opt/conda/envs/pytorch/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x4c12 (0x5606178aa142 in /opt/conda/envs/pytorch/bin/python)
frame #45: + 0x1d7f90 (0x560617948f90 in /opt/conda/envs/pytorch/bin/python)
frame #46: PyEval_EvalCode + 0x87 (0x560617948ed7 in /opt/conda/envs/pytorch/bin/python)
frame #47: + 0x20842a (0x56061797942a in /opt/conda/envs/pytorch/bin/python)
frame #48: + 0x203833 (0x560617974833 in /opt/conda/envs/pytorch/bin/python)
frame #49: + 0x9a6cd (0x56061780b6cd in /opt/conda/envs/pytorch/bin/python)
frame #50: _PyRun_SimpleFileObject + 0x1ae (0x56061796ed1e in /opt/conda/envs/pytorch/bin/python)
frame #51: _PyRun_AnyFileObject + 0x44 (0x56061796e8b4 in /opt/conda/envs/pytorch/bin/python)
frame #52: Py_RunMain + 0x38b (0x56061796baab in /opt/conda/envs/pytorch/bin/python)
frame #53: Py_BytesMain + 0x37 (0x56061793c527 in /opt/conda/envs/pytorch/bin/python)
frame #54: __libc_start_main + 0xea (0x7f73f2dd713a in /lib64/libc.so.6)
frame #55: + 0x1cb421 (0x56061793c421 in /opt/conda/envs/pytorch/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.

[4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1691816824431/work/torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first):
[stack frames identical to the ones above]
. This may indicate a possible application crash on rank 0 or a network set up issue.

[7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1691816824431/work/torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first):
[stack frames identical to the ones above]
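The traceback shows the non-zero ranks blocked at dist.barrier() inside zero_first() while rank 0 is still mapping the dataset; the barrier needs the ncclUniqueId that rank 0 publishes to the c10d store, and the store->get('0') call gives up once the wait exceeds the process-group timeout. A rough schematic of that pattern (simplified for illustration; not the actual axolotl source):

# Schematic of the zero_first barrier pattern implied by the traceback above.
# Simplified for illustration; not the actual axolotl source.
from contextlib import contextmanager
import torch.distributed as dist

@contextmanager
def zero_first(is_main: bool):
    if not is_main:
        dist.barrier()  # non-zero ranks wait here while rank 0 preprocesses
    yield               # rank 0 does the long dataset map inside this block
    if is_main:
        dist.barrier()  # rank 0 releases the waiting ranks when it is done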

Steps to reproduce

Use a large enough custom training dataset and it will happen. It always times out at around 30 minutes.

Config yaml

base_model: /home/ec2-user/proj/llm_models/vicuna-13b-v1.5-16k
base_model_config: /home/ec2-user/proj/llm_models/vicuna-13b-v1.5-16k
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: /home/ec2-user/proj/code/llm_long_context/qa_retrieval_ft_data/ft-nq-open-30_total_documents_gold_at_0/potential_20_sample_5_25000fewshot.jsonl
    type: alpaca
dataset_prepared_path:
val_set_size: 0.01
output_dir: ./qa-potential_20_sample_5_25000fewshot/qlora-out-vicuna-13b-v1.5-16k

adapter: qlora
lora_model_dir:

sequence_len: 16384
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 2
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

ddp_timeout: 36000000

warmup_steps: 10
eval_steps: 20
eval_table_size:
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

No idea.

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10.12

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@unknown-submitter-000 added the bug label on Nov 1, 2023
@NanoCode012
Collaborator

It's weird to get a timeout with only 20k samples. Could you try tokenizing on GPU 0 only, then training?

CUDA_VISIBLE_DEVICES=0 python -m axolotl.cli.preprocess your_config.yml
accelerate launch -m axolotl.cli.train  your_config.yml

@unknown-submitter-000
Author

Thank you for the suggestion.

It actually has 135k samples in total; the key issue is that the context length is 16k, so the mapping time can be very long.

I tried the suggestion, and preprocessing completes normally. However, when I start the training process, the same issue still happens after 30 minutes.

@NanoCode012
Collaborator

Preprocessing completes normally. However, when I start the training process, the same issue still happens after 30 minutes.

Could you please retry while setting dataset_prepared_path: to a folder? For example:

dataset_prepared_path: last_run_prepared

This will allow you to reuse that processed dataset.

@unknown-submitter-000
Author

Thank you. I think adding the dataset_prepared_path works!

@gordicaleksa
Contributor

gordicaleksa commented Nov 19, 2023

I'm hitting this same issue; dataset_prepared_path doesn't help.

I tried hardcoding the timeout in torch.distributed's init_process_group to change it from the default 1800 seconds to 3600, but that doesn't work. I suspect this is some weird interaction between axolotl & accelerate.

Are there any proper solutions, as opposed to making your run a bit faster so that you stay under the 30-minute timeout?
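For reference, when Accelerate owns the process-group initialization, the timeout is usually passed through InitProcessGroupKwargs rather than by patching torch directly. A hedged sketch (whether axolotl exposes a hook to pass this handler, and whether it covers the barrier used during dataset prep, is not confirmed by this thread):

# Sketch: raising the process-group timeout when Accelerate creates the group.
# Whether this reaches the dataset-prep barrier in axolotl is an assumption.
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=10))
accelerator = Accelerator(kwargs_handlers=[kwargs])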

@NanoCode012
Collaborator

@gordicaleksa Did you try tokenizing on one GPU before running on multiple GPUs?

@jinwonkim93
Contributor

I have the same issue.

@gordicaleksa
Contributor

@NanoCode012 I did manage to tokenize it, but now the filters take longer because I have a big dataset. I just need to disable the timeout logic; I don't see the point of it. If I think my run is stuck, I'll just stop it myself.

@eryk-mazus

Same issue. I have 4.5 million rows, and with a smaller context (2,048 tokens) it wasn't a problem for some reason. When I increased the context length to 8,196 it suddenly became an issue.

@winglian
Collaborator

@eryk-mazus Could you open a new issue with details? Please include your YAML, the commands you used, and your GPU configuration.
