When I use NVMe to offload param and optimizer I meet a bug [BUG] #3376
Comments
@etoilestar, can you share more details to repro this issue? In the meantime, can you confirm that the offload folder was emptied before the run?
Yes, I emptied this folder before I ran the code. What kind of information should I provide? Can you give me a hint?
I use 8 RTX 3090 GPUs, and the code I execute is Megatron-DeepSpeed to train GPT-3. When I increase buffer_count, this error disappears, but the run then freezes during preprocessing.
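For context, buffer_count here refers to the swap-buffer count in the ZeRO-3 NVMe offload sections of the DeepSpeed config. A minimal sketch of the relevant fields (paths and values are illustrative, not taken from the thread):

```python
# Sketch of the relevant DeepSpeed ZeRO-3 NVMe offload config (illustrative values).
# buffer_count sets how many pinned swap buffers each offload path allocates.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/nvme",  # placeholder mount point for the NVMe drive
            "buffer_count": 4,     # the knob raised from 4 to 96 later in the thread
            "pin_memory": True,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/nvme",
            "buffer_count": 5,
            "pin_memory": True,
        },
    },
}
```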
Describe the bug
Hi @tjruwase, thanks for joining us. I have exactly the same problem as @etoilestar. Please let me give some more details.

To Reproduce
Then "buffer nbytes != file xxx" occurred.

System info (please complete the following information):
Thanks!
@ReyRen, could you please share your log as well?
@etoilestar and @ReyRen, I am trying to repro this issue. I am using a 4xV100-16GB which is probably different from your setups. |
Hello, thanks for your reply. He gets the same log as me. Here is my log:

[2023-05-04 01:51:44,732] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Also, sometimes the process freezes when running the same script.
@etoilestar, thanks for sharing your log. Can you please do the following:

1. Share the full stack trace printed with the error.
2. Check the size of the .swp file named in the assertion.
3. Try running with a smaller model.
Hello, could you tell me how to get the stack trace? The size of this file is 0, and I just want to use disk to train a larger model. Thanks.
The stack trace should be printed alongside the error message and shows the code path leading to the failure. A file size of 0 means the previous file write (creation) failed. Can you try running with a smaller model, as suggested?
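A quick way to spot such failed writes (a sketch, assuming the offload root that appears in the error log at the end of this thread):

```python
import os

# Assumed offload root, taken from the swap-file paths in the error log.
OFFLOAD_DIR = "/nvme/zero_stage_3"

# Report any tensor swap file with zero bytes; an empty file means the
# earlier swap-out (file creation/write) failed.
for root, _, files in os.walk(OFFLOAD_DIR):
    for name in files:
        if name.endswith(".swp"):
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                print(f"Empty swap file (failed write?): {path}")
```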
Yes, when I reduce the number of layers to 8, the program runs normally.
In that case, I am curious whether the failure is a filesystem problem, such as running out of disk space. How large is the offload folder?
It is around 10 TB. I guess this bug is caused by the NVMe not being as fast as expected.
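One quick check for the disk-space hypothesis (a sketch; the mount point is an assumption, adjust it to the actual nvme_path):

```python
import shutil

# Report capacity of the assumed NVMe offload mount point.
total, used, free = shutil.disk_usage("/nvme")
print(f"total={total / 2**40:.2f} TiB  used={used / 2**40:.2f} TiB  free={free / 2**40:.2f} TiB")
```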
Ideally, NVMe speed should affect throughput but not cause failures. If you would like to continue this investigation, can you please do the following?
Okay, I will try again later.
There is another situation: when I increase buffer_count from 4 to 96, the size of the .swp file is not zero, yet the process freezes. Maybe you can take it into consideration.
Hello, it seems that you did not finish the ViT model with PP/TP in https://github.com/microsoft/Megatron-DeepSpeed. I recently tried to write this code; can you give me some advice?
@etoilestar, apologies for the silence. Are you still interested in this issue? Thanks!
Thank you. I am focused on another part of your project now; I will close this issue.
python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank6/139649992513552.tensor.swp: buffer nbytes != file bytes 4001366016 != 3426746368
python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank2/139929715382368.tensor.swp: buffer nbytes != file bytes 4001366016 != 3599761408
python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank1/140296723433568.tensor.swp: buffer nbytes != file bytes 4001366016 != 3539992576
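For reference, the failing check compares the read buffer's size in bytes against the swap file's size on disk. A minimal Python sketch of the same consistency check (path and sizes are illustrative, mirroring the log above):

```python
import os
import torch

# Illustrative values modeled on the log: a ~4 GB fp32 read buffer and one
# rank's optimizer swap file (this path is an example, not a real file).
swap_path = "/nvme/zero_stage_3/optimizer/rank6/139649992513552.tensor.swp"
buffer = torch.empty(4001366016 // 4, dtype=torch.float32)

num_file_bytes = os.path.getsize(swap_path)

# Mirrors the C++ assertion in deepspeed_py_aio_handle.cpp:
#   static_cast<long long int>(buffer.nbytes()) == num_file_bytes
assert buffer.nbytes == num_file_bytes, (
    f"{swap_path}: buffer nbytes != file bytes {buffer.nbytes} != {num_file_bytes}"
)
```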