Saving checkpoints on network drive fails due to symlinks #2942

Open
eldarkurtic opened this issue Jan 31, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@eldarkurtic

Hi folks,
I am using llm-foundry to train some LLMs and am trying to save checkpoints directly to a network drive (AWS on-prem storage). The error I am hitting looks like this:

[Errno 524] Unknown error 524: 'ep0-ba2-rank0.pt' -> '/network/eldar/llmfoundry_checkpoints/test_x/latest-rank0.pt'

at the line:

File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 352, in _save_checkpoint
    os.symlink(os.path.relpath(src_path, os.path.dirname(symlink)), symlink)

FYI: saving to a local disk works just fine. I think the problem is that symlinks cannot be created on the network drive. For example, running touch test1.txt && ln -s test1.txt test2.txt on that drive results in the same Unknown error 524.
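
The manual touch/ln check above can also be scripted, which is handy for testing a save folder before launching a long run. The following is only a minimal sketch: the helper name supports_symlinks is hypothetical and is not part of Composer or llm-foundry. Errno 524 appears to correspond to the Linux kernel-internal ENOTSUPP code, which has no userspace name and would explain the "Unknown error" text, but that reading is an assumption here.

```python
import os
import tempfile


def supports_symlinks(directory: str) -> bool:
    """Return True if a relative symlink can be created inside `directory`.

    Hypothetical helper for illustration; not a Composer or llm-foundry API.
    """
    # Work in a scratch subdirectory so nothing is left behind afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target.txt")
        link = os.path.join(scratch, "link.txt")
        with open(target, "w") as f:
            f.write("probe")
        try:
            # Same shape as the failing call: a relative target next to the link.
            os.symlink(os.path.basename(target), link)
        except (OSError, NotImplementedError):
            # Filesystems that reject symlinks typically surface an OSError here.
            return False
        return os.path.islink(link)
```

Calling supports_symlinks("/network/eldar/llmfoundry_checkpoints") would be expected to return False on a drive that shows the error above, and True on a local disk.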

Do you have any suggestions on how to get around this restriction on creating symlinks on network drives? If not, is there a straightforward way to save checkpoints on the network drive but keep the symlinks on a local disk?
After digging a bit through the Composer library, I think this could be hacked in relatively easily, but I'm wondering whether it might break other parts of either Composer or llm-foundry.

eldarkurtic added the bug label Jan 31, 2024
@mvpatel2000
Contributor

You can specify save_latest_filename to keep the symlink on your local disk if that works for you. That seems like the easiest solution.

For object stores, we emulate a symlink by creating a file that has the path to the checkpoint in its contents. We could try building a similar solution for a network drive; this seems like the "right" solution. Unfortunately, it's not something we will be able to build, since we don't have access to network drives to test it, but I'm happy to work with you and give some guidance if you're interested.
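
The pointer-file idea described above could be adapted to a network drive roughly as follows. This is a sketch under stated assumptions, not Composer's actual implementation: the function names link_latest and resolve_latest and the ".symlink" suffix for the pointer file are chosen here for illustration. It tries a real symlink first and falls back to a small text file holding the relative path when the filesystem refuses.

```python
import os


def link_latest(src_path: str, latest_path: str) -> str:
    """Point `latest_path` at `src_path`; return the path actually written.

    Hypothetical helper: tries a real symlink, else writes a pointer file.
    """
    # Relative target, mirroring the os.symlink call in the traceback.
    target = os.path.relpath(src_path, os.path.dirname(latest_path))
    pointer_path = latest_path + ".symlink"  # assumed suffix for the pointer file
    # Clear any previous "latest" marker, whichever form it took.
    for stale in (latest_path, pointer_path):
        if os.path.lexists(stale):
            os.remove(stale)
    try:
        os.symlink(target, latest_path)
        return latest_path
    except OSError:
        # Filesystem rejected the symlink (e.g. errno 524): store the path as text.
        with open(pointer_path, "w") as f:
            f.write(target)
        return pointer_path


def resolve_latest(latest_path: str) -> str:
    """Return the checkpoint path that `latest_path` refers to."""
    if os.path.lexists(latest_path):
        return os.path.realpath(latest_path)
    pointer_path = latest_path + ".symlink"
    with open(pointer_path) as f:
        target = f.read().strip()
    # The stored path is relative to the directory holding the pointer file.
    return os.path.normpath(os.path.join(os.path.dirname(pointer_path), target))
```

Any code that loads from the "latest" checkpoint would need to go through something like resolve_latest instead of opening the path directly, which is the part most likely to touch other areas of Composer or llm-foundry.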
