Failed to acquire lock on file #183

Closed
huohuoxi opened this issue Jul 30, 2024 · 1 comment
Comments

@huohuoxi

This report comes from google/orbax#523 (comment).
I am hitting the same problem as the user there, who writes:
Hi, I am having some trouble saving a checkpoint. My code is very simple:

PyTreeCheckpointer().save(dir_ / 'actor_params', controller._actor_params)

and works fine on my local machine. However, when running on an HPC cluster, I get the following error:

ValueError: FAILED_PRECONDITION: Error writing local file "/home/oeberhard/laurel/dat/runs/test/working_directories/1/actor_params.orbax-checkpoint-tmp-1695894959581296/_sharding": Failed to acquire lock on file: /home/oeberhard/laurel/dat/runs/test/working_directories/1/actor_params.orbax-checkpoint-tmp-1695894959581296/_sharding.__lock [OS error: Function not implemented] [source locations='tensorstore/kvstore/file/file_key_value_store.cc:676\ntensorstore/kvstore/kvstore.cc:268']

There are no access problems i.e. checkpointing with flax.training.checkpoints works perfectly. Is there maybe a way to disable the locking? Thanks!

copybara-service bot pushed a commit that referenced this issue Aug 1, 2024
Update file locking:
On Linux, if ::fcntl(F_OFD_SETLKW) fails with ENOSYS/ENOTSUP, fallback to ::flock.
Replace FileLockTraits with the AcquireFdLock function, which returns a function pointer used to release the lock.
Better handling of errno when releasing a file lock, which was just wrong before.
Improve comments.

This may aid locking issues with some network filesystems.
#183

PiperOrigin-RevId: 658512121
Change-Id: Ie471e01b039ad108f9906813b86cd1b1d722be45
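
The fallback described in this commit can be illustrated outside tensorstore. What follows is only a minimal Python sketch of the same idea, not tensorstore's C++ code: `acquire_fd_lock` and `lock_probe` are hypothetical names, and Python's `fcntl.lockf` takes a traditional POSIX record lock rather than the open-file-description lock (`F_OFD_SETLKW`) named in the commit. The shape is the same, though: try the fcntl-style lock, and if the filesystem reports the call as unimplemented, use `flock` instead and hand back a function that releases whichever lock was taken.

```python
import errno
import fcntl
import os
import tempfile

# errno values meaning "this filesystem does not implement the lock call".
# On Linux, ENOTSUP has the same numeric value as EOPNOTSUPP.
_UNSUPPORTED = {errno.ENOSYS, errno.EOPNOTSUPP}


def acquire_fd_lock(fd):
    """Take an exclusive lock on fd and return a callable that releases it.

    Hypothetical helper: tries fcntl-style locking first and falls back to
    flock only when the first attempt fails with an "unsupported" errno.
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)
        return lambda: fcntl.lockf(fd, fcntl.LOCK_UN)
    except OSError as e:
        if e.errno not in _UNSUPPORTED:
            raise  # a real failure, not a missing feature
    fcntl.flock(fd, fcntl.LOCK_EX)
    return lambda: fcntl.flock(fd, fcntl.LOCK_UN)


def lock_probe(directory):
    """Return True if a lock file in `directory` can be locked and unlocked."""
    path = os.path.join(directory, "probe.__lock")
    # lockf with LOCK_EX needs a descriptor opened for writing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        release = acquire_fd_lock(fd)
        release()
        return True
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling `lock_probe` on the checkpoint directory of the affected cluster is a quick way to see whether either locking call is available there. An `OSError` raised from it means both attempts failed or the first failed for a reason other than missing support.
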
@laramiel
Collaborator

laramiel commented Aug 1, 2024

I added a fallback to ::flock locking in 5847d9a

This is released in tensorstore v0.1.64.

Please try it out and indicate whether it works on your network filesystem, as I don't have an equivalent system to test it on.
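
Before retesting, it may help to confirm that the installed tensorstore is at least v0.1.64. A small sketch, assuming the package is installed under the distribution name `tensorstore`; the helper names `parse_version` and `has_flock_fallback` are made up for this example:

```python
import re
from importlib import metadata

# First release stated in this thread to contain the flock fallback.
FIRST_FIXED = (0, 1, 64)


def parse_version(text):
    """Turn '0.1.64' into (0, 1, 64), reading leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_flock_fallback(version_text):
    """True if the given version string is v0.1.64 or newer."""
    return parse_version(version_text) >= FIRST_FIXED


if __name__ == "__main__":
    try:
        installed = metadata.version("tensorstore")
    except metadata.PackageNotFoundError:
        print("tensorstore is not installed")
    else:
        print(installed, "has flock fallback:", has_flock_fallback(installed))
```

If the check reports an older version, upgrading tensorstore is needed before the fallback can take effect.
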

@jbms jbms closed this as completed Oct 21, 2024