Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken symlinks for run_dir cause rt.sh to not remove symlink and error out when trying to create new link. #2148

Closed
BrianCurtis-NOAA opened this issue Feb 23, 2024 · 20 comments · Fixed by #2225
Assignees
Labels
bug Something isn't working

Comments

@BrianCurtis-NOAA
Copy link
Collaborator

Description

Broken symlinks for run_dir cause rt.sh to not remove symlink and error out when trying to create new link.

To Reproduce:

  • Run rt.sh
  • remove the actual location of run_dir, do not touch symlink
  • Run rt.sh

Fix?

Try using -L check along with -e to weed out if the symlink is broken.

@BrianCurtis-NOAA BrianCurtis-NOAA added the bug Something isn't working label Feb 23, 2024
@zach1221 zach1221 self-assigned this Mar 7, 2024
@zach1221
Copy link
Collaborator

Hey, @BrianCurtis-NOAA , I tried to replicate this on Hercules. I forced a failure first, so the run_dir directory with symlink was in the test/ directory, then ran another test using ./rt.sh -a epic -e -l rt.conf . The new run seemed to create a new run_dir with symlink , then once completed I see the removal of run_dir, with the symlink successfully deleted afterwards.

Is this the expected sequence of events or am I missing something?

@DeniseWorthen
Copy link
Collaborator

Run a test and then move the actual rt_XX to a different name. Then try to re-run the rt.sh.

@zach1221
Copy link
Collaborator

Ok, I think I replicated the error.

  • mkdir -p /scratch1/NCEPDEV/stmp2/Zachary.Shrader/FV3_RT/rt_3074235
  • [[ -e /scratch1/NCEPDEV/stmp2/Zachary.Shrader/ufs-weather-model/tests/run_dir ]]
  • echo 'Linking /scratch1/NCEPDEV/stmp2/Zachary.Shrader/FV3_RT/rt_3074235 to /scratch1/NCEPDEV/stmp2/Zachary.Shrader/ufs-weather-model/tests/run_dir'
    Linking /scratch1/NCEPDEV/stmp2/Zachary.Shrader/FV3_RT/rt_3074235 to /scratch1/NCEPDEV/stmp2/Zachary.Shrader/ufs-weather-model/tests/run_dir
  • ln -s /scratch1/NCEPDEV/stmp2/Zachary.Shrader/FV3_RT/rt_3074235 /scratch1/NCEPDEV/stmp2/Zachary.Shrader/ufs-weather-model/tests/run_dir
    ln: failed to create symbolic link '/scratch1/NCEPDEV/stmp2/Zachary.Shrader/ufs-weather-model/tests/run_dir': File exists
    ++ echo 'rt.sh error on line 908'

@DusanJovic-NOAA
Copy link
Collaborator

I still see this error:

+ cp fv3_conf/fv3_slurm.IN_hercules fv3_conf/fv3_slurm.IN
+ cp fv3_conf/compile_slurm.IN_hercules fv3_conf/compile_slurm.IN
+ eval 'set +x'
++ set +x
Linking /work2/noaa/stmp/djovic/stmp/djovic/FV3_RT/rt_96898 to /work/noaa/fv3-cam/djovic/ufs/e861/ufs-weather-model/tests/run_dir
ln: failed to create symbolic link '/work/noaa/fv3-cam/djovic/ufs/e861/ufs-weather-model/tests/run_dir': File exists
rt.sh: Getting error information...
Exited at line 1047 having code 1
rt.sh: Exited abnormally, killing workflow and cleaning up
rt.sh: Cleaning up...
rt.sh: Exiting.

What's the purpose of this run_dir? We didn't have that directory until recently and everything worked just fine.

@BrianCurtis-NOAA
Copy link
Collaborator Author

BrianCurtis-NOAA commented Apr 18, 2024

It's a link in the ufs-weather-model/tests dir directly to the FV3_RT/rt_##### dir

The rt.sh is supposed to remove that and create a new one on each call to rt.sh

ufs-weather-model/tests/rt.sh

Lines 1043 to 1048 in 47c0099

if [[ -L "${PATHRT}/run_dir" && -d "${PATHRT}/run_dir" ]]; then
rm "${PATHRT}/run_dir"
fi
echo "Linking ${RUNDIR_ROOT} to ${PATHRT}/run_dir"
ln -s "${RUNDIR_ROOT}" "${PATHRT}/run_dir"
echo "Run regression test in: ${RUNDIR_ROOT}"

Any idea why it didn't seem to.

@DusanJovic-NOAA
Copy link
Collaborator

I see it's a link to rt_####### directory but why it is there, why do we need it, how is it used?

@BrianCurtis-NOAA
Copy link
Collaborator Author

I see it's a link to rt_####### directory but why it is there, why do we need it, how is it used?

It's here to provide dev's a direct route to all test outputs (err, out etc..).

It's convenience, IMO. Why shouldn't we have it, why should we be concerned it's there? If it's because we can error out rt.sh, then I can fix the code once we figure out why it didn't remove the run_dir. Maybe the combination of -L and -d are not quite right?

@DusanJovic-NOAA
Copy link
Collaborator

Then why, for dev's convenience, we do not have a direct route to the baseline directory or input data directory or spack module files used etc. It's not necessary, it complicates scripts, it introduces potential errors. Like this one for example.

@BrianCurtis-NOAA
Copy link
Collaborator Author

Then why, for dev's convenience, we do not have a direct route to the baseline directory or input data directory or spack module files used etc. It's not necessary, it complicates scripts, it introduces potential errors. Like this one for example.

the rt_###### dir changes each run, and there's tons of information there when things fail. There's no need for a dev to be in any of those other places you mention when things fail. I disagree that we should ignore a convenient tool because we're worried it might cause an error now and then.

@jkbk2004
Copy link
Collaborator

Then why, for dev's convenience, we do not have a direct route to the baseline directory or input data directory or spack module files used etc. It's not necessary, it complicates scripts, it introduces potential errors. Like this one for example.

the rt_###### dir changes each run, and there's tons of information there when things fail. There's no need for a dev to be in any of those other places you mention when things fail. I disagree that we should ignore a convenient tool because we're worried it might cause an error now and then.

We can add a python script to safely handle directory sym link issue:

if(os.path.isdir(targetLink)):
    os.rmdir(targetLink)
else:
    os.unlink(targetLink)

@DusanJovic-NOAA
Copy link
Collaborator

Why do we need python to remove a symbolic link? How did people remove symbolic links before there was python.

Simple rm will do the job:

rm -rf "${PATHRT}/run_dir"
ln -s "${RUNDIR_ROOT}" "${PATHRT}/run_dir"

Please do not complicate things even further.

@BrianCurtis-NOAA
Copy link
Collaborator Author

@zach1221 We can go with Dusan's solution. Using rm -rf is generally unsafe, with symlinks it won't follow into those so it shouldn't pose a problem here.

You will want to remove the if sections of the code and just keep Dusan's part.

@DusanJovic-NOAA
Copy link
Collaborator

Why is rm -rf unsafe?

@BrianCurtis-NOAA
Copy link
Collaborator Author

Its necessary in code to use -rf like this, but in general I don't like using rm -rf with env vars in case the env var mysteriously gets changed before reaching that point and a directory gets removed on accident.

@jkbk2004
Copy link
Collaborator

Why is rm -rf unsafe?

@BrianCurtis-NOAA right, there is a chance to delete source directory with -rf when target link is deleted.

@BrianCurtis-NOAA
Copy link
Collaborator Author

If it matters, the reason I had the if statements was for just this reason. It's safe to use just rm because there's little risk if an env var gets changed, nothing resursive is done. I was trying to verify that it was a directory and a symlink before trying to use rm to be extra safe.

@zach1221
Copy link
Collaborator

So should I still add it? @BrianCurtis-NOAA you mean remove the below from lines 1044 - 1046?
if [[ -L "${PATHRT}/run_dir" && -d "${PATHRT}/run_dir" ]]; then rm "${PATHRT}/run_dir" fi

@BrianCurtis-NOAA
Copy link
Collaborator Author

Yeah, no need to keep that if/fi section.

@DusanJovic-NOAA
Copy link
Collaborator

DusanJovic-NOAA commented Apr 19, 2024

Why is rm -rf unsafe?

@BrianCurtis-NOAA right, there is a chance to delete source directory with -rf when target link is deleted.

I do not understand this. Are you saying that this line:

rm -rf "${PATHRT}/run_dir"

can remove source directory? Can you give exact sequence of commands how this can happen.

@zach1221
Copy link
Collaborator

Recommended workaround to this issue was committed with WM PR #2213 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
5 participants