Restore error checking in regression test system. (Combined PR#2357 and PR#2265) #2335

Merged Jul 18, 2024 (39 commits)

Conversation

@SamuelTrahanNOAA (Collaborator) commented Jun 21, 2024:

Commit Queue Requirements:

  • Fill out all sections of this template.
  • N/A No subcomponents. All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

The regression test system ignores all errors in all jobs: a job where fv3.exe crashed is treated the same as a job that ran but produced different results, and errors in compilation jobs are ignored too. This leads to several problems (a sketch of the failure mode follows the list):

  1. Test jobs are executed even if the compile job fails. (This bug is restricted to the Rocoto-based system.)
  2. Jobs with prerequisites (such as restart tests) run even if their prerequisite fails.
  3. A job that failed to copy input data, or a script with a syntax error, isn't caught until the entire workflow completes.
  4. Temporary system issues require rerunning the entire workflow instead of only affected jobs.
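
To make that failure mode concrete, here is a hypothetical sketch, assuming a bash test script (the variable and file names are illustrative, not the repository's actual code): when nothing checks exit statuses, a failed step is silently swallowed.

```bash
# Old behavior (sketch): no error checking, so a failed copy is ignored
# and the workflow manager sees the job exit with status 0 ("success").
cp "${INPUTDATA_ROOT}/required_input_file" .  # fails, but $? is never checked
./fv3.exe                                     # runs anyway, with missing inputs
exit 0                                        # job reports success regardless
```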

In this new version of the regression test system:

  1. Errors are caught and cause the metascheduler to mark the job as failed (see the sketch after this list).
  2. Dependencies are honored; if a job fails, anything that depends on it won't run.
  3. Jobs that run to completion but produce changed results are still considered to have succeeded; this behavior is unchanged.
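
A minimal sketch of the kind of error propagation this restores, assuming bash (illustrative only, not the PR's exact code):

```bash
#!/bin/bash
# Abort on the first failed command so the job exits non-zero, the
# metascheduler marks it failed, and dependent jobs are skipped.
set -euo pipefail
trap 'echo "test aborted at line ${LINENO}" >&2' ERR

cp "${INPUTDATA_ROOT}/required_input_file" .  # any failure now stops the job here
./fv3.exe                                     # a crash propagates as a failure
echo "PASS" > job_status                      # reached only if everything succeeded
```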

New Self-Test System

A new tests/error-test.conf contains several jobs that verify rt.sh fails and succeeds when it should. This is a semi-automated self-test. It contains:

  1. fail_to_copy - This test fails in run_test.sh before the job_card runs. That means ecFlow will be unable to submit the job, and other metaschedulers will see the job start and abort with exit status 1.
  2. fail_to_run - A test that fails inside the job_card. All metaschedulers should see the job abort with exit status 1.
  3. fail_to_compile - A compile job that will always fail to compile.
  4. dependency_unmet - Depends on fail_to_compile, so it should never be submitted.
  5. atm_dyn32 and control_c48.v2.sfc - These should succeed and match the baselines.
  6. atm_dyn64 and control_c48 - These should succeed but not match the baselines, because they use 64-bit dynamics instead of 32-bit.

All tests are variations on control_c48, since it is one of the cheapest and oldest tests in rt.conf and is unlikely to go away. (control_c48 exercises core functionality: a super-low-resolution GFS.)
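
As an illustration of the failing tests above, a deliberately failing test only needs to abort with a non-zero exit status at the right stage. A hypothetical sketch of something like fail_to_run, assuming bash (the actual test's contents may differ):

```bash
#!/bin/bash
# fail_to_run (sketch): abort inside the job_card so every metascheduler
# observes the job start and then exit with status 1.
echo "fail_to_run: aborting on purpose to exercise error handling" >&2
exit 1
```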

Commit Message:

* UFSWM - restore error checking to regression test system and add a self-test suite

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

N/A

UFSWM Blocking Dependencies:

This branch correctly detects failing tests, which means the regression tests will keep failing until this bug fix is merged.


Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes.

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@SamuelTrahanNOAA (Collaborator, Author):

Pinging @DeniseWorthen, who authored the relevant issue, and @DusanJovic-NOAA, who authored the original regression test system.

@SamuelTrahanNOAA (Collaborator, Author) commented Jun 21, 2024:

I've been testing this on top of #2326. The regression test system has proven completely unusable for development due to the lack of error checking: updating UPP and the modulefiles required many changes that caused subsets of the tests to fail. A regression test system that cannot differentiate between a test with changed results and a test that could not run at all is not useful for development.

@jkbk2004 (Collaborator):

@SamuelTrahanNOAA Can you follow up and clean up the super-linter complaint?

@SamuelTrahanNOAA (Collaborator, Author):

After updating this branch, I'm getting out-of-memory errors from some jobs on Hera when using Rocoto.

@SamuelTrahanNOAA (Collaborator, Author):

> After updating this branch, I'm getting out-of-memory errors from some jobs on Hera when using Rocoto.

The jobs succeeded on the second attempt. This may have been a temporary system issue.

@jkbk2004 mentioned this pull request Jul 8, 2024.
@jkbk2004 (Collaborator):

@SamuelTrahanNOAA Can you sync up the branch? We can start working on this PR. We also want to combine #2265 and #2357 into this PR.

@BrianCurtis-NOAA (Collaborator):

@jkbk2004 There are a TON of FIXMEs in these code changes. This should not be combined into anything until those are addressed.

@jkbk2004 (Collaborator) commented Jul 11, 2024:

@BrianCurtis-NOAA I agree; we will test this new error-checking functionality thoroughly. #2265 and #2357 are very minor PRs. Let's see how the testing goes. @zach1221 @FernandoAndrade-NOAA, let's confirm how error checking runs on each machine.

@SamuelTrahanNOAA (Collaborator, Author):

I've only tested this with Rocoto. For other workflow managers, I used educated guesses when modifying the scripts. The ecFlow system has logic and scripts outside of the tests/*.sh files. I don't know how that'll break things.

@zach1221 (Collaborator):

> I've only tested this with Rocoto. For other workflow managers, I used educated guesses when modifying the scripts. The ecFlow system has logic and scripts outside of the tests/*.sh files. I don't know how that'll break things.

@SamuelTrahanNOAA can you please sync up your PR? @FernandoAndrade-NOAA and I can then test ecflow across the rdhpcs.

@zach1221 added the "Ready for Commit Queue" label Jul 16, 2024.
@zach1221 (Collaborator):

@FernandoAndrade-NOAA @BrianCurtis-NOAA I think we're ready to begin the rest of the tests now.

@zach1221 changed the title to "Restore error checking in regression test system. (Combined PR#2357 and PR#2265)" Jul 17, 2024.
@zach1221 added and then removed the jenkins-ort (run ORT testing) label Jul 17, 2024.
@SamuelTrahanNOAA (Collaborator, Author):

Make sure you run the opnReqTest since this PR changes the regression test system itself. Anything related to testing should be tested.

@BrianCurtis-NOAA (Collaborator):

Still no Acorn; we can skip it.

@zach1221 (Collaborator):

> Make sure you run the opnReqTest since this PR changes the regression test system itself. Anything related to testing should be tested.

@SamuelTrahanNOAA The ORTs ran successfully. Logs are posted above for control_p8, regional_control, and a cpld_control case.
@jkbk2004 The Derecho queue really hasn't moved for me all day. Do we want to skip it?

@SamuelTrahanNOAA (Collaborator, Author):

> The Derecho queue really hasn't moved for me all day. Do we want to skip it?

Derecho was down for unplanned maintenance for six hours. You could log in, but your jobs would not run. It came out of maintenance about two hours ago.

@zach1221 (Collaborator):

Looks like the Derecho jobs are going through now.

@zach1221 (Collaborator):

OK, we should now be ready to proceed with merging.

@zach1221 merged commit 6a6ce43 into ufs-community:develop Jul 18, 2024 (3 checks passed).
@DeniseWorthen (Collaborator) commented Jul 19, 2024:

@SamuelTrahanNOAA Thanks for all your work on this. I just ran a PR where I expected two failures: one because an input file wasn't in the correct subdirectory of the input data, and one because the answers were going to be different.

The first of these reported "FAILED: RUN DID NOT COMPLETE", which is great. But the second, where the run completed and all the comparisons were made (they were all different), still reported the failure as "FAILED: UNABLE TO COMPLETE COMPARISON". Is that what you would expect?

@SamuelTrahanNOAA (Collaborator, Author):

I didn't improve the wording of the comparison messages. It would be great if they were more detailed than "unable to complete comparison."

@DeniseWorthen (Collaborator):

OK, thanks. I do think the message needs to indicate that the comparison failed, not that the comparison didn't complete. Maybe @BrianCurtis-NOAA can take a look at that.
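
For illustration, a hypothetical sketch (assuming bash, with illustrative variable names; not the PR's actual comparison code) of how the comparison step could distinguish the two outcomes:

```bash
# Distinguish "comparison never ran" from "results differ from baseline".
if [[ ! -f "${RUNDIR}/${FILE}" ]]; then
  echo "FAILED: UNABLE TO COMPLETE COMPARISON (missing ${FILE})"
elif ! cmp -s "${BASELINE_DIR}/${FILE}" "${RUNDIR}/${FILE}"; then
  echo "FAILED: RESULTS DIFFER FROM BASELINE (${FILE})"
else
  echo "PASS: ${FILE}"
fi
```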

Labels: No Baseline Change, Ready for Commit Queue
Successfully merging this pull request may close this issue: "modify report of test failures to clearly indicate when a test failed to compare because it did not run."