Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 'max_blocks' estimation and add checks #587

Merged
merged 4 commits into from
Apr 2, 2021

Conversation

phil-blain
Copy link
Member

PR checklist

  • Short (1 sentence) summary of your PR:
    See title
  • Developer(s):
    P Blain
  • Suggest PR reviewers from list in the column to the right.
  • Please copy the PR test results link or provide a summary of testing completed below.
    Did not run tests yet.
  • How much do the PR code changes differ from the unmodified code?
    • bit for bit (expected)
    • different at roundoff level
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • Yes
    • No
  • Does this PR add any new test cases?
    • Yes
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
    • Yes
    • No, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please provide any additional information or relevant details below:

This is something else that I found during my work on coupling with NEMO, triggered by the grid/decomposition we use. I don't know if it's too late for the upcoming release, but I thought I'd open the PR so we can discuss if we want to include it.

Full exaplanations in the commit messages (the third commit could be dropped, if we feel it's not necessary to add such a CPP):

6b918b1:

ice_domain: revert changes to 'max_blocks' computation from fcbea1de (#377)

In fcbea1d (Added DMI/nuopc driver and fixed ANGLET (#377), 2019-11-22), the
computation in 'ice_domain::init_domain_blocks' that estimates 'max_blocks' if
the value '-1' is given in the namelist was changed without explanations.

The old computation was computing the number of blocks in the X and Y
directions, taking any necessary padding into account (by substracting 1 to
n[xy]_global, using integer division and adding 1), multiplying them to compute
the total number of blocks, and integer-dividing by `nprocs` to estimate the number
of blocks per processor.

The new computation does a similar computation, but it's unclear what it is
computing exactly. Since it uses floating point division and only casts the
result to an integer at the end of the computation, it systematically computes
a `max_blocks` value that is smaller than the old computation.

This leads to a `max_blocks` value that is systematically too small for the
cartesian decomposition when `block_size_x(y)` does not divide `nx(y)_global`
evenly.

Go back to the old computation.

Also, adjust the documentation to make it clearer that it's possible that
the `max_blocks` value computed by the model might not be appropriate.

87a44ac:

ice_distribution: check 'max_blocks' is enough for all distributions

The subroutines 'create_distrb_cart', 'create_distrb_rake' and
'create_distrb_spacecurve', in contrast to the other distribution-creating
subroutines in module ice_distribution, do not check if the index they are
about to access in the `blockIndex` array of the distribution they are creating
is smaller then `max_blocks`.

This results in an out-of-bound access when `max_blocks` is too small, and
potential segementation faults.

Add checks for these three distributions. Additionnally, for the cartesian
distribution, compute the required minimum value for `max_blocks`, which can
be done easily in this case, abort early, and inform the user of the required
minimum value.

be714da:

cicecore: add 'DEBUG_BLOCKS' CPP macro to debug block decomposition

As mentioned in the documentation, subroutines 'ice_blocks::create_blocks' and
'ice_distribution::create_local_block_ids' can print block information to
standard out if the local variable `dbug` is modified to ".true.".

For convenience, add a 'DEBUG_BLOCKS' CPP macro that can be used to easily
set both variables to true, and add a 'cice.setup' option to activate this
new macro.

Adjust the documentation accordingly.

…ICE-Consortium#377)

In fcbea1d (Added DMI/nuopc driver and fixed ANGLET (CICE-Consortium#377), 2019-11-22), the
computation in 'ice_domain::init_domain_blocks' that estimates 'max_blocks' if
the value '-1' is given in the namelist was changed without explanations.

The old computation was computing the number of blocks in the X and Y
directions, taking any necessary padding into account (by substracting 1 to
n[xy]_global, using integer division and adding 1), multiplying them to compute
the total number of blocks, and integer-dividing by `nprocs` to estimate the number
of blocks per processor.

The new computation does a similar computation, but it's unclear what it is
computing exactly. Since it uses floating point division and only casts the
result to an integer at the end of the computation, it systematically computes
a `max_blocks` value that is smaller than the old computation.

This leads to a `max_blocks` value that is systematically too small for the
cartesian decomposition when `block_size_x(y)` does not divide `nx(y)_global`
evenly.

Go back to the old computation.

Also, adjust the documentation to make it clearer that it's possible that
the `max_blocks` value computed by the model might not be appropriate.
The subroutines 'create_distrb_cart', 'create_distrb_rake' and
'create_distrb_spacecurve', in contrast to the other distribution-creating
subroutines in module ice_distribution, do not check if the index they are
about to access in the `blockIndex` array of the distribution they are creating
is smaller then `max_blocks`.

This results in an out-of-bound access when `max_blocks` is too small, and
potential segementation faults.

Add checks for these three distributions. Additionnally, for the cartesian
distribution, compute the required minimum value for `max_blocks`, which can
be done easily in this case, abort early, and inform the user of the required
minimum value.
Copy link
Contributor

@eclare108213 eclare108213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new cpp -- those print statements can be very useful but they can also be overwhelmingly enormous, not something to have on all the time. This will need to be added to the 'case settings' table in the documentation.

I'll add @TillRasmussen to the reviewer list for discussion about the max_blocks calculation in ice_domain.F90. In general it's better to not use implicit types and integer calculations, but I can see that the newer version acts differently than the earlier one. Maybe there's a way to code this explicitly, so that it has the same result as the original formulation.

@phil-blain
Copy link
Member Author

This will need to be added to the 'case settings' table in the documentation.

Done.

@apcraig
Copy link
Contributor

apcraig commented Mar 30, 2021

This all looks fine to me. I too would like to understand the changes in the max_block computation over time. We don't want to be in a situation where two requirements are competing against each other and the implementation goes back and forth. I think all our testing automatically specifies a max block in the namelist that is specified or computed by the scripts and is probably even more conservative, so we may not be testing the internal computation. I think a main point is that we want an internal max blocks computation that will always generate a usable value which means it can be too large but should never be too small. If a user wants to fine tune that to reduce it to the min value, they can do that by manually setting the value in the namelist. My guess is that the "real" implementation resulted in a better computation in some cases, but was not conservative enough in others.

@phil-blain
Copy link
Member Author

phil-blain commented Mar 30, 2021

[...] I think all our testing automatically specifies a max block in the namelist that is specified or computed by the scripts and is probably even more conservative, so we may not be testing the internal computation.

Yes: cice.setup always calls cice_decomp.csh:

CICE/cice.setup

Line 867 in b720380

source ${casescr}/cice_decomp.csh

So if max_blocks is not set via cice.setup's -p argument, it gets set in the namelist to the value computed by cice_decomp.csh.

I think a main point is that we want an internal max blocks computation that will always generate a usable value which means it can be too large but should never be too small.

I agree. In my testing, even with the fix in this PR, it still does generate too small values, sometimes (but it then aborts cleanly with the checks that I added).

I looked at the cice_decomp.csh script and this is the computation that is done:

@ bx = $nxglob / ${blckx}
if ($bx * ${blckx} != $nxglob) @ bx = $bx + 1
@ by = $nyglob / ${blcky}
if ($by * ${blcky} != $nyglob) @ by = $by + 1
@ m = ($bx * $by) / ${task}
if ($m * ${task} != $bx * $by) @ m = $m + 1
set mxblcks = $m

So I think the only difference with what is in this PR is that the final max_blocks value is not rounded up if the number of procs does not divide the number of blocks evenly. So to do exactly the same computation it should be tweaked like this, I think:

diff --git i/cicecore/cicedynB/infrastructure/ice_domain.F90 w/cicecore/cicedynB/infrastructure/ice_domain.F90
index b7f0d08..d09aac5 100644
--- i/cicecore/cicedynB/infrastructure/ice_domain.F90
+++ w/cicecore/cicedynB/infrastructure/ice_domain.F90
@@ -192,8 +192,8 @@ subroutine init_domain_blocks
    call broadcast_scalar(add_mpi_barriers,  master_task)
    if (my_task == master_task) then
      if (max_blocks < 1) then
-       max_blocks=( ((nx_global-1)/block_size_x + 1) *         &
-                    ((ny_global-1)/block_size_y + 1) ) / nprocs
+       max_blocks=( (((nx_global-1)/block_size_x + 1) *         &
+                     ((ny_global-1)/block_size_y + 1) - 1) / nprocs + 1
        max_blocks=max(1,max_blocks)
        write(nu_diag,'(/,a52,i6,/)') &
          '(ice_domain): max_block < 1: max_block estimated to ',max_blocks

@apcraig
Copy link
Contributor

apcraig commented Mar 30, 2021

It seems like adding 1 may be worth doing at this point, especially if you are seeing cases where the max blocks is too small with the current computation. In my testing, I haven't seen much (if any) performance degradation from having max_blocks be a bit too large. It takes a little more memory, but then generally won't access that memory ever and memory access performance overall is probably not changed much. All we're doing is padding arrays in memory.

Ultimately, there could be decompositions that want to put more blocks on some tasks than others and this computation will not work for that. But this seems reasonable for now. thanks.

@phil-blain
Copy link
Member Author

I tweaked the computation to account for nprocs not dividing evenly the number of blocks.

@apcraig
Copy link
Contributor

apcraig commented Mar 30, 2021

GHActions is catching a compiler error with the latest change.

@phil-blain
Copy link
Member Author

Yeah I messed up the parenthesis 🤦 I've just fixed that and force-pushed

@phil-blain
Copy link
Member Author

OK so the CI failure is due to a broken connection from the runner to Zenodo:
https://github.com/CICE-Consortium/CICE/pull/587/checks?check_run_id=2230759756#step:11:58435

2021-03-30T19:22:27.0277120Z 2021-03-30 19:22:26 (4.69 MB/s) - Connection closed at byte 2537720689. Retrying.
2021-03-30T19:22:27.0277510Z 
2021-03-30T19:22:27.8913650Z --2021-03-30 19:22:27--  (try: 2)  https://zenodo.org/record/3728364/files/CICE_data_gx3_forcing_JRA55-20200320.tar.gz
2021-03-30T19:22:28.0625210Z Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
2021-03-30T19:22:28.5772140Z HTTP request sent, awaiting response... 429 Too Many Requests
2021-03-30T19:22:28.5773870Z 2021-03-30 19:22:28 ERROR 429: Too Many Requests.

The 'Download input data' step is successful in the workflow, despite wget encoutering an error. It's not the first time this happens, and I think I understand why. Each step in the workflow is ran by the shell specified by the "shell" keyword, or a platform default:
https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#using-a-specific-shell

We use the Cshell ("csh") (which in theory is not necessary, see #557) and thus use a custom shell:
https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#custom-shell

defaults:
run:
shell: /bin/csh {0}

but this invocation of csh does not ensure that any error encountered terminates the process, which is what is done by default for ex. Bash (see the Github documentation). So we should at least be using /bin/csh -e {0} to ensure errors are caught.
Also, we should invoke 'wget' with the --progress=dot:giga option to avoid countless number of lines in the CI log since the input data is really big. I'll make a note of these in a new issue.

For now, maybe re-running the workflow would work ?

@apcraig
Copy link
Contributor

apcraig commented Apr 1, 2021

If others agree, I think we could merge this. I will make one additional comment. I don't think we should use cpps when a namelist will work. CPPs should be reserved for things that cannot be implemented with namelist (like include netcdf, mpi, gtpl, pio, etc). There was an effort to clean up CPPs relatively recently, and I prefer that we not start creating new ones that are easily implemented as namelist. That also makes them even more usable as you can switch back and forth without having to rebuild. I know there is some gray area, but this doesn't not seem to be one of them. Can we switch the DEBUG_BLOCKS to a namelist variable?

@phil-blain
Copy link
Member Author

Yes, I agree. I used a CPP simply because the blocks debugging it's not used often so I figured in that case it would we one less condition in the code. But I can switch it to a namelist variable if we prefer to go that way.

@eclare108213
Copy link
Contributor

In this case, I think the performance hit of the extra conditional is probably negligible for code that's only exercised during initialization, and so the namelist option makes sense.

@phil-blain
Copy link
Member Author

Totally. plus, there is already an if(dbug) condition in the code! :P

…sition

As mentioned in the documentation, subroutines 'ice_blocks::create_blocks' and
'ice_distribution::create_local_block_ids' can print block information to
standard out if the local variable `dbug` is modified to ".true.".

For convenience, replace these local variables with a namelist variable,
'debug_blocks', and add a 'cice.setup' option to activate this new
variable in the namelist.

Adjust the documentation accordingly.
The value of 'max_blocks' as currently computed can still be too small
if the number of procs does not divide the number of blocks evenly.

Use the same 'minus one, integer divide, plus one' trick as is done for
the number of blocks in the X and Y directions in order to always
compute a 'max_blocks' value that is large enough.

This estimates the same value for 'max_blocks' as the 'cice_decomp.csh'
script computes:

    @ bx = $nxglob / ${blckx}
    if ($bx * ${blckx} != $nxglob) @ bx = $bx + 1
    @ by = $nyglob / ${blcky}
    if ($by * ${blcky} != $nyglob) @ by = $by + 1

    @ m = ($bx * $by) / ${task}
    if ($m * ${task} != $bx * $by) @ m = $m + 1
    set mxblcks = $m
@phil-blain
Copy link
Member Author

phil-blain commented Apr 1, 2021

I've pushed an updated version. The new commit is this one: efa7e28

Copy link
Contributor

@apcraig apcraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @phil-blain

@TillRasmussen
Copy link
Contributor

A late comment. I think that the reason that the dble division was introduced is to avoid the truncation towards zero that I understand happens when integer divisions returns a modulus larger than zero. In the integer divison this was accounted for when finding the number of blocks in x nblocks_x=(nx_global-1)/block_size_x+1 and similar for y. This was not accounted for when dividing with nprocs. I think that Phillipes solution will fix this.

This line max_blocks=max(1,max_blocks) should be removed as it will never be below 1.

@apcraig apcraig merged commit 3d0e747 into CICE-Consortium:master Apr 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants