Parallel interpolation #108

Merged: johnomotani merged 101 commits into master from parallel-interpolation on Aug 29, 2020
Conversation

johnomotani (Collaborator)

Provides methods BoutDataArray.getHighRes() to get a version of a variable interpolated in the parallel direction to increase the poloidal resolution, or BoutDataset.getHighResVars(['var1', 'var2', ...]) to get a Dataset with high-resolution versions of a list of variables. An example from TORPEX simulations, before and after interpolation:
[figure 'base': before interpolation]
[figure 'highres': after interpolation]
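A minimal usage sketch for the DataArray-level method (the file paths, the variable name "n", and the exact accessor/call signature here are assumptions, not taken from the PR):

```python
from xbout import open_boutdataset

# Open a BOUT++ simulation (paths are hypothetical)
ds = open_boutdataset("data/BOUT.dmp.*.nc", gridfilepath="data/gridfile.nc")

# Interpolate one variable in the parallel direction to increase the poloidal resolution
n_highres = ds["n"].bout.getHighRes()
```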

Also adds a feature to the tests: @pytest.mark.long can be used to mark a test as long, in which case it is skipped by default. Long tests are run if the --long argument is passed to pytest, and the Travis tests do run them.
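A common way to wire up such an option in conftest.py looks like the sketch below; the PR's actual implementation may differ in its details.

```python
# conftest.py (sketch)
import pytest


def pytest_addoption(parser):
    parser.addoption("--long", action="store_true", default=False,
                     help="run tests marked as long")


def pytest_configure(config):
    config.addinivalue_line("markers", "long: mark a test as long-running")


def pytest_collection_modifyitems(config, items):
    # Skip @pytest.mark.long tests unless --long was given
    if config.getoption("--long"):
        return
    skip_long = pytest.mark.skip(reason="need --long option to run")
    for item in items:
        if "long" in item.keywords:
            item.add_marker(skip_long)
```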

Includes #107; this PR will be only 1,384 additions and 363 deletions once that is merged.

Need to re-chunk grid after using xr.concat to re-join the grid Dataset
with upper boundary points removed. Otherwise 'y' is re-chunked into at
least two parts.
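For illustration, the re-chunking pattern looks roughly like this (toy data, not the PR's actual code):

```python
import numpy as np
import xarray as xr

# Two hypothetical pieces of a grid Dataset, each chunked along 'y'
part_a = xr.Dataset({"Rxy": ("y", np.zeros(12))}).chunk({"y": 12})
part_b = xr.Dataset({"Rxy": ("y", np.zeros(4))}).chunk({"y": 4})

# After concatenating, 'y' is split into at least two chunks at the join...
grid = xr.concat([part_a, part_b], dim="y")

# ...so re-chunk to get a single chunk along 'y' again
grid = grid.chunk({"y": grid.sizes["y"]})
```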
If set to false, does not store the results in the Dataset, in order to
save memory.
Making zShift a coordinate means that the Dataset is no longer required
for the interpolation, so move the method to the BoutDataArray.
Provides a workaround for when the boundary cells were not saved in the
data file.
Use coordinate ranges stored in Region instead.
Instead, will provide functionality to save the high-resolution
variables into a new Dataset.
Distribute the output points of the high-resolution field like a
standard BOUT++ cell-centred variable.
Increase jyseps*, ny, ny_inner, MYSUB to reflect the new resolution.
This method only actually does one thing, and is not required to be
called before parallel interpolation, so rename to be clearer, and make
'n' argument non-optional.
By converting the DataArrays in each region into Datasets, they can be combined
with xarray.combine_by_coords. It is then natural to return a Dataset, making
it more straightforward to merge the results of calling this method for
several variables into a single Dataset.
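Schematically (toy per-region data; the real code interpolates each region first):

```python
import numpy as np
import xarray as xr

# Two hypothetical per-region results: named DataArrays covering different
# ranges of the 'theta' coordinate
region1 = xr.DataArray(np.zeros(4), dims="theta",
                       coords={"theta": np.linspace(0.0, 0.9, 4)}, name="n")
region2 = xr.DataArray(np.ones(4), dims="theta",
                       coords={"theta": np.linspace(1.0, 1.9, 4)}, name="n")

# Converting each DataArray to a Dataset lets xarray stitch the regions back
# together by their coordinate values
combined = xr.combine_by_coords([region1.to_dataset(), region2.to_dataset()])
```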
Information in regions is not correct for the high-res variable, and
needs to be recalculated later.
In BoutDataArray.highParallelRes(), copy the attrs from the first part
of the variable to the combined Dataset.
After interpolating to higher parallel resolution, a Dataset has the
correct coordinates, but no regions. This commit makes
add_toroidal_geometry_coords and add_s_alpha_geometry_coords skip
adding coordinates if the coordinates are already present, so that the
functions can be applied again to interpolated Datasets. At the moment,
the only thing this does is to re-create the regions.
Need to slice hthe with 'y' instead of 'theta' if it was read from grid
file.
Adding 'dy' as a coordinate allows it to be assembled correctly when
DataArrays are combined with combine_by_coords, which is much more
straightforward than recalculating it from the y-coordinate. When
initialising a Dataset from the interpolated variables, will demote 'dy'
to a variable again.
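In xarray terms the promotion and later demotion look like this (a toy Dataset standing in for the real one):

```python
import numpy as np
import xarray as xr

# Toy stand-in for a BOUT++ variable plus its 'dy' grid spacing
ds = xr.Dataset({"n": ("theta", np.zeros(8)), "dy": ("theta", np.full(8, 0.1))})

# Promote 'dy' to a coordinate so it is carried through combine_by_coords...
ds = ds.set_coords("dy")

# ...then demote it back to an ordinary data variable in the final Dataset
ds = ds.reset_coords("dy")
```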
Add method BoutDataset.getHighParallelResVars() that takes a list of
variables, and returns a new BoutDataset containing those variables with
an increased parallel resolution. The new Dataset is a fully valid
BoutDataset, so all plotting methods, etc. work.
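A usage sketch (the variable names and the assumption that no further arguments are needed are mine, not from the PR):

```python
# 'ds' is an open BoutDataset; the result is itself a fully valid BoutDataset,
# so plotting and analysis methods work on it as usual
ds_highres = ds.bout.getHighParallelResVars(["n", "T"])
```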
xarray.testing also provides an assert_allclose function, so it is
clearer to be explicit about which module the function belongs to.
Attributes like 'direction_y' only make sense for a particular
DataArray, not the whole Dataset.
Some coordinates corresponding to x (calculated from the index), y
(calculated from dy) and z (calculated from ZMIN and ZMAX) can always be
created, although they might be named differently. So create them in the
top-level apply_geometry() function, not the registered functions for
particular geometries.
Previously were off by half a grid-cell.
Needed to pass checks in toFieldAligned().
For interpolation, where there is a physical boundary, want the limit of
the coordinate (that is stored in the region) to be the global
coordinate value at the boundary, not at the grid edge (which was what
was stored previously).
Conflicts:
    xbout/boutdataset.py
    xbout/geometries.py
    xbout/tests/test_boutdataset.py
Issue with merging attrs has been fixed in xarray-0.16.0, so can remove
workaround, as well as fixing problem with inconsistent regions with new
default compat="no_conflicts" for xarray's combine_by_coords().
The result to be returned is updated_ds, checking ds meant always adding
a new xcoord to updated_ds, even if it was already added by
add_geometry_coords().
Ensure 'metadata', 'options', 'regions' and 'geometry' attributes are
always added to all coordinates. Ensures consistency between original
and saved-and-reloaded Datasets, allowing some workarounds in tests to
be removed.
@johnomotani (Collaborator, Author)

Merge conflicts fixed now. Also updated some handling of attrs, because xarray-0.16.0 fixed some attrs-propagation, allowing removal of some workarounds. In the process, I've made all 'coordinate' variables always have "metadata", "options", "regions" and "geometry" attrs, like the 'data_vars' variables do. This makes things more consistent between a BoutDataset opened from *.dmp.*.nc files and one saved and re-loaded by xBOUT, simplifying the save-and-reload tests.

Adding attrs to the 'ycoord' coordinate in
d062fa9 made interpolate_parallel()
very slow. Don't understand why, but adding 'da = da.compute()' before
the interpolation restores the speed.
@johnomotani (Collaborator, Author)

That was a very strange issue. Tests failed on d062fa9 because they timed out. It turns out adding attrs to the theta coordinate made .interp() really slow in BoutDataArray.interpolate_parallel(). Adding a da = da.compute() before the da.interp() call fixed the slow-down. I guess adding attrs to the theta coordinate somehow messed up the dask task-graph, but I don't even begin to understand how, or why calling da = da.compute() would fix it!

xarray-0.16.0 is required now; older versions will fail the tests. xarray
requires a dask release less than 6 months old, so xarray-0.16.0 requires
dask-2.10.
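In packaging terms this corresponds to pins along the following lines (the repository's actual setup files may phrase them differently):

```python
# Sketch of the minimum-version pins implied above (hypothetical setup.py excerpt)
install_requires = [
    "xarray>=0.16.0",
    "dask[array]>=2.10.0",
]
```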
@codecov-commenter commented Jul 30, 2020

Codecov Report

Merging #108 into master will increase coverage by 0.46%.
The diff coverage is 81.59%.


@@            Coverage Diff             @@
##           master     #108      +/-   ##
==========================================
+ Coverage   70.83%   71.30%   +0.46%     
==========================================
  Files          14       14              
  Lines        1519     1697     +178     
  Branches      306      359      +53     
==========================================
+ Hits         1076     1210     +134     
- Misses        353      382      +29     
- Partials       90      105      +15     
Impacted Files Coverage Δ
xbout/plotting/animate.py 42.02% <0.00%> (ø)
xbout/plotting/plotfuncs.py 6.54% <0.00%> (ø)
xbout/plotting/utils.py 16.66% <18.18%> (ø)
xbout/load.py 81.39% <50.00%> (-0.28%) ⬇️
xbout/boutdataset.py 74.68% <75.43%> (-3.22%) ⬇️
xbout/boutdataarray.py 82.65% <87.01%> (+2.65%) ⬆️
xbout/utils.py 92.00% <87.09%> (-8.00%) ⬇️
xbout/geometries.py 78.57% <88.23%> (+4.46%) ⬆️
xbout/region.py 88.40% <90.24%> (-2.41%) ⬇️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a89feb6...8e3e14a

Removing attrs from y-coordinate means we do not need to call
da.compute(), which would load the entire result into memory. It is
better not to, as the result may be sliced or processed somehow later
and we don't want to force loading in case the variable is too large to
fit in memory.
This was intended to be moved from add_toroidal_geometry_coords() into
apply_geometry(), but ended up being added back into
add_toroidal_geometry_coords() in a merge.
The coordinates of a DataArray that has been interpolated will have
attrs that are not consistent with the new DataArray. This commit
updates _update_metadata_increased_resolution() to also replace the
attrs of the DataArray's coords with the attrs of the new DataArray.
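Schematically, the added step is along these lines ('da' is assumed to be the freshly interpolated DataArray; this is not the literal body of _update_metadata_increased_resolution()):

```python
# Overwrite each coordinate's attrs so they are consistent with the new DataArray
for coord in da.coords:
    da[coord].attrs = dict(da.attrs)
```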
@johnomotani (Collaborator, Author)

As @TomNicholas noticed when discussing the performance regression issue today, using da = da.compute() forces computing and loading into memory the whole result, which we'd like to avoid if possible - especially if the variable being interpolated is too big to fit into memory after interpolation.

I poked a bit more, and found that deleting the "regions" attr (which is a dict of Region objects, so I guess relatively big) on the 'theta' coordinate also fixed the performance regression. Coordinates always have their data in numpy arrays, not dask arrays. I still can't manage to make a simple reproducer, but apparently attrs on a numpy-backed variable passed to dask.array.map_blocks() (which is used down in the xarray interp methods) can slow things down by a factor of 10. (This was for a unit test case with stupidly small arrays and stupidly small chunks, so it might not translate to a normal dataset, but there aren't that many chunks, and the test was taking ~35s in all, which seems high for the tiny array it's working on, although there are a lot of variables that I added as coordinates that also need interpolating.)

With the fix, the unit test only takes ~11s. Some of the cases of the same test (see below) with vars_to_interpolate=... (so all variables in the Dataset are interpolated) were taking ~6 mins before the fix, and ~30s now.

@TomNicholas - I'd like to put this as an FYI issue on xarray, but without being able to reproduce for the devs there it seems like a waste of time. The case I've been checking is time pytest --long -s -k "test_interpolate_parallel[vars_to_interpolate0-False-False-guards0]", and commenting out this bit

# This prevents da.interp() from being very slow.
# Apparently large attrs (i.e. regions) on a coordinate which is passed as an
# argument to dask.array.map_blocks() slow things down, maybe because coordinates
# are numpy arrays, not dask arrays?
# Slow-down was introduced in d062fa9e75c02fbfdd46e5d1104b9b12f034448f when
# _add_attrs_to_var(updated_ds, ycoord) was added in geometries.py
da[ycoord].attrs = {}

should re-introduce the slow-down if anyone has time to take a look.

@TomNicholas (Collaborator)

Thanks for this report @johnomotani. This behaviour is definitely something I would like to understand, and ideally flag up with an issue on xarray, but for that I think it would need a reproducible example. With this I can come back to it later at least.

Do you want me to review this so you can merge it and move on?

@johnomotani (Collaborator, Author)

Do you want me to review this so you can merge it and move on?

That would be great! 👍

Staggered grid cases are not implemented yet; these would need to use
zShift_CELL_XLOW or zShift_CELL_YLOW (which may or may not be present in
the Dataset, depending on the PhysicsModel).
Not all dump files (especially older ones) have cell_location attrs
written, so if none is present, assume it's OK to do toFieldAligned and
fromFieldAligned with the cell-centre zShift since we cannot check.
The index-value coordinates are now added for dimensions without
coordinates after the geometry is applied, so no 'x' coordinate has been
created to drop.
cell_location attribute only makes sense for a DataArray not a whole
Dataset, so remove in to_dataset() method.
Seems to be required at the moment to avoid an import error in the
minimum-package-versions test.
@johnomotani (Collaborator, Author)

Tests pass now! Any more review comments before I merge?

@johnomotani johnomotani merged commit 2c8ae2a into master Aug 29, 2020
@johnomotani johnomotani deleted the parallel-interpolation branch August 29, 2020 18:24
Labels: enhancement (New feature or request)