Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RasterIO backend #1260

Merged
merged 42 commits into from
Jun 6, 2017
Merged

Add RasterIO backend #1260

merged 42 commits into from
Jun 6, 2017

Conversation

fmaussion
Copy link
Member

@fmaussion fmaussion commented Feb 10, 2017

Follow-up to #1070

This is my first backend so this is going to be a bit more work than I expected, but with your help we should be able to get through this.

A long todo list:

@fmaussion
Copy link
Member Author

Before I'll get more into details with what needs to be done with rasterio itself, I'd like to get some xarray internals ready first. No need to do a full review yet, but I'd appreciate help with the following points:

  1. The order of the variables and coordinates is random. The current __repr__ can look like this:
<xarray.Dataset>
Dimensions:  (band: 1, x: 4, y: 3)
Coordinates:
  * x        (x) float64 1.0 1.5 2.0 2.5
  * band     (band) int64 1
  * y        (y) float64 2.0 1.0 0.0
Data variables:
    lat      (y, x) float64 2.0 2.0 2.0 2.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
    raster   (band, y, x) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 ...
    lon      (y, x) float64 1.0 1.5 2.0 2.5 1.0 1.5 2.0 2.5 1.0 1.5 2.0 2.5
Attributes:
    crs: CRS({'wktext': True, 'proj': 'longlat', 'no_defs': True, 'ellps': 'WGS84'})
    transform: [1.0, 0.5, 0.0, 2.0, 0.0, -1.0]

How is this possible?

  1. Is it possible to make the x and y coords lazy? Could you point me on a place in the code where this has been done already?

  2. I'd like to have the lon and lat variables listed as optional coordinates, and also make them lazy. Any hint?

Thanks a lot for your help, I'm afraid this is going to need a few iterations ;-)

@shoyer
Copy link
Member

shoyer commented Feb 10, 2017

The order of the variables and coordinates is random.

Most likely you're using a dict instead of an OrderedDict somewhere.

Is it possible to make the x and y coords lazy?

See here for an example of lazily computing arrays:

class MaskedAndScaledArray(utils.NDArrayMixin):

There are other examples in that file, for decoding CF conventions.

I'd like to have the lon and lat variables listed as optional coordinates

Can you clarify what you mean by an optional coordinate?

@fmaussion
Copy link
Member Author

Can you clarify what you mean by an optional coordinate?

Yes sorry, I meant the have them listed as coordinates without * instead of Data variables.

@fmaussion
Copy link
Member Author

thanks @shoyer , I think I can work on this a bit further now and I'll get back to you if I have more questions.

@shoyer
Copy link
Member

shoyer commented Feb 11, 2017

Getting optional coordinates is hard is because the current "backends" system in xarray was really only designed for handling netCDF-like libraries.

For example, every backend gets passed through xarray.conventions.decode_cf to use CF conventions to decode its data. You would actually need to add a "global attribute" specifying coords='lat lon' to set coordinates variables with this system.

One alternative would be to simply write a function for reading files with rasterio straight into xarray Dataset objects.

There are probably hacks that can get this working with data stores but this really calls out for the bigger refactor like #1087.

@fmaussion
Copy link
Member Author

I made some progress with the lazy indexing, I'd be glad to have a first rough feedback on whether this is going in the right direction or not.

We have a decision to make regarding the API: I think that creating the optional lon and lat coords automatically is not a good idea:

  • in some cases, the x and y coordinates are already lons and lats and the 2D coords are obsolete
  • for big data files this is going to take ages and take a lot of memory
  • my initial idea to make them lazily evaluated might work, but in an ugly way: computing both lons and lats needs to be done in one single operation, and I'm not sure how this can be done in an elegant way
  • additionally, there is no way to make them show up as coordinates (as per @shoyer 's comment: Add RasterIO backend #1260 (comment))

The current implementation delegates this task to a utility function (get_latlon_coords_from_crs). It is currently very rasterio specific, but could be made more general -- API to be defined.

Thoughts?

@shoyer
Copy link
Member

shoyer commented Feb 15, 2017

my initial idea to make them lazily evaluated might work, but in an ugly way: computing both lons and lats needs to be done in one single operation, and I'm not sure how this can be done in an elegant way

This could be done sanely with dask.array (example of multiple outputs from an element-wise function).

additionally, there is no way to make them show up as coordinates (as per @shoyer 's comment: #1260 (comment))

We could make this work, it just looks pretty hacky.

@fmaussion
Copy link
Member Author

@shoyer thanks for the tips, I think that getting the lons and lats from dask is probably the most elegant method.

After trying various things I am still struggling with dask, and in particular on how to apply elemwise to functions like np.meshgrid (I get shape broadcasting errors). To get me started, I'd be grateful for an example on how to use dask to replace the code snippet below:

import xarray as xr
import numpy as np

ds = xr.DataArray(np.zeros((2, 3)),
                  coords={'x': np.arange(3), 'y': np.arange(2)},
                  dims=['y', 'x']).to_dataset(name='data')

# non-dask version
lon, lat = np.meshgrid(ds.x, ds.y)
ds['lon'] = (('y', 'x'), lon)
ds['lat'] = (('y', 'x'), lat)
ds.set_coords(['lon', 'lat'], inplace=True)
print(ds)

Thanks a lot!

If True, cache data loaded from the underlying datastore in memory as
NumPy arrays when accessed to avoid reading from the underlying data-
store multiple times. Defaults to True unless you specify the `chunks`
argument to use dask, in which case it defaults to False. Does not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last sentence can be safely removed.

chunks : int, tuple or dict, optional
Chunk sizes along each dimension, e.g., ``5``, ``(5, 5)`` or
``{'x': 5, 'y': 5}``. If chunks is provided, it used to load the new
DataArray into a dask array. This is an experimental feature; see the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably remove the "experimental feature" bit. Dask is not going away.

lock : False, True or threading.Lock, optional
If chunks is provided, this argument is passed on to
:py:func:`dask.array.from_array`. By default, a per-variable lock is
used when reading data from netCDF files with the netcdf4 and h5netcdf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does rasterio/GDAL need a thread lock? Let's remove netcdf4/h5netcdf specific references here and add sane defaults.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot safely share a Rasterio or GDAL dataset object among threads, so a lock may be needed. See https://trac.osgeo.org/gdal/wiki/FAQMiscellaneous#IstheGDALlibrarythread-safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I am on it.

@shoyer , I have trouble to understand what's going on in the open_dataset workflow. The documentation says: "By default, a per-variable lock is used when reading data from netCDF files with the netcdf4 and h5netcdf engines", but then the lock is decided once here. This will default to GLOBAL_LOCK and then passed to maybe_decode_store. Where is the "per variable" logic happening? Or is the GLOBAL_LOCK already doing this? (it is defined here)

(excuse my poor knowledge about these issues)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation is out of sync here. We use a global lock for HDF5 (and should probably use one for Rasterio, too).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I understand much more what's going on, and also why salem was having troubles when adding diagnostic variables to netcdf objects. Can I update the docs all at once here?

import dask
if LooseVersion(dask.__version__) < LooseVersion('0.6'):
raise ImportError(
'xarray requires dask version 0.6 or newer')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dask v0.6 is pretty old now, maybe better to replace this whole conditional logic with:

import dask
from dask.base import tokenize

@shoyer
Copy link
Member

shoyer commented May 31, 2017 via email

@fmaussion
Copy link
Member Author

OK, all green.

Currently the rasterio tests are running on py36 only. Should I add rasterio to the other test suites as well?

@shoyer
Copy link
Member

shoyer commented May 31, 2017

Currently the rasterio tests are running on py36 only. Should I add rasterio to the other test suites as well?

Yes, except for the one that tests bare-minimal dependencies.

@gidden
Copy link
Contributor

gidden commented Jun 1, 2017

Hey @fmaussion, is this ready for me to try out again? I wanted to let you and @shoyer iterate first.

@fmaussion
Copy link
Member Author

@gidden yes absolutely please give it a try, I think it's ready

@fmaussion
Copy link
Member Author

@gidden I just updated the documentation recipe to use an accessor to compute the lons and lats, let me know what you think

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me, just a couple more very minor points


# Define the accessor
@xr.register_dataarray_accessor('rasterio')
class RasterioAccessor(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean towards keeping this as just a function -- it keeps the example simpler and an accessor is probably overkill for this problem.

return out


def open_rasterio(filename, chunks=None, cache=None, lock=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default should be lock=None to use the global lock.

token = tokenize(filename, mtime, chunks)
name_prefix = 'open_rasterio-%s' % token
if lock is None:
lock = GLOBAL_LOCK
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, can you make a separate global lock for rasterio? We can safely use GDAL and HDF5 at the same time.

@@ -1438,6 +1438,265 @@ class TestPyNioAutocloseTrue(TestPyNio):
autoclose = True


@requires_rasterio
class TestRasterio(CFEncodedDataTest, Only32BitTypes, TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any advantage to inheriting from the other test methods, given that this isn't an actual datastore. I think this can just inherit from TestCase (then you can delete the first two methods).

with xr.open_dataarray(tmp_nc_file) as ncds:
assert_identical(rioda, ncds)

# Create a geotiff file in latlong proj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's self-contained, make this a separate test method?

@shoyer shoyer mentioned this pull request Jun 5, 2017
@fmaussion
Copy link
Member Author

OK, let's get this one in and see what people will report about it.

Thanks @shoyer for your patience, @NicWayand and @jhamman for the original PR, @gidden for the testing/reviews and @sgillies for rasterio!

@jhamman
Copy link
Member

jhamman commented Jun 6, 2017

Nice work @fmaussion. Thanks for bringing this home.

@PeterDSteinberg
Copy link

+1 @fmaussion - Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants