sketch profiling #4

Closed

vincentsarago
Member

vincentsarago commented May 16, 2023

A simple profiler that counts the number of S3 GET requests and collects other statistics (total timing and an optional cProfile summary):

@profile(add_to_return=True, cprofile=True, quiet=True)
def get_tile(src_path: str, x: int, y: int, z: int, *, variable: str, **kwargs: Any):

    with xarray_open_dataset(
        src_path,
        z=z,
        **kwargs,
    ) as dataset:
        
        dataarray, _ = update_dataset(dataset, variable=variable)
        
        with XarrayReader(dataarray) as src_dst:
            return src_dst.tile(x, y, z)

_, logs = get_tile(
    "s3://power-analysis-ready-datastore/power_901_monthly_meteorology_utc.zarr", 
    4, 
    4, 
    4, 
    variable="TS",
)

print(logs)
# Output:
{'GET': 52,
 'Timing': 5.186955213546753,
 'cprofile': ['   ncalls  tottime  percall  cumtime  percall filename:lineno(function)',
  "       40    4.844    0.121    4.844    0.121 {method 'acquire' of '_thread.lock' objects}",
  "        1    0.027    0.027    0.028    0.028 {method 'start' of 'rasterio._env.GDALEnv' objects}",
  '        8    0.017    0.002    0.021    0.003 core.py:2142(_decode_chunk)',
  '        1    0.015    0.015    0.015    0.015 {rasterio._warp._reproject}',
  '      408    0.011    0.000    0.015    0.000 typing.py:1065(_get_protocol_attrs)',
  '113732/112582    0.011    0.000    0.042    0.000 {built-in method builtins.isinstance}',
  '2221/2189    0.010    0.000    0.014    0.000 indexing.py:512(shape)',
  '        9    0.007    0.001    0.007    0.001 crs.py:183(__init__)',
  '     2437    0.006    0.000    0.082    0.000 variable.py:194(as_compatible_data)',
  '39365/33478    0.004    0.000    0.005    0.000 {built-in method builtins.len}']}
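
For readers who want to try the idea without this PR's code, here is a minimal sketch of a decorator with the same three options (add_to_return, cprofile, quiet). It is not the implementation from this PR. It assumes GET requests can be counted by watching a logger for messages containing "GET"; the logger_name, pattern and top parameters, the "demo.http" logger and the fake_read function are all hypothetical and only exist to make the example runnable.

```python
import cProfile
import functools
import io
import logging
import pstats
import time
from typing import Any, Callable


class _CountingHandler(logging.Handler):
    """Count log records whose message contains a given pattern."""

    def __init__(self, pattern: str) -> None:
        super().__init__(level=logging.DEBUG)
        self.pattern = pattern
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        if self.pattern in record.getMessage():
            self.count += 1


def profile(
    add_to_return: bool = False,
    cprofile: bool = False,
    quiet: bool = False,
    logger_name: str = "demo.http",  # hypothetical logger, not a real library's
    pattern: str = "GET",  # assumption: requests are logged with this token
    top: int = 10,
) -> Callable:
    """Collect request counts, wall time and optional cProfile lines."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            logger = logging.getLogger(logger_name)
            handler = _CountingHandler(pattern)
            previous_level = logger.level
            logger.addHandler(handler)
            logger.setLevel(logging.DEBUG)
            profiler = cProfile.Profile() if cprofile else None
            start = time.time()
            try:
                if profiler is not None:
                    result = profiler.runcall(func, *args, **kwargs)
                else:
                    result = func(*args, **kwargs)
            finally:
                elapsed = time.time() - start
                logger.removeHandler(handler)
                logger.setLevel(previous_level)

            logs = {pattern: handler.count, "Timing": elapsed}
            if profiler is not None:
                stream = io.StringIO()
                stats = pstats.Stats(profiler, stream=stream)
                stats.sort_stats("tottime").print_stats(top)
                # Keep the non-blank lines of the pstats report.
                logs["cprofile"] = [
                    line for line in stream.getvalue().splitlines() if line.strip()
                ]
            if not quiet:
                print(logs)
            return (result, logs) if add_to_return else result

        return wrapper

    return decorator


@profile(add_to_return=True, cprofile=True, quiet=True)
def fake_read(n: int) -> int:
    """Stand-in for a tile read: logs one fake GET per chunk."""
    log = logging.getLogger("demo.http")
    for i in range(n):
        log.debug("GET chunk/%d", i)
    return n


value, logs = fake_read(3)
```

With a real reader, the logger name and the pattern would have to match whatever the underlying I/O library actually logs, which this sketch does not attempt to guess.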

@abarciauskas-bgse
Contributor

@vincentsarago looking at the results, most of the time is spent in method 'acquire' of '_thread.lock' objects, so I'm wondering whether multithreading is an option. Do you know much about multithreaded reading with the xarray or zarr libraries? I'm only finding a little on this, mostly indicating that dask is the way to do parallel reading of Zarr data, or perhaps that we could use Zarr's ThreadSynchronizer or blosc.use_threads = True: https://zarr.readthedocs.io/en/v2.14.2/tutorial.html#configuring-blosc
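
One possible reading of the profile above, offered as an assumption rather than a diagnosis: cProfile only records the thread it runs in, so when that thread waits for work done in other threads, the waiting time is attributed to a lock acquire. The sketch below reproduces that pattern with the standard library only. slow_fetch and read_chunks are hypothetical stand-ins, and time.sleep simulates network latency.

```python
import cProfile
import io
import pstats
import time
from concurrent.futures import ThreadPoolExecutor


def slow_fetch(delay: float) -> float:
    """Stand-in for a remote chunk read; sleeps to simulate latency."""
    time.sleep(delay)
    return delay


def read_chunks(n: int, delay: float) -> list:
    """Fetch n chunks in worker threads while the caller waits."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(slow_fetch, [delay] * n))


profiler = cProfile.Profile()
results = profiler.runcall(read_chunks, 4, 0.05)

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

In this toy case the sleep calls happen in worker threads and do not appear in the report, while the main thread's wait shows up as a lock acquire. If the same thing is happening in the tile read, the lock time would be I/O wait rather than contention, but that needs to be checked against the actual request logs.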

cc @sharkinsspatial

@sharkinsspatial
Member

@abarciauskas-bgse As I mentioned in the Slack thread, can you configure s3fs logging as described in https://s3fs.readthedocs.io/en/latest/#logging?
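
A minimal sketch of what that could look like with Python's standard logging module, assuming the logger is named "s3fs" as the linked documentation states. The handler and the format string are illustrative choices, not something prescribed by s3fs.

```python
import logging
import sys

# Logger name taken from the s3fs documentation linked above.
s3fs_logger = logging.getLogger("s3fs")
s3fs_logger.setLevel(logging.DEBUG)

# Send records to stderr with a timestamp so request ordering is visible.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
s3fs_logger.addHandler(handler)
```

The same documentation page also describes an S3FS_LOGGING_LEVEL environment variable as a quicker way to turn the messages on.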

@sharkinsspatial
Member

Martin raises a good question in pydata/xarray#6033 (comment): we should first try to understand whether xarray is loading chunks serially or concurrently.
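
One way to answer that question empirically is to record when each chunk read starts and ends and check whether the intervals overlap. The sketch below does this with a hypothetical TimedStore wrapper around a plain dict, using time.sleep to simulate latency. It does not touch xarray or zarr; wrapping a real store this way is an assumption that would need to be verified against the store interface in use.

```python
import threading
import time
from collections.abc import Mapping
from concurrent.futures import ThreadPoolExecutor


class TimedStore(Mapping):
    """Hypothetical read-only wrapper that records each read interval."""

    def __init__(self, inner: dict, delay: float = 0.05) -> None:
        self._inner = inner
        self._delay = delay
        self._lock = threading.Lock()
        self.intervals = []

    def __getitem__(self, key):
        start = time.perf_counter()
        time.sleep(self._delay)  # simulated network latency
        value = self._inner[key]
        end = time.perf_counter()
        with self._lock:
            self.intervals.append((start, end))
        return value

    def __iter__(self):
        return iter(self._inner)

    def __len__(self):
        return len(self._inner)


def max_overlap(intervals) -> int:
    """Return the largest number of reads in flight at the same time."""
    events = [(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals]
    current = peak = 0
    # Ends sort before starts at equal timestamps, so touching reads
    # are not counted as overlapping.
    for _, delta in sorted(events):
        current += delta
        peak = max(peak, current)
    return peak


chunks = {f"chunk/{i}": bytes(8) for i in range(4)}

serial_store = TimedStore(chunks)
for key in chunks:
    serial_store[key]
serial_peak = max_overlap(serial_store.intervals)

threaded_store = TimedStore(chunks)
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(threaded_store.__getitem__, chunks))
threaded_peak = max_overlap(threaded_store.intervals)

print(serial_peak, threaded_peak)
```

A peak of 1 means the reads were strictly serial; anything higher means at least some reads were in flight together.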

@sharkinsspatial
Member

pydata/xarray#1385

@abarciauskas-bgse
Contributor

abarciauskas-bgse commented May 22, 2023

@vincentsarago - @sharkinsspatial and I discussed today whether we could compare tiling using XarrayReader and pgstac. I started to look into how to do this and noticed you have some code set up in titiler-pgstac that loads some items into a local pgstac instance and then uses siege: https://github.com/stac-utils/titiler-pgstac/tree/main/benchmark (I know you have mentioned this before).

I started working on a simplified version of the tile method in pgstac for code profiling here: https://github.com/stac-utils/titiler-pgstac/compare/main...abarciauskas-bgse:titiler-pgstac:ab/wip-simple-pgstac-profiling?expand=1 but I'm pretty new to that codebase, so if you have any feedback let me know - I opened this issue: stac-utils/titiler-pgstac#96

One thing I know I will have to change about that implementation is that the data itself should really be stored on S3, so that we compare tiling Zarr 🍎s with COG 🍏s when both are stored on S3. @sharkinsspatial do you have an S3 COG dataset you suggest we use (it looks like there are a lot on AWS), or should we create one?

@vincentsarago
Member Author

@abarciauskas-bgse as I commented in titiler-pgstac, I'm not sure I understand the need to compare mosaic tiling when we have not yet done simple non-mosaic tile benchmarking, which will give you 99% of the difference between COG and Zarr in the mosaic world 🤷

What you started is great for benchmarking mosaic tiles, but using global COGs is something we usually don't encounter (because there is then no need to mosaic them). You'll end up benchmarking two things:

  • the pgstac response to the tile request (which I wanted to do in https://github.com/vincentsarago/pgstac-benchmark)
  • the multithreaded mosaic reader (but in this case of global COGs, it will be kinda weird because the final tile will use only ONE COG, while the tiler might have opened 10 files depending on the CPUs available)
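
The second point can be illustrated with a toy simulation. This is a simplified sketch, not the actual mosaic reader: it assumes assets are read in batches of `threads` files and that the mosaic stops as soon as one asset fully covers the tile. read_asset, mosaic_tile and the asset names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def read_asset(name: str) -> dict:
    """Pretend every asset is a global COG that fully covers the tile."""
    return {"asset": name, "covers_tile": True}


def mosaic_tile(assets: list, threads: int) -> tuple:
    """Read assets in batches of `threads`; stop once the tile is filled."""
    opened = []
    used = []
    for i in range(0, len(assets), threads):
        batch = assets[i : i + threads]
        # The whole batch is opened concurrently before any result is used.
        with ThreadPoolExecutor(max_workers=threads) as pool:
            results = list(pool.map(read_asset, batch))
        opened.extend(r["asset"] for r in results)
        for r in results:
            used.append(r["asset"])
            if r["covers_tile"]:
                return opened, used
    return opened, used


opened, used = mosaic_tile([f"cog-{i}" for i in range(25)], threads=10)
print(len(opened), used)
```

Under these assumptions ten files are opened but only the first contributes pixels, so the benchmark would mostly measure work that is thrown away.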

@abarciauskas-bgse
Contributor

@vincentsarago thanks for this - I agree that we need to put more thought into how to compare COG and Zarr tiling performance. Would you mind adding the ⬆️ comments to the discussion stac-utils/titiler-pgstac#97 as well?

@abarciauskas-bgse
Contributor

@vincentsarago is it ok if we close this since the work has moved to https://github.com/developmentseed/tile-benchmarking?

@vincentsarago
Member Author

👍
