xr.concat consuming too much resources #1379

Open
rafa-guedes opened this issue Apr 20, 2017 · 4 comments

@rafa-guedes
Contributor

Hi,
I am reading several (~1000) small ASCII files into Dataset objects and trying to concatenate them over one specific dimension, but I eventually blow my memory up. The file glob is not huge (~700 MB; my computer has ~16 GB) and I can read all the Datasets fine if I only append them to a list without concatenating them (my memory usage increases by only about 5% by the time I have read them all).

However, when concatenating each file into a single Dataset as I read them in a loop, processing slows down drastically before I have read about 10% of the files, and memory usage keeps going up until it eventually blows up before I have read and concatenated 30% of them (the screenshot below was taken before it blew up; memory usage was under 20% at the start of the processing).

I was wondering if this is expected, or if there is something that could be improved to make this work more efficiently. I'm changing my approach now: extracting numpy arrays from the individual Datasets, concatenating those numpy arrays, and defining the final Dataset only at the end.
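For concreteness, a minimal sketch of the patterns described above; the file path pattern, the `read_ascii` helper, the `time`/`site` dimensions and the `hs` variable are placeholders rather than the actual code:

```python
import glob
import numpy as np
import xarray as xr

def read_ascii(path):
    # Placeholder for the real reader: here we fake a one-timestep Dataset
    # so the sketch runs end to end.
    return xr.Dataset(
        {"hs": (("time", "site"), np.random.rand(1, 50))},
        coords={"time": [path], "site": np.arange(50)},
    )

paths = sorted(glob.glob("data/*.txt"))  # ~1000 small files, ~700 MB in total

# Reading the files into a plain list is fine (memory grows by ~5% overall).
datasets = [read_ascii(p) for p in paths]

# Concatenating inside the loop is what slows down and eventually blows memory up.
combined = read_ascii(paths[0])
for path in paths[1:]:
    combined = xr.concat([combined, read_ascii(path)], dim="time")

# Current workaround: concatenate plain numpy arrays and only build the final
# Dataset at the end.
stacked = np.concatenate([ds["hs"].values for ds in datasets], axis=0)
```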

Thanks.

[Screenshot from 2017-04-21 11:14:27 showing the memory usage described above]

@rafa-guedes
Contributor Author

Reading all the Datasets into a list first and then concatenating that list in a single call also blows the memory up.

@rafa-guedes
Contributor Author

rafa-guedes commented Apr 21, 2017

I realised that some of the Datasets I was trying to concatenate had different coordinate values (for coordinates I was assuming to be the same), so I guess xr.concat was trying to align these coordinates before concatenating, and the resulting Dataset ended up much larger than it should have been. When I ensure I only concatenate Datasets with consistent coordinates, it works.

However, resource consumption is still quite high compared to doing the same thing with numpy arrays: memory increased by 42% using xr.concat (against 6% using np.concatenate), and the whole processing took about 4 times longer.
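To illustrate the alignment effect with made-up names (`hs`, `time`, `site`): two Datasets whose non-concatenated coordinate does not overlap get outer-joined by xr.concat, so the result is padded with NaNs and ends up much larger than either input.

```python
import numpy as np
import xarray as xr

# Two datasets whose "site" coordinates do not overlap at all.
ds1 = xr.Dataset({"hs": (("time", "site"), np.random.rand(1, 100))},
                 coords={"time": [0], "site": np.arange(100)})
ds2 = xr.Dataset({"hs": (("time", "site"), np.random.rand(1, 100))},
                 coords={"time": [1], "site": np.arange(100, 200)})

out = xr.concat([ds1, ds2], dim="time")
print(dict(out.sizes))  # {'time': 2, 'site': 200}: twice the width, half of it NaN
```

Checking the shared coordinates up front, e.g. `ds1.indexes["site"].equals(ds2.indexes["site"])`, is one way to catch this before concatenating.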

@shoyer
Member

shoyer commented Apr 21, 2017

Alignment and broadcasting mean that xarray.concat is inherently going to be slower than np.concatenate. But little effort has gone into optimizing it, so performance could quite likely be improved with some work.

My guess is that some combination of automatic alignment and/or broadcasting in concat is causing the issue with exploding memory usage here. See #1354 for related discussion -- contributions would certainly be welcome here.
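As a hedged sketch of how the alignment and comparison work can be skipped once the inputs are known to be consistent: `data_vars`/`coords` have been available in `xr.concat` for a while, whereas `compat="override"` and `join="override"` only exist in more recent xarray versions, and the `hs`/`time`/`site` names are again illustrative.

```python
import numpy as np
import xarray as xr

# Toy per-file Datasets that share identical "site" coordinates.
datasets = [
    xr.Dataset({"hs": (("time", "site"), np.random.rand(1, 100))},
               coords={"time": [t], "site": np.arange(100)})
    for t in range(2)
]

combined = xr.concat(
    datasets,
    dim="time",
    data_vars="minimal",  # only concatenate variables that already contain "time"
    coords="minimal",     # same for coordinate variables
    compat="override",    # skip per-variable equality checks, take values from the first Dataset
    join="override",      # skip index alignment (newer xarray versions only)
)
```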

@stale

stale bot commented Mar 22, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Mar 22, 2019
@stale stale bot closed this as completed Apr 21, 2019
@dcherian dcherian reopened this Apr 21, 2019
@stale stale bot removed the stale label Apr 21, 2019