xr.concat consuming too much resources #1379

Open
rafa-guedes opened this issue Apr 20, 2017 · 4 comments

@rafa-guedes
Contributor

Hi,
I am reading several (~1000) small ASCII files into Dataset objects and trying to concatenate them over one specific dimension, but I eventually blow my memory up. The file glob is not huge (~700 MB; my computer has ~16 GB) and I can read all the Datasets fine if I only append them to a list without concatenating them (my memory usage increases by only about 5% by the time I have read them all).

However, when concatenating each file into a single Dataset as I read them in a loop, processing slows down drastically before I have read about 10% of the files, and memory usage keeps going up until it eventually blows up before I have read and concatenated 30% of them (the screenshot below was taken before it blew up; memory usage was under 20% at the start of the processing).

I was wondering if this is expected, or if there is something that could be improved to make this work more efficiently. I'm changing my approach now: extracting numpy arrays from the individual Datasets, concatenating those numpy arrays, and defining the final Dataset only at the end.
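For concreteness, a minimal sketch of the patterns described above; the file path pattern, the `read_ascii` helper, the `time`/`site` dimensions and the `hs` variable are placeholders rather than the actual code:

```python
import glob
import numpy as np
import xarray as xr

def read_ascii(path):
    # Placeholder for the real reader: here we fake a one-timestep Dataset
    # so the sketch runs end to end.
    return xr.Dataset(
        {"hs": (("time", "site"), np.random.rand(1, 50))},
        coords={"time": [path], "site": np.arange(50)},
    )

paths = sorted(glob.glob("data/*.txt"))  # ~1000 small files, ~700 MB in total

# Reading the files into a plain list is fine (memory grows by ~5% overall).
datasets = [read_ascii(p) for p in paths]

# Concatenating inside the loop is what slows down and eventually blows memory up.
combined = read_ascii(paths[0])
for path in paths[1:]:
    combined = xr.concat([combined, read_ascii(path)], dim="time")

# Current workaround: concatenate plain numpy arrays and only build the final
# Dataset at the end.
stacked = np.concatenate([ds["hs"].values for ds in datasets], axis=0)
```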

Thanks.

[Screenshot from 2017-04-21 11:14:27 showing the memory usage described above]

@rafa-guedes
Contributor Author

Reading all the Datasets into a list first and then concatenating that list in a single call also blows the memory up.

@rafa-guedes
Contributor Author

rafa-guedes commented Apr 21, 2017

I realised that some of the Datasets I was trying to concatenate had different coordinate values (for coordinates I was assuming to be the same), so I guess xr.concat was trying to align these coordinates before concatenating, and the resulting Dataset ended up much larger than it should have been. When I ensure I only concatenate Datasets with consistent coordinates, it works.

However, resource consumption is still quite high compared to doing the same thing with numpy arrays: memory increased by 42% using xr.concat (against 6% using np.concatenate), and the whole processing took about 4 times longer.
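To illustrate the alignment effect with made-up names (`hs`, `time`, `site`): two Datasets whose non-concatenated coordinate does not overlap get outer-joined by xr.concat, so the result is padded with NaNs and ends up much larger than either input.

```python
import numpy as np
import xarray as xr

# Two datasets whose "site" coordinates do not overlap at all.
ds1 = xr.Dataset({"hs": (("time", "site"), np.random.rand(1, 100))},
                 coords={"time": [0], "site": np.arange(100)})
ds2 = xr.Dataset({"hs": (("time", "site"), np.random.rand(1, 100))},
                 coords={"time": [1], "site": np.arange(100, 200)})

out = xr.concat([ds1, ds2], dim="time")
print(dict(out.sizes))  # {'time': 2, 'site': 200}: twice the width, half of it NaN
```

Checking the shared coordinates up front, e.g. `ds1.indexes["site"].equals(ds2.indexes["site"])`, is one way to catch this before concatenating.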

@shoyer
Member

shoyer commented Apr 21, 2017

Alignment and broadcasting mean that xarray.concat is inherently going to be slower than np.concatenate. But little effort has gone into optimizing it, so performance could quite likely be improved with some work.

My guess is that some combination of automatic alignment and/or broadcasting in concat is causing the issue with exploding memory usage here. See #1354 for related discussion -- contributions would certainly be welcome here.
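As a hedged sketch of how the alignment and comparison work can be skipped once the inputs are known to be consistent: `data_vars`/`coords` have been available in `xr.concat` for a while, whereas `compat="override"` and `join="override"` only exist in more recent xarray versions, and the `hs`/`time`/`site` names are again illustrative.

```python
import numpy as np
import xarray as xr

# Toy per-file Datasets that share identical "site" coordinates.
datasets = [
    xr.Dataset({"hs": (("time", "site"), np.random.rand(1, 100))},
               coords={"time": [t], "site": np.arange(100)})
    for t in range(2)
]

combined = xr.concat(
    datasets,
    dim="time",
    data_vars="minimal",  # only concatenate variables that already contain "time"
    coords="minimal",     # same for coordinate variables
    compat="override",    # skip per-variable equality checks, take values from the first Dataset
    join="override",      # skip index alignment (newer xarray versions only)
)
```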

@stale

stale bot commented Mar 22, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Mar 22, 2019
@stale stale bot closed this as completed Apr 21, 2019
@dcherian dcherian reopened this Apr 21, 2019
@stale stale bot removed the stale label Apr 21, 2019