Can we get batch data use `df.to_pandas()` in the case of big data? #345

kxbin · 2021-03-25T08:08:07Z

Hello, everybody. I have a question about the DataFrame.to_pandas() API.

Can we fetch data in batches with df.to_pandas() when the data set is very large?

For example:
Suppose there are 100 million rows of data in Elasticsearch.
Then please look at the code below:

df = ed.DataFrame(...)            # df now has 100 million rows
df.to_pandas(show_progress=True)  # Very slow, because Elasticsearch can only return 10,000 rows per query

Could we use something like this?

iterator = df.to_pandas_by_batch(batch_size=1000)
for batch in iterator:
    df_batch = batch.to_pandas()

Is there any other way to do the same thing? Thanks!


sethmlarson · 2021-07-31T17:06:44Z

This is a great feature suggestion, would you be interested in implementing this? I can assist if needed.

kxbin · 2021-08-03T02:32:49Z

Yes, I'd be very happy to implement it and open a pull request.

Related pull request:
Add to_pandas_in_batch() DataFrame API #369

Please help check whether the code conforms to the project's conventions. Thanks for your help!

NickolayVasilishin · 2021-08-09T20:47:08Z

Hi guys!
Not exactly related, but still regarding big data: have you thought about parallelized scans of ES? Something like a multiprocessing.Pool with ES sliced scans?
I was looking into the Eland source and saw that you've just removed scans from the code.

sethmlarson · 2021-08-10T15:08:41Z

@NickolayVasilishin Thanks for the suggestion. Chatted with the team about this and I have a few things to report:

Using sliced scroll for retrieving documents isn't faster than normal scroll if the documents are all going to the same place. Sliced scroll is meant more for the multiple-worker use case.
Slices are coming to Point-in-Time searches in 7.15 (see "Support search slicing with point-in-time", elasticsearch#74457), but due to the above point we probably won't see performance gains by using them.

NickolayVasilishin · 2021-08-11T12:55:41Z

@sethmlarson thanks for the reply.

Yes, exactly, and that's why I'm talking about having a multiprocessing.Pool, for example.
So for large scans (e.g. >1M documents) one could set something like to_pandas(parallelism=10) to use 10 cores, with subprocesses retrieving results independently and collecting them into a single pandas.DataFrame.

Currently, I'm patching the eland.operations.search_yield_hits function in my project with my own parallelized function to achieve that, and I would be happy to have this available at the API level.
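
For readers following along, here is a minimal sketch of the fan-out and merge pattern described above. It is not Eland code: fetch_slice is a hypothetical stand-in that partitions 20 fake documents, where a real worker would send a "slice" clause with its own search request. The sketch also uses a thread pool rather than the multiprocessing.Pool mentioned above, on the assumption that the work is mostly waiting on network I/O.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Hit = Dict[str, int]


def fetch_slice(slice_id: int, max_slices: int) -> List[Hit]:
    """Hypothetical stand-in for one sliced scroll.

    A real worker would include {"slice": {"id": slice_id, "max": max_slices}}
    in its search request; here 20 fake documents are partitioned by id.
    """
    return [{"_id": i} for i in range(20) if i % max_slices == slice_id]


def parallel_scan(fetch: Callable[[int, int], List[Hit]], max_slices: int) -> List[Hit]:
    """Fetch every slice concurrently and concatenate the results in slice order."""
    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        parts = pool.map(lambda i: fetch(i, max_slices), range(max_slices))
        # Consume the lazy map iterator while the pool is still open.
        return [hit for part in parts for hit in part]


hits = parallel_scan(fetch_slice, max_slices=4)
```

The merged list could then be turned into a single pandas.DataFrame. Whether this is actually faster than a single scroll depends on the cluster and on where the client spends its time, which is the open question in this thread.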

sethmlarson · 2021-08-11T17:30:12Z

@NickolayVasilishin Gotcha, I assumed that's what you meant but maybe I should try doing some testing myself over large data sets. I'll report back with my findings, thanks! Also I wouldn't recommend depending on anything in the eland.operations module, that's a private API (maybe we need to mark that better).

NickolayVasilishin · 2021-08-14T21:23:12Z

@sethmlarson thanks! I'd be happy to help with that.

Yes, it's pretty clear that this function is private, and patching it is very risky in terms of version compatibility, so I don't think any additional marking is needed.
However, I needed that functionality to test Eland in my project, so I take that risk.

sethmlarson added the "enhancement" and "help wanted" labels Jul 31, 2021

kxbin mentioned this issue Aug 3, 2021

Add iterrows() and itertuples() DataFrame API, its usage is similar to pandas #369

Closed

sethmlarson mentioned this issue Aug 20, 2021

Add iterrows() and itertuples() DataFrame API, its usage is similar to pandas #380

Merged

sethmlarson closed this as completed in #380 Aug 20, 2021
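
As a rough illustration of how a row iterator can cover the batching use case from the original question, here is a minimal sketch. The iter_batches helper is hypothetical and not part of Eland, and the generator below only stands in for df.itertuples(), which is assumed to yield one tuple per row as it does in pandas.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_batches(rows: Iterable[Tuple], batch_size: int) -> Iterator[List[Tuple]]:
    """Yield lists of at most batch_size rows from any row iterator."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for df.itertuples(); a real Eland DataFrame would stream rows
# from Elasticsearch instead of generating them locally.
fake_rows = ((i, f"doc-{i}") for i in range(7))
batches = list(iter_batches(fake_rows, batch_size=3))
```

Each batch is a plain list of row tuples, so only batch_size rows need to be held in memory at a time, and a batch can be passed to pandas.DataFrame if a pandas object is needed.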
