Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we get batch data use df.to_pandas() in the case of big data? #345

Closed
kxbin opened this issue Mar 25, 2021 · 7 comments · Fixed by #380
Closed

Can we get batch data use df.to_pandas() in the case of big data? #345

kxbin opened this issue Mar 25, 2021 · 7 comments · Fixed by #380
Labels
enhancement New feature or request help wanted Solution is fleshed out and ready to be worked on

Comments

@kxbin
Copy link
Contributor

kxbin commented Mar 25, 2021

Hello, everybody, There are one question about DataFrame.to_pandas() API

Can we get batch data use df.to_pandas() in the case of big data?

For example:
There are 100 million rows data in Elasticsearch
Then please look at the code below

df = ed.DataFrame(...)         // Now, df have 100 million rows
df.to_pandas(show_progress=True) // This will very slowly, because Elasticsearch only can return 10,000 rows per query

Can we use like this?

iterator = df.to_pandas_by_batch(batch_size=1000)
for batch in iterator:
    df_batch = batch.to_pandas()

Is there any other way to do the same things? Thanks!

@sethmlarson
Copy link
Contributor

This is a great feature suggestion, would you be interested in implementing this? I can assist if needed.

@sethmlarson sethmlarson added enhancement New feature or request help wanted Solution is fleshed out and ready to be worked on labels Jul 31, 2021
@kxbin
Copy link
Contributor Author

kxbin commented Aug 3, 2021

Yeah, I am very happy to implement it and pull a requests.

Related to this pull requests:
Add to_pandas_in_batch() DataFrame API #369

Please help to see if the code conforms to the specification, Thanks for your help!

This is a great feature suggestion, would you be interested in implementing this? I can assist if needed.

@NickolayVasilishin
Copy link

Hi guys!
Not exactly related, but still regarding big data - did you think about parallelized scans of ES? Like to have something like multiprocessing.Pool with ES sliced scans?
I was looking into Eland sources and saw that you've just eliminated scans from the code.

@sethmlarson
Copy link
Contributor

@NickolayVasilishin Thanks for the suggestion. Chatted with the team about this and I have a few things to report:

  • Using sliced scroll for retrieving documents isn't faster than normal scroll if the documents are going to the same place. Sliced scroll is more to make the multiple workers use-case better.
  • Slices are coming to Point-in-Time searches in 7.15: Support search slicing with point-in-time elasticsearch#74457 but due to the above point we probably won't see performance gains by using them.

@NickolayVasilishin
Copy link

@sethmlarson thanks for the reply.

Yes, exactly, so that's why I'm talking about having a multiprocessing.Pool for example.
So for large scans (e.g. >1M documents) one can set something like to_pandas(parallelism=10) to utilize 10 cores with subprocesses retrieving results independently and collecting them into a single resulting pandas.DataFrame.

Currently, I'm patching eland.operations.search_yield_hits function in my project with my parallelized function in order to achieve that and would be happy to have this possibility on API level.

@sethmlarson
Copy link
Contributor

sethmlarson commented Aug 11, 2021

@NickolayVasilishin Gotcha, I assumed that's what you meant but maybe I should try doing some testing myself over large data sets. I'll report back with my findings, thanks! Also I wouldn't recommend depending on anything in the eland.operations module, that's a private API (maybe we need to mark that better).

@NickolayVasilishin
Copy link

@sethmlarson thanks! I'd be happy to help with that.

Yes, it's pretty clear that this function is private and patching is very dangerous in terms of versions compatibility, I think, so no need for additional marks on that.
However, I needed that functionality to test Eland in my project, so I take that risk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Solution is fleshed out and ready to be worked on
Projects
None yet
3 participants