
Advice on the storage of Pandas DataFrames #215

Closed
aabdullah-bos opened this issue Sep 24, 2018 · 3 comments

Comments

@aabdullah-bos

I am interested in using Papermill to generate notebooks to analyze financial data. The example in the README shows how to store primitive Python types, but doesn't provide guidance on how to "record"/store more complex data types. I am considering storing the data in CSV files or in some type of "database-like" resource.

Do you have any guidance or advice on things that you've learned about data ingress and data storage when using Papermill?

@MSeal
Member

MSeal commented Sep 24, 2018

Hi @aabdullah-bos,

We definitely want to improve data storage in papermill. See #175, which suggests moving that capability into an adjacent project and upgrading the interaction.

In general, with what's there today, the data has to be serializable to/from JSON for the record operation. Converting pandas DataFrames to CSV/JSON and recording that does work (I've used it in a few places myself). You just have to re-encode the data back into DataFrames on the consumer side, knowing how you originally saved it.

As far as learnings on the topic:

- It's good for one-off situations, or situations where you don't have a readily available database/object store to reference. It's handy when you have a chain of work and want to reproducibly store state in isolation from everything else along the way.
- It's bad when your DataFrame is large (>20MB), because most notebook interfaces (papermill included) don't do incremental saves and instead send the whole notebook to the store on each save.
- It's also not easy for most external systems to understand what you're storing and where, if you need to share more globally. This means that if the notebook data is to be used by more than one downstream system, the notebooks should register the data they house within your pipeline, or you should save it to a proper store (like S3/Azure/your favorite blob store).
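The size tradeoff above can be sketched as follows. This is a minimal illustration, not papermill code: the ~20MB cutoff comes from the note above, and `serialize_or_offload` is a hypothetical helper that embeds small frames inline but writes large ones externally so the notebook would record only a locator string.

```python
import pandas as pd

# Illustrative threshold: notebooks are re-saved wholesale, so large
# embedded payloads make every save expensive (see note above).
MAX_EMBED_BYTES = 20 * 1024 * 1024  # ~20MB

def serialize_or_offload(df, path):
    """Embed small frames as JSON; write big ones to an external file
    and return only a locator suitable for recording in the notebook."""
    payload = df.to_json()
    if len(payload.encode("utf-8")) <= MAX_EMBED_BYTES:
        return {"kind": "inline", "data": payload}
    df.to_csv(path)  # offload to disk / blob store instead
    return {"kind": "path", "data": path}

df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]})
result = serialize_or_offload(df, "big_frame.csv")
# a frame this small stays inline; result["data"] is the JSON payload
```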

Hopefully this is useful insight. Feel free to ask more specific questions if any of that wasn't clear or needs deeper elaboration.

@aabdullah-bos
Author

@MSeal Thanks for the response. I've implemented something like the code below; I haven't decided yet whether to use JSON or CSV. I'll keep your learnings in mind as I work through a few more iterations. I think I'm leaning towards recording a URL, handle, path, or resource locator in the output notebook, so that downstream notebooks can access the data.

On the producer side

import papermill as pm
import pandas

def compute_some_data():
    df = pandas.DataFrame.from_dict({'a': [1,2,3], 'b': [3,2,1]})
    return df

df = compute_some_data()

# record as JSON
json_data = df.to_json()
pm.record('json_data', json_data)

# record as string
csv_data = df.to_csv()
pm.record('csv_data', csv_data)

On the consumer side

import papermill as pm
import pandas
import json
import io

nb = pm.read_notebook('test_df_json.ipynb')

# read from JSON (note: json.loads turns the index keys into strings,
# so df_j ends up with a string index; pandas.read_json preserves types better)
json_data = json.loads(nb.dataframe[nb.dataframe.name == 'json_data'].iloc[0]['value'])
df_j = pandas.DataFrame.from_dict(json_data)

# read from CSV
stream = io.StringIO(nb.dataframe[nb.dataframe.name == 'csv_data'].iloc[0]['value'])
df_c = pandas.read_csv(stream, index_col=0)
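The locator idea mentioned above could look something like the sketch below. To keep it self-contained, a plain `recorded` dict stands in for `pm.record`/`nb.dataframe` (the notebook would carry only the small locator string), and the temp path is purely illustrative.

```python
import os
import tempfile
import pandas

# Stand-in for pm.record / nb.dataframe lookup: only this small
# locator string would live in the notebook, not the data itself.
recorded = {}

# Producer side: persist the frame to durable storage, record only the path.
df = pandas.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1]})
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df.to_csv(path)
recorded['csv_path'] = path  # in the thread's setup: pm.record('csv_path', path)

# Consumer side: resolve the locator, then load from the store.
df_again = pandas.read_csv(recorded['csv_path'], index_col=0)
```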

@MSeal
Member

MSeal commented Sep 29, 2018

Awesome, hopefully the points above will help. I'm going to close the issue but feel free to reopen if you have followup concerns/questions.
