Where to Store Data Artifacts for Pythia? #20

Open
norlandrhagen opened this issue Feb 22, 2023 · 10 comments
Labels
infrastructure Infrastructure related issue

Comments

@norlandrhagen
Collaborator

norlandrhagen commented Feb 22, 2023

Hi there,

I have a general Pythia infrastructure question. In some sections of the Kerchunk-Cookbook, we would like to demonstrate how to open a pre-generated Kerchunk reference file for a large virtual dataset. We can host this reference .json or .parquet on a CarbonPlan cloud account for now, but we were wondering if there is a preferred location to host these artifacts. Ideally, they would live in a Pythia bucket so that they stay connected to the project. I'm not sure whether any resources exist to host these, so I'm interested in everyone's thoughts. The total size should be quite small: probably one or two files in the tens to hundreds of MB range.

Thanks!

cc @maxrjones @brian-rose
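
For readers unfamiliar with the workflow described above, here is a minimal sketch of opening a pre-generated Kerchunk reference file with xarray and fsspec. The helper names `reference_open_kwargs` and `open_reference_dataset`, the file name `reference.json`, and the choice of anonymous S3 access for the underlying chunks are all assumptions for illustration; none of them come from the Kerchunk-Cookbook itself.

```python
from typing import Any


def reference_open_kwargs(
    reference_url: str,
    remote_protocol: str = "s3",
    anon: bool = True,
) -> dict[str, Any]:
    """Build xarray keyword arguments for a Kerchunk reference file.

    `reference_url` points at the reference file itself; `remote_protocol`
    and `anon` describe where the referenced data chunks live (assumed here
    to be a publicly readable S3 bucket).
    """
    return {
        "engine": "zarr",
        "backend_kwargs": {
            # Kerchunk references usually carry no consolidated metadata.
            "consolidated": False,
            "storage_options": {
                "fo": reference_url,
                "remote_protocol": remote_protocol,
                "remote_options": {"anon": anon},
            },
        },
    }


def open_reference_dataset(reference_url: str, **kwargs: Any):
    """Open the virtual dataset lazily (hypothetical convenience wrapper)."""
    # Imported here so the kwargs helper works without xarray installed.
    # Actually opening data needs xarray, zarr, fsspec, and s3fs.
    import xarray as xr

    return xr.open_dataset("reference://", **reference_open_kwargs(reference_url, **kwargs))
```

A call such as `open_reference_dataset("reference.json")` would then return a lazily loaded dataset. A reference file that is itself stored remotely may also need fsspec's `target_options` to describe how the reference file is read.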

@brian-rose
Member

That's a great question. We have the pythia-datasets repo/package that we've used to house example data for Foundations tutorials. That's certainly a possibility. The datasets themselves are stored within the repo (which works OK for small files), and they can be accessed from notebooks with a lightweight API based on pooch; see this Foundations example.

We're exploring the idea of requesting Pythia storage buckets via Open Storage Network to host larger ARCO datasets that we can build significant Cookbook content on, but that's still just an idea at this point.

@ktyle this would be great to put on the agenda for the next IWG meeting.
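
To illustrate the registry-based access pattern mentioned above, here is a stdlib-only sketch of the idea behind a pooch-style fetch: look the file up in a registry of known hashes, download it once into a local cache, and verify its checksum. This is not the pythia-datasets or pooch implementation; the names `fetch` and `sha256_of`, the registry layout, and the cache location are assumptions for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(name: str, registry: dict[str, str], base_url: str, cache_dir: Path) -> Path:
    """Return a local path to `name`, downloading it only if not cached.

    `registry` maps file names to expected SHA-256 digests (hypothetical
    layout); `base_url` is the location the files are served from.
    """
    if name not in registry:
        raise KeyError(f"{name!r} is not in the registry")
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        url = base_url.rstrip("/") + "/" + name
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    # Verify the cached or freshly downloaded copy against the registry.
    if sha256_of(target) != registry[name]:
        target.unlink()
        raise ValueError(f"hash mismatch for {name!r}")
    return target
```

In practice a notebook would not reimplement this. With the pythia-datasets package installed, the call is, to the best of my knowledge, a one-liner along the lines of `DATASETS.fetch("<file name>")` after `from pythia_datasets import DATASETS`.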

@ktyle
Contributor

ktyle commented Feb 22, 2023

Yes, will put on the agenda. @norlandrhagen this could also be a good candidate for a Pangeo Forge recipe.

@norlandrhagen
Collaborator Author

Thanks for the reply @brian-rose and @ktyle. Would love to hear what comes out of next week's meeting.

@ktyle, definitely a good call! I'm not super savvy on the details, but I think the current storage for Pangeo-Forge is tied to the current NSF grant, which ends in roughly a year. Also, the Reference Recipe functionality isn't implemented in the beam-refactor branch of Pangeo-Forge. Hopefully, when that gets finished, I can make some Kerchunk-based Pangeo-Forge reference recipes.

@ktyle
Contributor

ktyle commented Feb 22, 2023

@norlandrhagen I wondered how long the Pangeo Forge storage will remain available. Good to know. Our next infrastructure working group meeting is Mon 3/6 (next week is our Education working group; these two working groups meet on alternate Mondays). 😄

@brian-rose brian-rose added the infrastructure Infrastructure related issue label Apr 27, 2023
@brian-rose
Member

We are going to push for some better guidance on this prior to our summer 2023 hackathon.

@dcamron

dcamron commented May 15, 2023

Thanks for letting me hijack this issue for a more general answer. At today's IWG meeting we agreed on our recommendations for cookbook data artifacts, in loose order of preference:

  1. rely on data that is already freely available and usable, accessible with tools in the ecosystem; point to Foundations or other cookbooks for tool how-to guides if needed
  2. focus on representative subsets of data that can be packaged alongside the cookbook in-repo
  3. wait for the Pythia team to explore cloud storage support via NSF JetStream or adjacent efforts (a project priority, but one that will not start until the cookbook execution infrastructure project is complete)
  4. provide the tools and/or clear documentation for accessing data authors have stored somewhere accessible themselves

We'll leave this issue open for discussion until we document these suggestions with more instruction across the project, e.g. in the cookbook contributor's guide.

@brian-rose
Member

More general discussion of the data storage issue: ProjectPythia/cookbook-gallery#155

@ktyle
Contributor

ktyle commented Sep 3, 2024

@norlandrhagen are you still interested in hosting some data artifacts from the kerchunk cookbook? If so, please let me know the approximate size in total. I can then advise on how best you can transfer your data to Project Pythia's object store hosted on Jetstream2.

@norlandrhagen
Collaborator Author

Hey @ktyle! I think the only real artifact we have stored is a ~272 MB Kerchunk parquet reference.

@ktyle
Contributor

ktyle commented Sep 30, 2024

@norlandrhagen is it s3://carbonplan-share/nasa-nex-reference/reference_catalog_nested.csv? If that's all there is, we might as well close this for now. However, if you end up needing a home for larger datasets, we can discuss housing them on our object store on JS2.
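
When sizing an artifact for a possible transfer, a small helper like the following can report how large an object at an S3 URI is. This is only a sketch: the function names `split_s3_uri` and `object_size_mb` are hypothetical, and anonymous read access to the bucket is an assumption that may not hold for the path discussed above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def object_size_mb(uri: str, anon: bool = True) -> float:
    """Return the size of one S3 object in megabytes (10**6 bytes).

    Requires s3fs; `anon=True` assumes the bucket allows public reads.
    """
    import s3fs  # imported lazily so split_s3_uri works without it

    bucket, key = split_s3_uri(uri)
    fs = s3fs.S3FileSystem(anon=anon)
    return fs.info(f"{bucket}/{key}")["size"] / 1e6
```

For a reference stored as a directory of parquet files rather than a single object, the sizes of the individual files would need to be summed instead.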

Projects
Status: To-Do

6 participants