Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DuckDB #205

Open
tpoisot opened this issue Oct 11, 2023 · 3 comments
Open

DuckDB #205

tpoisot opened this issue Oct 11, 2023 · 3 comments

Comments

@tpoisot
Copy link
Member

tpoisot commented Oct 11, 2023

Tagging @cboettig - would you be open to sharing the code snippet you showed me after the workshop?

@cboettig
Copy link

I dropped an example in tpoisot/InterpretableSDMWithJulia#13 but maybe would be easier to paste in the thread.

A few extra notes:

  • duckdb can do this remote access on http urls as well, not just S3 (arrow does some similar stuff but without full SQL and only for S3). It also works with CSV files, not just parquet, and the read method can take a vector of file paths. This is nice, for instance, because with NEON on Google Cloud Storage now, we can have a remote connection to all the csv files in some data product without downloading a single one. And of course you can do authenticated S3 access when the data can't be re-distributed (looking at you, eBird)
  • The duckdb spatial extension are quite extensive. This works with geoparquet, but also with plain parquet or csv files with a lat/lon columns since you can just mutate a geometry column on it and operate against that.
using DuckDB
# create a new in-memory database
con = DBInterface.connect(DuckDB.DB, ":memory:")

# create a table
DBInterface.execute(con, "INSTALL httpfs;")
DBInterface.execute(con, "LOAD httpfs;")

DBInterface.execute(con, 
"CREATE OR REPLACE TEMPORARY VIEW gbif 
 AS SELECT * FROM 
 read_parquet('s3://gbif-open-data-us-east-1/occurrence/2023-10-01/occurrence.parquet/**')
 ;")

# Note that we can do aribtrary SQL operations, including
# geoparquet spatial queries, with minimal RAM footprint,
# over the full gbif

results = DBInterface.execute(con, 
"SELECT *
FROM gbif
WHERE
  (scientificname = 'Procyon lotor') AND
  (decimallatitude BETWEEN -80.0 AND -58.501) AND
  (decimallongitude BETWEEN 41.0 AND 51.999)")

@tpoisot
Copy link
Member Author

tpoisot commented Oct 12, 2023

I definitely need to read more about this, but if it is a viable alternative to the streaming API, I'm all for it. Do you know if there's a similar endpoint for the taxonomy?

@cboettig
Copy link

Looks like that's the only one they post to AWS open data, but we could ask the GBIF team about that. The current taxonomic backbone (and other snapshots) can be downloaded as a tsv already from https://hosted-datasets.gbif.org/datasets/backbone/current/ -- if it weren't in a zip archive we could hit it directly from duckdb (gz compression would have been fine).

For now I just dropped that copy on my MINIO server so I think this should work:

DBInterface.execute(con, 
"CREATE OR REPLACE TEMPORARY VIEW taxon 
AS SELECT * FROM 
read_csv('https://minio.carlboettiger.info/gbif/Taxon.tsv', 
         delim='\\t', header=true, quote='',  auto_detect=true);")

taxon = DBInterface.execute(con, "SELECT * FROM Taxon")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

2 participants