
DuckDB MIMIC-III concepts implementation #1529

Open · wants to merge 20 commits into base: main
Conversation

@SphtKr SphtKr commented Apr 28, 2023

These changes expand on #1489 / #1239 by adding a Python version that ingests the data and creates the concepts views in the DuckDB database. The Python version requires only the duckdb Python package, not the command-line executable. It also requires sqlglot.

Run `python import-duckdb.py --help` for usage; it's pretty simple.

Tested on the MIMIC-III demo and the full version (1.4)... it seems to work, though I'm not sure if there's a better way to test (i.e. compare actual values... I have the Postgres version running in a container if so).

Resolves #1495

@SphtKr SphtKr changed the title DuckDB concepts implementation DuckDB MIMIC-III concepts implementation Apr 28, 2023
SphtKr and others added 3 commits April 28, 2023 23:57
…es without a large amount of RAM, and this way it can be skipped with `--skip-indexes`

@alistairewj alistairewj left a comment


Some initial thoughts from a scan of the code.

I also think there are a lot of if/else statements littered throughout, but I haven't thought about it enough to simplify that yet.

mimic-iii/buildmimic/duckdb/README.md (outdated, resolved)
Member

I was hoping to avoid having a separate CREATE TABLE statement for DuckDB. For MIMIC-III it's not a big deal (the tables have been static for ages), so this can be left as is.

For MIMIC-IV, we are continuing to update it, so this results in some inconsistencies, which either need to be checked with a GitHub Action (which is hard) or avoided by having one source of SQL DDL for creating tables.

Contributor Author

I can try putting this through sqlglot--I think I made that SQL script before I started using it. At present, there is a bug (referenced in the comment in that file) that causes creation of the table to fail with the primary key in place (index creation causes an OOM)... so until that's fixed, the table creation will need at least that tweak.

Member

It would be nice to have a JSON file used to generate the DDL for Postgres/BigQuery/MySQL/DuckDB, but that also seems like a separate PR from the DuckDB concepts.
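The one-source-of-truth idea could look something like this (a minimal sketch; the table, columns, and type map are illustrative, not the real MIMIC schema files):

```python
# Sketch: one JSON schema definition rendered into per-dialect CREATE
# TABLE DDL. Table and column names here are illustrative only.
import json

schema = json.loads("""
{
  "icustays": {
    "icustay_id": {"type": "INTEGER", "primary_key": true},
    "intime":     {"type": "TIMESTAMP"},
    "los":        {"type": "DOUBLE"}
  }
}
""")

# Per-dialect type overrides; most types are shared across dialects.
TYPE_MAP = {"postgres": {"DOUBLE": "DOUBLE PRECISION"}}

def make_ddl(tables, dialect):
    stmts = []
    for table, cols in tables.items():
        defs = []
        for name, spec in cols.items():
            sqltype = TYPE_MAP.get(dialect, {}).get(spec["type"], spec["type"])
            pk = " PRIMARY KEY" if spec.get("primary_key") else ""
            defs.append(f"  {name} {sqltype}{pk}")
        stmts.append(f"CREATE TABLE {table} (\n" + ",\n".join(defs) + "\n);")
    return "\n".join(stmts)

print(make_ddl(schema, "postgres"))
```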

I suppose the index OOM bug is a duckdb bug?

Contributor Author

> I suppose the index OOM bug is a duckdb bug?

Yep, it's resolved in their repo but not released yet.

mimic-iii/buildmimic/duckdb/concepts/icustay_hours.sql (outdated, resolved)
Member

For consistency this file would be better placed in mimic-iii/concepts_duckdb/icustay_hours.sql; this seems doable with the python script.

```python
sql = fp.read()
result = connection.execute(sql).fetchall()
for row in result:
    print(f"{row[0]}: {row[2]} records ({row[1]} expected) - {row[3]}")
```
Member

This will print all fails if we load the demo. Might be nice to have a --demo flag to switch to a demo validation check?

Contributor Author

I was thinking about --skip-checks... should I just make it --run-checks (off by default)? I don't love the idea of another set of static values... though I suppose they wouldn't change much.
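The flag being discussed is a one-liner with argparse (a sketch; the --run-checks name comes from the comment above, the help text is mine):

```python
# Sketch: validation checks off by default, enabled with --run-checks.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--run-checks", action="store_true",
                    help="run row-count validation checks after loading")
args = parser.parse_args(["--run-checks"])
print(args.run_checks)  # True
```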

Member

I ended up with a validate.sql and a validate_demo.sql for MIMIC-IV, FWIW.

Member

Though again, the row counts are something that screams "this should be a JSON somewhere else in the repo", rather than being hard-coded into all these destined-to-rot SQL files.
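A sketch of what that could look like: expected counts in a single JSON file, with the validation query generated from it rather than hard-coded per dialect (table names and counts here are illustrative):

```python
# Sketch: generate the row-count validation query from one JSON file of
# expected counts. Counts below are illustrative only.
import json

expected = json.loads('{"patients": 46520, "admissions": 58976}')

def validation_sql(expected_counts):
    parts = [
        f"SELECT '{table}' AS tbl, {count} AS expected, "
        f"(SELECT COUNT(*) FROM {table}) AS actual"
        for table, count in expected_counts.items()
    ]
    return "\nUNION ALL\n".join(parts)

print(validation_sql(expected))
```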


```python
print("Creating tables...")

# ccs_dx is an outlier...this is adapted from the BigQuery version...
```
Member

Why is it an outlier?

Contributor Author

Yes, the comment should probably be more specific. For the record: 1. it's the only non-query-based table (that gets loaded with data) in the concepts build process; 2. the Postgres DB puts it in the public schema instead of mimiciii... which I'm guessing is because it's also applicable to the MIMIC-IV concepts?

```python
#cProfile.run('...')
#print(f"Making view {key}...")
db = concept_name_map[key].get("db", "bigquery")
if db == "duckdb":
```
Member

I think this is unnecessarily abstract. You might as well hard-code the make_duckdb_query function to do something special for icustay_hours. Alternatively, have the subfunction contain a hard-coded list of queries which it parses as DuckDB; that way you can expand it to more queries in the future if you plan to move more over.

Contributor Author

Yes... earlier I was pulling some queries from the BigQuery version and others from the Postgres version, and Postgres was actually the default... later I got them all to work translated from BigQuery except this one, which now looks odd.

```python
sql = _duckdb_rewrite_schema(sql, schema)

try:
    conn.execute(f"CREATE VIEW {qname} AS " + sql)
```
Member

Missing a DROP VIEW IF EXISTS -> this will be fixed by harmonizing the duckdb/bigquery subfunctions.

```python
with open(qfile, "r") as fp:
    sql = fp.read()
sql = re.sub(_too_many_backslashes_re, '\\$1', sql)
try:
```
Member

I would rewrite this by merging the BigQuery/DuckDB subfunctions into one, with an if statement that applies the SQL modification if needed, then continuing on as usual. This would avoid mild divergences in behaviour, e.g. forgetting to drop the view for DuckDB but doing it for BigQuery.
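A minimal sketch of the merged subfunction (function and parameter names are mine; the DROP VIEW / CREATE VIEW statements mirror the snippets quoted in this PR):

```python
# Sketch: one code path for both dialects, with the translation behind
# an if statement so behaviour like DROP VIEW cannot diverge.
def make_view(conn, qname, sql, dialect="bigquery"):
    if dialect != "duckdb":
        import sqlglot  # lazy import: only translated queries need it
        sql = sqlglot.transpile(sql, read=dialect, write="duckdb")[0]
    conn.execute(f"DROP VIEW IF EXISTS {qname}")
    conn.execute(f"CREATE VIEW {qname} AS {sql}")
```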

Contributor Author

Probably valid.... these two were originally far more divergent but are indeed quite similar now.


```python
conn.execute(f"DROP VIEW IF EXISTS {qname}")
try:
    conn.execute(f"CREATE VIEW {qname} AS " + sql)
```
Member

What about creating either materialized views or simply tables? A view would rerun the query over and over (and be slow). I think most users will be happy trading disk space for faster queries, so I'd say the default should be to create tables, and if a --view argument is passed it creates views instead (or two arguments, --make-concepts and --make-concepts-as-views).

Contributor Author

Probably a good idea. It looks like DuckDB doesn't have materialized views (yet), but for this use case tables are perfectly valid.
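A sketch of the tables-by-default behaviour (function and parameter names are illustrative):

```python
# Sketch: create concepts as tables by default, falling back to views
# when the caller asks for them. DuckDB accepts CREATE TABLE ... AS
# and CREATE VIEW ... AS with the same query body.
def create_concept(conn, qname, sql, as_view=False):
    kind = "VIEW" if as_view else "TABLE"
    conn.execute(f"DROP {kind} IF EXISTS {qname}")
    conn.execute(f"CREATE {kind} {qname} AS {sql}")
```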

@alistairewj (Member)

Seems like there are a few valid changes; feel free to update in your own time and just ping me for a re-review when you think it's ready.


SphtKr commented May 8, 2023

I have a very interesting problem.

I noticed that the DuckDB version here is producing integer results for DATETIME_DIFF-type operations, e.g. icustay_detail.los_icu, but the PostgreSQL version (implemented in postgres-functions.sql) returns fractional values, e.g. 3.14 days vs. 3 days (the DuckDB version is in fact a floor of the Postgres version, it appears). So I was going to fix that...

Then, since I am actually converting the queries from the BigQuery versions, I consulted the BigQuery documentation for DATETIME_DIFF which says "Returns the whole number of specified part intervals..." so it looks like the BigQuery version of the concepts may differ from the Postgres versions in this area--though I don't have easy access to the BigQuery version to test.

It looks like I can produce either behaviour in the DuckDB version by changing how the SQL is translated... I may even add an option to choose... but which one should it be, if there is a "should"? If nothing else, I wanted to bring this up since it's a discrepancy between the two current versions.
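The difference can be illustrated with plain Python datetimes (the timestamps are made up; the fractional value corresponds to the Postgres-style result, the floored one to the BigQuery-style result described in this thread):

```python
# Illustrating the discrepancy: fractional elapsed days (Postgres-style
# los_icu) vs. a floored whole-day count (BigQuery DATETIME_DIFF-style).
from datetime import datetime

intime = datetime(2101, 1, 1, 12, 0)
outtime = datetime(2101, 1, 4, 18, 0)

fractional_days = (outtime - intime).total_seconds() / 86400  # Postgres-style
whole_days = int(fractional_days)  # BigQuery-style floor

print(fractional_days, whole_days)  # 3.25 3
```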

@alistairewj (Member)

Definitely an inconsistency, good spot. I can confirm BigQuery is giving integers after calculating a floor. The easiest fix I can see is to add a FLOOR() call to each of the PostgreSQL functions that emulate BigQuery behaviour (which can be done in a separate PR); I'll raise an issue for that. There are a few queries where this obviously impacts the output, as it's generated from a DATETIME_DIFF call (age, icustay_detail), and some queries where it's more subtle (e.g. urine_output_rate).

For DuckDB I think returning whole integers is fine.


Successfully merging this pull request may close these issues.

MIMIC-III Concepts for Sqlite3 or DuckDB (probably now DuckDB)