duckdb R client uses excessive RAM and crashes #72
Comments
How much RAM does the environment have? The aggregation likely runs out of memory here. We still need to add an out-of-core aggregation operator.
Thanks! I'm running on a machine with 32 GB RAM.
This won't fix the problem, but with DuckDB 0.3.3 or above you could simplify your connection code to local parquet files:

```r
parquet <- file.path("/home/shared-data/ebird/observations/*.parquet")
conn <- DBI::dbConnect(duckdb(), "/home/shared-data/ebird/db", config = list("memory_limit" = "12GB"))
obs <- tbl(conn, paste0("read_parquet('", parquet, "')"))
```

And it may be possible to avoid the crash in the arrow version by giving a memory-limited connection to `to_duckdb()`:

```r
path <- arrow::s3_bucket("ebird/observations", endpoint_override = "minio.carlboettiger.info")
conn <- DBI::dbConnect(duckdb(), "/home/shared-data/ebird/db", config = list("memory_limit" = "12GB"))
obs <- arrow::open_dataset(path) |> to_duckdb(conn)
```

The download link you gave was not (yet) working, so I couldn't test with the actual data. FYI, in case it would be useful to you, there is an unofficial DuckDB R package (updated after each merge to the DuckDB master branch, and also containing DuckDB extensions that can be easily compiled with the default R toolchain) available in my r-universe.
Thanks @rsund! Hmm, thanks for sharing the above syntax; that's cool that you don't need an explicit declaration first. Also, I've fixed my S3 bucket address, sorry (forgot to include the subdir for month).

Still crashing though, I suspect because of what @hannes pointed out already about the aggregation. Is there an existing issue thread I should watch for that? 👀
Well, you should not use a plain in-memory database here without somewhere to spill. Either give it a temporary directory:

```r
conn <- DBI::dbConnect(duckdb(), ":memory:", config = list("memory_limit" = "12GB", "temp_directory" = "/tmp"))
```

or use a file-based database:

```r
conn <- DBI::dbConnect(duckdb(), "database.duckdb", config = list("memory_limit" = "12GB"))
```

It is also unclear to me how Arrow handles memory allocation for online data sources, and when DuckDB runs queries over those there may, in the worst case, be a need to hold whole columns (temporarily) in memory. In principle that should not happen, as data is scanned in chunks, and the excess memory consumption should be related only to the aggregate step, which is not yet tuned to use limited memory.

In this particular case it should be quite straightforward to "manually" run the aggregate query in chunks, if there is some sensible variable for splitting the data into pieces. But that is certainly more work, and it shouldn't be needed here, as both Arrow and DuckDB are expected to take care of larger-than-memory data.

About the "new" features in the R client:
Thanks @rsund, these details are super helpful to me. Yeah, it's also unclear to me how arrow handles memory allocations in this context. FWIW, as you probably know, the same queries are possible in 'pure arrow' without the use of duckdb, but result in similar OOM crashes there, though perhaps for quite different reasons (e.g. https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531027#comment-17531027). I'm a huge fan of these packages, but like you say, being able to seamlessly work with larger-than-RAM data is really the killer feature here. Thanks also for the heads up about the unofficial r-universe package.
Arrow runs into the same problem. For DuckDB, out-of-core capability is absolutely one of our main goals; it's just not supported yet in every operator. The suggestion to use the R tempdir automatically is interesting.
@lnkuiper if you needed any reason to look at the aggregates again ^^
I think one approach could be to summarize over some partition, then aggregate, at least for simple counts; arrow files are often partitioned by something. However, arrow doesn't have an intrinsic chunking method (by row number) that I'm aware of beyond the splitting of parquet files. duckdb offers traditional windows, so perhaps specifying a chunk size (n lines), windowing, and then mapping over the windows with a final summarization step would be appropriate. I don't know if this belongs inside the duckdb R package or rather as an ad-hoc piece of code.
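For what it's worth, a rough sketch of that partition-wise idea in R: aggregate each parquet partition separately, then combine the partial sums. The file paths follow the examples elsewhere in this thread, the rest is an untested illustration, and it only helps if the per-partition result is much smaller than the raw observations.

```r
library(duckdb)
library(DBI)
library(dplyr)

con <- dbConnect(duckdb(), config = list("memory_limit" = "12GB"))

# One partial aggregate per parquet partition (assumed local paths)
files <- list.files("/home/shared-data/ebird/observations",
                    pattern = "\\.parquet$", full.names = TRUE)

partials <- lapply(files, function(f) {
  dbGetQuery(con, sprintf(
    "SELECT sampling_event_identifier, scientific_name,
            SUM(observation_count) AS count
     FROM read_parquet('%s')
     GROUP BY sampling_event_identifier, scientific_name", f))
})

# Sums of per-partition sums equal the overall sums
result <- bind_rows(partials) |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarise(count = sum(count), .groups = "drop") |>
  mutate(species_detected = count > 0)

dbDisconnect(con, shutdown = TRUE)
```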
@hannes et al. any news on out-of-core capability in aggregates? In other news, at least I can now reproduce the crash in pure SQL without the arrow wrapper for the S3 bit 😉:

```sql
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';

COPY (
  SELECT *, count > 0.0 AS species_detected
  FROM (
    SELECT
      sampling_event_identifier,
      scientific_name,
      SUM(observation_count) AS count
    FROM (
      SELECT sampling_event_identifier, scientific_name, observation_count
      FROM read_parquet("s3://ebird/observations/*")
    )
    GROUP BY sampling_event_identifier, scientific_name
  )
) TO 'output.parquet' (FORMAT PARQUET);
```

This is a public bucket, so if I use the https:// addressing and limit to something considerably less than all 1124 partitions in the dataset, it doesn't crash. As such, I can probably hack together the query by working over partitions in chunks manually, but that seems less than ideal...
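To make the "subset of partitions over https" point concrete, here is an untested sketch: `read_parquet()` accepts a list of files, but the two object names below are hypothetical placeholders rather than real keys in the bucket.

```r
library(duckdb)
library(DBI)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")

# Hypothetical subset of partitions, addressed over plain https
partial <- dbGetQuery(con, "
  SELECT sampling_event_identifier, scientific_name,
         SUM(observation_count) AS count
  FROM read_parquet([
    'https://minio.carlboettiger.info/ebird/observations/part-0.parquet',
    'https://minio.carlboettiger.info/ebird/observations/part-1.parquet'
  ])
  GROUP BY sampling_event_identifier, scientific_name")

dbDisconnect(con, shutdown = TRUE)
```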
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
Any news on out-of-core capability in this case, or the ability to leverage the temp dir to avoid OOM kills on these operations? (As you already know, the above operations still lead to the OOM killer even in pure SQL.)
@cboettig Not sure, but potentially this PR will address your issue?
@Tmonster Thanks for flagging me! I just tested the examples above again having installed from GitHub, but sadly duckdb still just consumes all available RAM and crashes.
Maybe also something where @lnkuiper could have a look - can we make a pure SQL repro for this?
I think the pure SQL example above should still be hitting a public-read bucket:

```sql
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';

COPY (
  SELECT *, count > 0.0 AS species_detected
  FROM (
    SELECT
      sampling_event_identifier,
      scientific_name,
      SUM(observation_count) AS count
    FROM (
      SELECT sampling_event_identifier, scientific_name, observation_count
      FROM read_parquet("s3://ebird/observations/*")
    )
    GROUP BY sampling_event_identifier, scientific_name
  )
) TO 'output.parquet' (FORMAT PARQUET);
```

Lemme know if that's not working for you?
Hi, I've been able to reproduce it. It seems to be a bug indeed. I'm working on narrowing down the exact issue, but a current workaround would be to materialize the data in a table first and then run the aggregation:

```sql
create table observations as
  select * from read_parquet("s3://ebird/observations/*");

create table answer as
  SELECT *, count > 0.0 AS species_detected
  FROM (
    SELECT
      sampling_event_identifier,
      scientific_name,
      SUM(observation_count) AS count
    FROM (
      SELECT sampling_event_identifier, scientific_name, observation_count
      FROM observations
    )
    GROUP BY sampling_event_identifier, scientific_name
  );

SET preserve_insertion_order=false;

copy answer to 'output.parquet' (FORMAT PARQUET);
```

If you have a persistent database or set a temporary directory for an in-memory database, this should work. Worked on my laptop with 16 GB.
Thanks much for this @Tmonster, it's great to be able to work around the issue this way. I'm still keen on having a way for users to execute this remotely without having to create the local copy, but it's really nice for this to be unstuck. Please share if you've had any further luck in pinpointing the exact issue that causes this query to OOM.
@Tmonster maybe I spoke too soon? This is still hitting an OOM error on my machine with 48 GB RAM. Maybe this is a secondary issue with how back-pressure is implemented, since I'm running inside a container capped at 48 GB on a host machine with 64 GB RAM. I am setting both a tempdir path PRAGMA and a max memory limit PRAGMA. Running on Ubuntu 22.04 on amd64 arch. (Here's the R version, though note it's just passing the SQL as above; I don't think R is really involved in the issue.)
Hmmm, I'm not positive, but there may be a difference between setting up a tempdir and connecting to a persistent database file. Can you try replacing the second line with a connection to a persistent database file instead?
Good idea, but no luck there; it still consumes all available RAM (even on my larger machine with 120 GB) and crashes.
Just gathering some more potentially useful info. Also, something worth trying to further troubleshoot the issue: can you reproduce it using the CLI?
Thanks -- great ideas here. I did try running in the CLI, and also setting threads to 1.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
Hi Simon, I think what you are seeing here is the effect of DuckDB materializing the data from the table, allocating the R data frame, then copying the data into R. DuckDB doesn't know how much space it should allocate until it has all of the data, so first all the data needs to be materialized in DuckDB's memory (~16 GB). Then DuckDB needs to allocate an R data frame to hold all the data (also ~16 GB) so that R can use it. Once DuckDB passes off the data, it lives only in R, which is why you see 16 GB of memory in use again. Interesting that DuckDB doesn't provide any feedback that the memory limit has been reached in this case. That is something I can look into.
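One illustrative way to sidestep that double materialization is to keep large results inside DuckDB or on disk and only pull small aggregates into R. This is just a sketch; the `observations` table and file names are assumptions made for illustration.

```r
library(duckdb)
library(DBI)

con <- dbConnect(duckdb(), dbdir = "ebird.duckdb")

# Large result: written straight to parquet inside DuckDB, so no
# full-size R data frame ever has to be allocated
dbExecute(con, "
  COPY (SELECT * FROM observations)
  TO 'observations_export.parquet' (FORMAT PARQUET)")

# Small result: only the aggregate is materialized on the R side
by_species <- dbGetQuery(con, "
  SELECT scientific_name, COUNT(*) AS n
  FROM observations
  GROUP BY scientific_name")

dbDisconnect(con, shutdown = TRUE)
```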
Thanks! Is it ... or ...?
Sorry, my mention of feedback wasn't clear. To clarify, feedback from DuckDB would be some kind of notification to the user that DuckDB requires more memory than what is specified by the memory_limit. Normally, if DuckDB attempts to use more memory than the memory limit allows, an out-of-memory error is raised. We have an internal issue to track this.
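When such an error is raised, it surfaces on the R side as an ordinary R condition, so it can be caught rather than aborting a script. A small hypothetical sketch, where `con` and `big_query` stand in for an existing connection and query:

```r
# Hypothetical: catch an out-of-memory failure from DuckDB instead of
# letting it stop the whole script
result <- tryCatch(
  DBI::dbGetQuery(con, big_query),
  error = function(e) {
    message("Query failed: ", conditionMessage(e))
    NULL
  }
)
```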
Gotcha, thanks again. Just to make sure: there is no way to read/write data from a duckdb database to R without using twice as much RAM as the full data size at some point, correct?
@krlmlr still an issue, though it may be a different one from @SimonCoulombe's. I think my example above is a symptom of not all the aggregation methods being implemented out-of-core yet? As noted above, I believe my example still crashes out of RAM when executed as pure SQL without R involved.
Thanks! @cboettig: If this can be replicated with the command-line client, an issue in the upstream repo https://github.com/duckdb/duckdb/ would be useful. @SimonCoulombe: Is your issue still a problem with the latest dev version? You can also install from r-universe: https://duckdb.r-universe.dev/duckdb# . If yes, let's discuss this in a new issue.
@krlmlr thanks, it's been a while since I checked this in the CLI client. Just tested now with the identical code I provided above, and it is working just fine in the CLI, so it is indeed now just an R issue at this point. I'm not using DBI::dbWriteTable explicitly above though, so I don't know if this is or isn't the same issue as @SimonCoulombe is reporting. Again, here's the code that I run that works in the CLI but fails when called via R. In the CLI, it uses just above 30 GB (RES use peaks around 35.7 GB; VIRT use is higher, but I understand that doesn't matter so much). Works fine in the CLI client, crashes in R.
Thanks. @cboettig, @SimonCoulombe: Is this still a problem with #90, which contains duckdb 0.10.0?
On it -- let me check.
@cboettig: From looking at your example, I see no reason why it shouldn't work with the current dev version, soon to hit CRAN. That version contains exactly the same core as the v0.10.0 release. I have split your script into an importing and a processing stage; the importing stage is now at about 2 GB, with the progress still at 0%. I wonder how much data this example contains. More important, though, is whether we can downsize the example so that it still shows the problem without having to wait that long.
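For concreteness, an importing/processing split along those lines might look roughly like this in R. This is my paraphrase of the idea rather than the actual script used here, and the database path is an assumption.

```r
library(duckdb)
library(DBI)

con <- dbConnect(duckdb(), dbdir = "ebird.duckdb",
                 config = list("memory_limit" = "12GB"))
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "SET s3_endpoint='minio.carlboettiger.info'")
dbExecute(con, "SET s3_url_style='path'")

# Stage 1: import the remote parquet files into a local table
dbExecute(con, "
  CREATE TABLE observations AS
  SELECT sampling_event_identifier, scientific_name, observation_count
  FROM read_parquet('s3://ebird/observations/*')")

# Stage 2: aggregate against the local table and write the result out
dbExecute(con, "
  COPY (
    SELECT *, count > 0.0 AS species_detected
    FROM (
      SELECT sampling_event_identifier, scientific_name,
             SUM(observation_count) AS count
      FROM observations
      GROUP BY sampling_event_identifier, scientific_name
    )
  ) TO 'output.parquet' (FORMAT PARQUET)")

dbDisconnect(con, shutdown = TRUE)
```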
Thanks much @krlmlr, really appreciate this. Yes, this is a large example (easy to check using s3, e.g.):

```r
> library(minioclient)
> mc_alias_set("ex", "minio.carlboettiger.info", "", "")
Added `ex` successfully.
> mc_du("ex/ebird/observations/")
152GiB 1124 objects ebird/observations
```

I'd love to find a smaller example that illustrates the problem. You could certainly run the command against any one, or any subset, of the 1124 parquet files.
Thanks. The memory limit can be as small as 10 MB or a small multiple; that's already an awful lot of memory blocks that the database engine can shuffle around. True, some operations don't work at all if the memory limit is too aggressive, but perhaps there's a sweet spot? If you still have a duckdb file with the data, could you try running the query with the most recent duckdb package from r-universe? Chances are it'll just work.
I stand corrected -- a 10 MB memory limit doesn't seem to be a good simulation of reality; perhaps 125 MB per thread is more realistic?
Still testing with this. I'm not entirely clear on the difference between the options: are these just aliases for the same thing? Also, from https://duckdb.org/docs/configuration/pragmas#memory-limit, it sounds like these aren't enforced on certain aggregation operations, though I need to look more closely to understand if/when I encounter any of those cases.
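For whatever it's worth, the effective value of a setting can be checked from the connection itself. A small sketch (connection setup assumed, and the limit value is just an example):

```r
library(duckdb)
library(DBI)

con <- dbConnect(duckdb())
dbExecute(con, "SET memory_limit = '12GB'")

# Ask the engine what limit is actually in effect
dbGetQuery(con, "SELECT current_setting('memory_limit') AS memory_limit")

dbDisconnect(con, shutdown = TRUE)
```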
What happens?
duckdb crashes when trying to perform queries on a large (~150 GB) parquet database
To Reproduce
Apologies, I cannot reproduce this with a smaller dataset. I can reproduce the same errors using remote S3 access though, which might be a bit faster than downloading a full copy of the dataset, or you could probably download a copy from my server in a few days. Reproducible examples of both remote and local-based access are below. (The remote example goes via arrow because I cannot get direct access to S3 buckets via duckdb alone yet.)
Or, after downloading from the above public S3 bucket (https://minio.carlboettiger.info/ebird/Mar-2022/observations), try local parquet access:
With the memory limit PRAGMA in place, the example doesn't crash the R session but does throw an OOM error:
Environment (please complete the following information):
Before Submitting
- Have you tried this on the latest `master` branch?
  - Python: `pip install duckdb --upgrade --pre`
  - R: `install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)`

By the way, `install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)` no longer works, as there no longer seems to be a master-builds download?