
Cloud storage #1582

Closed · wants to merge 13 commits

Conversation


@yuchenZhangTG yuchenZhangTG commented May 28, 2023

  1. Support COPY from different filesystems (AWS S3, Google Cloud Storage):
     copy Comment from "gs://kuzubi/example.csv";
  2. Support COPY from compressed CSV files:
     copy Comment from "gs://kuzubi/example.csv.gz";
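For example, the same statement can read from an S3 bucket (the bucket name here is illustrative, not a real dataset):

```cypher
copy Comment from "s3://my-bucket/example.csv";
```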

I have read and agree to the terms under CLA.md.

@yuchenZhangTG yuchenZhangTG marked this pull request as draft May 30, 2023 08:16
@yuchenZhangTG yuchenZhangTG marked this pull request as ready for review May 30, 2023 08:31
@ray6080 ray6080 requested review from ray6080 and mewim May 30, 2023 13:11
@yuchenZhangTG (Author)

Hmm, my changes require a system-installed cURL.

I see your tests run on self-hosted machines. Is it possible to install cURL on them?

@mewim (Member) commented May 30, 2023

Hi @yuchenZhangTG we would like to ensure that Kùzu can be built without installing any third party dependencies manually. If your change requires libcurl, could you please bring it to third_party?

@yuchenZhangTG (Author)

> Hi @yuchenZhangTG we would like to ensure that Kùzu can be built without installing any third party dependencies manually. If your change requires libcurl, could you please bring it to third_party?

As mentioned in https://arrow.apache.org/docs/dev/developers/cpp/building.html:

> google_cloud_cpp_storage: for Google Cloud Storage support, requires system cURL and can use the BUNDLED method described below

System cURL is required by the AWS SDK and google_cloud_cpp_storage, and a custom-built cURL does not seem to be detected by Apache Arrow's dependency build.

The dependency build generates its own CMake commands and environment variables. I tried passing in various -D flags and environment variables, but it did not work.
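For example, I was trying things along these lines (a hedged sketch; the variable names and the way Kùzu wires up the Arrow build are assumptions):

```cmake
# Sketch only: forward system cURL hints into the Arrow dependency build.
find_package(CURL REQUIRED)
list(APPEND ARROW_CMAKE_ARGS
     -DARROW_FILESYSTEM=ON
     -DARROW_S3=ON
     -DARROW_GCS=ON
     -DCURL_INCLUDE_DIR=${CURL_INCLUDE_DIRS}
     -DCURL_LIBRARY=${CURL_LIBRARIES})
# ARROW_CMAKE_ARGS would then be passed as CMAKE_ARGS to the external project
# that builds Arrow; its bundled dependency build still did not pick them up.
```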

I also tried adding cURL to the external folder next to Apache Arrow, since third_party is built after Apache Arrow.

@mewim What do you think?

@ray6080 (Contributor) left a comment

Hi @yuchenZhangTG, this feature looks awesome! We do hope to integrate well with cloud storage and have some ideas on our roadmap (though they are not publicly written down yet).
As for this PR, can you suggest a good way to test it? Should we host a private bucket with some of our test datasets on S3/GCS and deploy secrets on our CI machines?

Btw, feel free to join our Slack workspace here (https://join.slack.com/t/kuzudb/shared_invite/zt-1w0thj6s7-0bLaU8Sb~4fDMKJ~oejG_g) 🙂 We'd love to learn more about your use cases for this feature!

@yuchenZhangTG (Author)

@ray6080
The current implementation uses the Filesystem module in Apache Arrow, but it requires a system-installed libcurl-dev, and a custom-built cURL does not seem to work.

For testing, yes, it would be great if you set up your own AWS and GCP accounts and create private buckets for testing. Cloud storage is not expensive, so you should be able to cover the cost. It is better to host the secrets on your CI machines (people sometimes put secrets in the GitHub code because it is convenient, but it is not safe). As far as I know, GCP does not charge for downloads, and AWS only charges for cross-region downloads. It is also fine to make the buckets public if you want to share the use case in the Kùzu cloud demo (the one on your website).

For cloud import: I loaded a 180 MB Snappy-compressed Parquet file in 11 s from a local SSD and 14 s from a GCS bucket, so copying from GCS is pretty fast.
For compressed CSV: I tested on Google Cloud Storage (GCS) with the LDBC BI SF30 dataset and imported the comment.csv.gz files; it worked.

@ray6080 (Contributor) commented Jun 1, 2023

> The current implementation uses the Filesystem module in Apache Arrow, but it requires a system-installed libcurl-dev, and a custom-built cURL does not seem to work.
>
> For testing, yes, it would be great if you set up your own AWS and GCP accounts and create private buckets for testing. […]
>
> For cloud import: I loaded a 180 MB Snappy-compressed Parquet file in 11 s from a local SSD and 14 s from a GCS bucket, so copying from GCS is pretty fast. For compressed CSV: I tested on Google Cloud Storage (GCS) with the LDBC BI SF30 dataset and imported the comment.csv.gz files; it worked.

Thanks @yuchenZhangTG! This looks promising. We will set up a testing pipeline for this asap after our upcoming release next week.
As a workaround for the third-party dependency problem for now, is it possible to add a new CMake option, ENABLE_HTTPS, and pass it into the Arrow CMake file, so that we don't introduce an extra system-installed dependency when the option is not enabled?
In the longer term, we will set up an extension mechanism similar to what DuckDB does and offload the cloud part to extensions.
Let us know if you have better ideas!

@yuchenZhangTG (Author) commented Jun 1, 2023

The option in DuckDB is HTTPFS, which is HTTP (HTTP support) + FS (AWS S3 support). My feature does not support HTTP, only S3 and GCS, so it would be better to call it ENABLE_FILESYSTEM.

DuckDB wrote their own HTTPFS extension. The extension is separable from their package, and users install it with the `load httpfs` command. For Kùzu, however, where we can simply use the filesystem support in Apache Arrow, we would need to rebuild the whole Arrow package.

Another solution:
For building from source, we turn on FileSystem in Arrow by default. If the cURL package is not available, we turn off FileSystem and print a warning message.
If ENABLE_FILESYSTEM is explicitly set to true and cURL is not available, we throw an error and ask the user to install the cURL library.
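A minimal sketch of that behaviour, assuming an ENABLE_FILESYSTEM option (the names and the AUTO default are assumptions, not the final implementation):

```cmake
# Sketch: enable Arrow's cloud filesystems only when libcurl is available.
set(ENABLE_FILESYSTEM "AUTO" CACHE STRING "Arrow S3/GCS filesystem support (ON/OFF/AUTO)")

find_package(CURL QUIET)
if(ENABLE_FILESYSTEM STREQUAL "ON" AND NOT CURL_FOUND)
    # The user explicitly asked for cloud filesystems: fail loudly.
    message(FATAL_ERROR "ENABLE_FILESYSTEM=ON requires libcurl; "
        "install it, e.g. `sudo apt-get install libcurl4-openssl-dev`.")
elseif(ENABLE_FILESYSTEM STREQUAL "AUTO")
    if(CURL_FOUND)
        set(ENABLE_FILESYSTEM ON)
    else()
        message(WARNING "libcurl not found; building without cloud filesystem support.")
        set(ENABLE_FILESYSTEM OFF)
    endif()
endif()
```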

For the pre-compiled packages, I suggest building with FileSystem enabled.

What do you think? @ray6080

@ray6080 (Contributor) commented Jun 2, 2023

> The option in DuckDB is HTTPFS, which is HTTP (HTTP support) + FS (AWS S3 support). My feature does not support HTTP, only S3 and GCS, so it would be better to call it ENABLE_FILESYSTEM.

You're right. How about we be more explicit and call it ENABLE_CLOUD_FS, so we can differentiate it from the local FS?

> DuckDB wrote their own HTTPFS extension. The extension is separable from their package, and users install it with the `load httpfs` command. For Kùzu, however, where we can simply use the filesystem support in Apache Arrow, we would need to rebuild the whole Arrow package.

Yep, I agree that it's not a good idea to do the FS extension with Arrow. We have a plan inside the team to remove Arrow; that's why I mentioned the extension idea.

> Another solution: for building from source, we turn on FileSystem in Arrow by default. If the cURL package is not available, we turn off FileSystem and print a warning message. […]

This sounds good to me. @mewim what do you think?

> For the pre-compiled packages, I suggest building with FileSystem enabled.

Agreed!

@mewim (Member) commented Jun 2, 2023

@ray6080 @yuchenZhangTG This sounds good to me. I double-checked, and it seems that curl is already installed on the CI runners:

RUN apt-get update && apt-get install -y g++ gcc clang-14 python3-dev python3-pip python-is-python3 cmake nodejs jq curl apt-transport-https gnupg sudo git clang-format-11 ca-certificates lsb-release wget lcov
I am not sure why the previous build did not go through. Is there another library required to make this work?

@yuchenZhangTG (Author) commented Jun 4, 2023

@mewim
I mean libcurl, not curl:

sudo apt-get install libcurl4-openssl-dev

The build does not work because libcurl4-openssl-dev is missing.

@yuchenZhangTG yuchenZhangTG force-pushed the cloud-storage branch 2 times, most recently from cd04183 to eca9c5a on June 4, 2023 21:40
@ray6080 (Contributor) left a comment

Thanks @yuchenZhangTG! The PR looks good to me. I hope to take another quick look before we merge.
Btw, can you take a look at our CLA file? And please add "I have read and agree to the terms under CLA.md." to the PR description if you're fine with it. Thanks!

src/storage/copier/table_copy_utils.cpp (resolved)
Makefile (outdated, resolved)
src/binder/bind/bind_copy.cpp (outdated, resolved)
src/include/binder/binder.h (outdated, resolved)
src/include/storage/copier/table_copy_executor.h (outdated, resolved)
@mewim (Member) commented Jun 6, 2023

I have redeployed all the CI runners with libcurl.

@codecov bot commented Jun 8, 2023

Codecov Report

Patch coverage: 57.74% and project coverage change: -0.10% ⚠️

Comparison is base (1e9d4fd) 91.19% compared to head (c9f5e21) 91.10%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1582      +/-   ##
==========================================
- Coverage   91.19%   91.10%   -0.10%     
==========================================
  Files         779      779              
  Lines       28606    28654      +48     
==========================================
+ Hits        26088    26104      +16     
- Misses       2518     2550      +32     
Impacted Files Coverage Δ
src/include/binder/binder.h 100.00% <ø> (ø)
src/include/common/copier_config/copier_config.h 100.00% <ø> (ø)
src/include/storage/copier/table_copy_utils.h 100.00% <ø> (ø)
src/common/copier_config/copier_config.cpp 51.85% <40.00%> (-36.39%) ⬇️
src/storage/copier/table_copy_utils.cpp 71.53% <60.00%> (-1.45%) ⬇️
src/binder/bind/bind_copy.cpp 76.84% <61.11%> (-8.88%) ⬇️


@ray6080 (Contributor) commented Jun 8, 2023

Thanks @yuchenZhangTG. Can you manually squash your commits into one? That's a convention we're following. Then I think it's ready to merge!
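For reference, one common way to do this (assuming the branch is based on master; any equivalent approach works):

```sh
git rebase -i master        # mark every commit except the first as "squash"
git push --force-with-lease origin cloud-storage
```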

@yuchenZhangTG yuchenZhangTG force-pushed the cloud-storage branch 2 times, most recently from 846d8e9 to cfed035 on June 8, 2023 04:07
@yuchenZhangTG (Author)

> Thanks @yuchenZhangTG. Can you manually squash your commits into one? That's a convention we're following. Then I think it's ready to merge!

Sounds great.

@ray6080 (Contributor) left a comment

@yuchenZhangTG It seems there is something wrong with the ASan test run, which is taking unusually long. Let's figure out what is happening before merging.

Meanwhile, I've set up a public GCS bucket here (https://storage.googleapis.com/kuzu-test/tinysnb) with CSV files under it. Can you verify that it works and help us set up a test case on it?
My idea is that we create a new dataset directory, tinysnb-cloud, under our dataset directory with the schema.cypher and copy.cypher files, but replace the COPY statements in copy.cypher. Then I will add a new test file under our copy tests, which will automatically copy the dataset from the cloud into our database and run a bunch of queries.
Does this sound good to you?
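For illustration, the replaced copy.cypher could look roughly like this (the file names and table names are guesses based on the existing tinysnb dataset, not the actual bucket contents):

```cypher
// Hypothetical dataset/tinysnb-cloud/copy.cypher: same schema as tinysnb,
// but loading from the public GCS bucket instead of local files.
COPY person FROM "gs://kuzu-test/tinysnb/vPerson.csv" (HEADER=true);
COPY knows FROM "gs://kuzu-test/tinysnb/eKnows.csv";
```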

@yuchenZhangTG (Author)

@ray6080
I see the copy tests have been moved to end2end. Is ASan still checked there?

I added a GCS test case to the end2end copy test. It seems that anonymous GCS access needs the username anonymous before the bucket name, e.g.:
COPY person FROM "gs://anonymous@kuzu-test/tinysnb/vPerson.csv" (HEADER=true);

@benjaminwinger benjaminwinger force-pushed the cloud-storage branch 6 times, most recently from fd67d73 to 001d7aa on July 12, 2023 22:22
@benjaminwinger (Collaborator)

Google-cloud-cpp is at least getting to the build stage now, but it's still failing for some reason (it was working for me locally). I also still need to make the test run only when the feature is enabled.

On Windows, the only way I could get cURL detected in all of the Kùzu build, the Arrow build, and the google-cloud-cpp build was to define CURL_INCLUDE_DIR and CURL_LIBRARY and pass those values to the Arrow build, which then had to be patched: while Arrow passes the corresponding OpenSSL values on to google-cloud-cpp, it does not pass the cURL ones. This patch should probably be upstreamed (the patch is written as a Python file so that it can fail if CMake applies it multiple times, without breaking the build).
For some reason, backslashes in the CURL_* paths are not being normalized to forward slashes and cause build errors. It works fine if you just use forward slashes in those paths.
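For example, the cache entries involved look roughly like this (paths are placeholders, not the actual CI paths; note the forward slashes):

```cmake
# Hedged sketch: point the Kuzu, Arrow, and google-cloud-cpp builds at the
# same cURL. Forward slashes avoid the backslash normalization problem.
set(CURL_INCLUDE_DIR "C:/deps/curl/include" CACHE PATH "")
set(CURL_LIBRARY "C:/deps/curl/lib/libcurl_imp.lib" CACHE FILEPATH "")
```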

@benjaminwinger (Collaborator) commented Jul 14, 2023

This also seems to cause the Arrow build to approach the maximum path length on Windows. I suspect that the Rust remote-fs feature will be difficult to build on Windows as a result.

But shortening the build directory paths a little helped, and it's succeeding now, with the exception of coverage. Edit: actually, it looks like there are a bunch of ASan tests failing here that do not fail on the master branch.

@ray6080 (Contributor) commented Sep 14, 2023

@yuchenZhangTG @benjaminwinger I'm proposing to close this PR, as it has been open for a while and we have now shifted to a new CSV reader instead of Arrow, so this PR is no longer aligned with the codebase. Happy to re-open it if either of you is interested in continuing this with the new CSV reader, though a new series of PRs for that sounds more reasonable.

@ray6080 (Contributor) commented Sep 15, 2023

Closing this for now, as it has been outdated for a while.

@ray6080 ray6080 closed this Sep 15, 2023