
Cloud storage #1582

Closed · wants to merge 13 commits

Conversation


@yuchenZhangTG yuchenZhangTG commented May 28, 2023

  1. Support COPY from different filesystems (AWS S3, Google Cloud Storage):
     copy Comment from "gs://kuzubi/example.csv";
  2. Support COPY from compressed CSV files:
     copy Comment from "gs://kuzubi/example.csv.gz";
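For example, the same statement can read from an S3 bucket (the bucket name here is illustrative, not a real dataset):

```cypher
copy Comment from "s3://my-bucket/example.csv";
```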

I have read and agree to the terms under CLA.md.

@yuchenZhangTG yuchenZhangTG marked this pull request as draft May 30, 2023 08:16
@yuchenZhangTG yuchenZhangTG marked this pull request as ready for review May 30, 2023 08:31
@ray6080 ray6080 requested review from ray6080 and mewim May 30, 2023 13:11
@yuchenZhangTG (Author)

Hmm, my changes require a system-installed cURL.

I see your tests run on self-hosted machines. Is it possible to install cURL on them?

@mewim (Member) commented May 30, 2023

Hi @yuchenZhangTG we would like to ensure that Kùzu can be built without installing any third party dependencies manually. If your change requires libcurl, could you please bring it to third_party?

@yuchenZhangTG (Author)

> Hi @yuchenZhangTG we would like to ensure that Kùzu can be built without installing any third party dependencies manually. If your change requires libcurl, could you please bring it to third_party?

As mentioned in https://arrow.apache.org/docs/dev/developers/cpp/building.html:

> google_cloud_cpp_storage: for Google Cloud Storage support, requires system cURL and can use the BUNDLED method described below

System cURL is required by the AWS SDK and google_cloud_cpp_storage, and a custom-built cURL does not seem to be detected by Apache Arrow's dependency build.

The dependency build generates its own CMake commands and environment variables. I tried passing in various -D flags and environment variables, but it did not work.
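For example, I was trying things along these lines (a hedged sketch; the variable names and the way Kùzu wires up the Arrow build are assumptions):

```cmake
# Sketch only: forward system cURL hints into the Arrow dependency build.
find_package(CURL REQUIRED)
list(APPEND ARROW_CMAKE_ARGS
     -DARROW_FILESYSTEM=ON
     -DARROW_S3=ON
     -DARROW_GCS=ON
     -DCURL_INCLUDE_DIR=${CURL_INCLUDE_DIRS}
     -DCURL_LIBRARY=${CURL_LIBRARIES})
# ARROW_CMAKE_ARGS would then be passed as CMAKE_ARGS to the external project
# that builds Arrow; its bundled dependency build still did not pick them up.
```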

I also tried adding cURL to the external folder next to Apache Arrow, since third_party is built after Apache Arrow.

@mewim What do you think?

@ray6080 (Contributor) left a comment

Hi @yuchenZhangTG, this feature looks awesome! We do hope to integrate well with cloud storage and have some ideas on our roadmap (though they are not publicly written down yet).
As for this PR, can you suggest a good way to test it? Should we host a private bucket with some of our test datasets on S3/GCS and deploy secrets on our CI machines?

Btw, feel free to join our Slack workspace here (https://join.slack.com/t/kuzudb/shared_invite/zt-1w0thj6s7-0bLaU8Sb~4fDMKJ~oejG_g) 🙂 We'd love to learn more about your use cases for this feature!

@yuchenZhangTG (Author)

@ray6080
The current implementation uses the Filesystem module in Apache Arrow, but it requires a system-installed libcurl-dev, and a custom-built cURL does not seem to work.

For testing, yes, it would be great if you set up your own AWS and GCP accounts and create private buckets for testing. Cloud storage is not expensive, so you should be able to cover the cost. It is better to host the secrets on your CI machines (people sometimes put secrets in the GitHub code because it is convenient, but it is not safe). As far as I know, GCP does not charge for downloads, and AWS only charges for cross-region downloads. It is also fine to make the buckets public if you want to share the use case in the Kùzu cloud demo (the one on your website).

For cloud import: I loaded a 180 MB Snappy-compressed Parquet file in 11 s from a local SSD and 14 s from a GCS bucket, so copying from GCS is pretty fast.
For compressed CSV: I tested on Google Cloud Storage (GCS) with the LDBC BI SF30 dataset and imported the comment.csv.gz files; it worked.

@ray6080 (Contributor) commented Jun 1, 2023

> The current implementation uses the Filesystem module in Apache Arrow, but it requires a system-installed libcurl-dev, and a custom-built cURL does not seem to work.
>
> For testing, yes, it would be great if you set up your own AWS and GCP accounts and create private buckets for testing. […]
>
> For cloud import: I loaded a 180 MB Snappy-compressed Parquet file in 11 s from a local SSD and 14 s from a GCS bucket, so copying from GCS is pretty fast. For compressed CSV: I tested on Google Cloud Storage (GCS) with the LDBC BI SF30 dataset and imported the comment.csv.gz files; it worked.

Thanks @yuchenZhangTG! This looks promising. We will set up a testing pipeline for this asap after our upcoming release next week.
As a workaround for the third-party dependency problem for now, is it possible to add a new CMake option, ENABLE_HTTPS, and pass it into the Arrow CMake file, so that we don't introduce an extra system-installed dependency when the option is not enabled?
In the longer term, we will set up an extension mechanism similar to what DuckDB does and offload the cloud part to extensions.
Let us know if you have better ideas!

@yuchenZhangTG (Author) commented Jun 1, 2023

The option in DuckDB is HTTPFS, which is HTTP (HTTP support) + FS (AWS S3 support). My feature does not support HTTP, only S3 and GCS, so it would be better to call it ENABLE_FILESYSTEM.

DuckDB wrote their own HTTPFS extension. The extension is separable from their package, and users install it with the `load httpfs` command. For Kùzu, however, where we can simply use the filesystem support in Apache Arrow, we would need to rebuild the whole Arrow package.

Another solution:
For building from source, we turn on FileSystem in Arrow by default. If the cURL package is not available, we turn off FileSystem and print a warning message.
If ENABLE_FILESYSTEM is explicitly set to true and cURL is not available, we throw an error and ask the user to install the cURL library.
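A minimal sketch of that behaviour, assuming an ENABLE_FILESYSTEM option (the names and the AUTO default are assumptions, not the final implementation):

```cmake
# Sketch: enable Arrow's cloud filesystems only when libcurl is available.
set(ENABLE_FILESYSTEM "AUTO" CACHE STRING "Arrow S3/GCS filesystem support (ON/OFF/AUTO)")

find_package(CURL QUIET)
if(ENABLE_FILESYSTEM STREQUAL "ON" AND NOT CURL_FOUND)
    # The user explicitly asked for cloud filesystems: fail loudly.
    message(FATAL_ERROR "ENABLE_FILESYSTEM=ON requires libcurl; "
        "install it, e.g. `sudo apt-get install libcurl4-openssl-dev`.")
elseif(ENABLE_FILESYSTEM STREQUAL "AUTO")
    if(CURL_FOUND)
        set(ENABLE_FILESYSTEM ON)
    else()
        message(WARNING "libcurl not found; building without cloud filesystem support.")
        set(ENABLE_FILESYSTEM OFF)
    endif()
endif()
```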

For the pre-compiled packages, I suggest building with FileSystem enabled.

What do you think? @ray6080

@ray6080 (Contributor) commented Jun 2, 2023

> The option in DuckDB is HTTPFS, which is HTTP (HTTP support) + FS (AWS S3 support). My feature does not support HTTP, only S3 and GCS, so it would be better to call it ENABLE_FILESYSTEM.

You're right. How about we be more explicit and call it ENABLE_CLOUD_FS, so we can differentiate it from the local FS?

> DuckDB wrote their own HTTPFS extension. The extension is separable from their package, and users install it with the `load httpfs` command. For Kùzu, however, where we can simply use the filesystem support in Apache Arrow, we would need to rebuild the whole Arrow package.

Yep, I agree that it's not a good idea to do the FS extension with Arrow. We have a plan inside the team to remove Arrow; that's why I mentioned the extension idea.

> Another solution: for building from source, we turn on FileSystem in Arrow by default. If the cURL package is not available, we turn off FileSystem and print a warning message. […]

This sounds good to me. @mewim what do you think?

> For the pre-compiled packages, I suggest building with FileSystem enabled.

Agreed!

@mewim (Member) commented Jun 2, 2023

@ray6080 @yuchenZhangTG This sounds good to me. I double-checked, and it seems that curl is already installed on the CI runners:

RUN apt-get update && apt-get install -y g++ gcc clang-14 python3-dev python3-pip python-is-python3 cmake nodejs jq curl apt-transport-https gnupg sudo git clang-format-11 ca-certificates lsb-release wget lcov
I am not sure why the previous build did not go through. Is there another library required to make this work?

@yuchenZhangTG (Author) commented Jun 4, 2023

@mewim
I mean libcurl, not curl:

sudo apt-get install libcurl4-openssl-dev

The build does not work because libcurl4-openssl-dev is missing.

@yuchenZhangTG yuchenZhangTG force-pushed the cloud-storage branch 2 times, most recently from cd04183 to eca9c5a on June 4, 2023 21:40
@ray6080 (Contributor) left a comment

Thanks @yuchenZhangTG! The PR looks good to me. I hope to take another quick look before we merge.
Btw, can you take a look at our CLA file? And please add "I have read and agree to the terms under CLA.md." to the PR description if you're fine with it. Thanks!

src/storage/copier/table_copy_utils.cpp (resolved)
Makefile (outdated, resolved)
src/binder/bind/bind_copy.cpp (outdated, resolved)
src/include/binder/binder.h (outdated, resolved)
src/include/storage/copier/table_copy_executor.h (outdated, resolved)
@mewim (Member) commented Jun 6, 2023

I have redeployed all the CI runners with libcurl.

@codecov bot commented Jun 8, 2023

Codecov Report

Patch coverage: 57.74% and project coverage change: -0.10% ⚠️

Comparison is base (1e9d4fd) 91.19% compared to head (c9f5e21) 91.10%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1582      +/-   ##
==========================================
- Coverage   91.19%   91.10%   -0.10%     
==========================================
  Files         779      779              
  Lines       28606    28654      +48     
==========================================
+ Hits        26088    26104      +16     
- Misses       2518     2550      +32     
Impacted Files Coverage Δ
src/include/binder/binder.h 100.00% <ø> (ø)
src/include/common/copier_config/copier_config.h 100.00% <ø> (ø)
src/include/storage/copier/table_copy_utils.h 100.00% <ø> (ø)
src/common/copier_config/copier_config.cpp 51.85% <40.00%> (-36.39%) ⬇️
src/storage/copier/table_copy_utils.cpp 71.53% <60.00%> (-1.45%) ⬇️
src/binder/bind/bind_copy.cpp 76.84% <61.11%> (-8.88%) ⬇️


@ray6080 (Contributor) commented Jun 8, 2023

Thanks @yuchenZhangTG. Can you manually squash your commits into one? That's a convention we're following. Then I think it's ready to merge!
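For reference, one common way to do this (assuming the branch is based on master; any equivalent approach works):

```sh
git rebase -i master        # mark every commit except the first as "squash"
git push --force-with-lease origin cloud-storage
```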

@yuchenZhangTG yuchenZhangTG force-pushed the cloud-storage branch 2 times, most recently from 846d8e9 to cfed035 on June 8, 2023 04:07
@yuchenZhangTG (Author)

> Thanks @yuchenZhangTG. Can you manually squash your commits into one? That's a convention we're following. Then I think it's ready to merge!

Sounds great.

@ray6080 (Contributor) left a comment

@yuchenZhangTG It seems there is something wrong with the ASan test run, which is taking unusually long. Let's figure out what is happening before merging.

Meanwhile, I've set up a public GCS bucket here (https://storage.googleapis.com/kuzu-test/tinysnb) with CSV files under it. Can you verify that it works and help us set up a test case on it?
My idea is that we create a new dataset directory, tinysnb-cloud, under our dataset directory with the schema.cypher and copy.cypher files, but replace the COPY statements in copy.cypher. Then I will add a new test file under our copy tests, which will automatically copy the dataset from the cloud into our database and run a bunch of queries.
Does this sound good to you?
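For illustration, the replaced copy.cypher could look roughly like this (the file names and table names are guesses based on the existing tinysnb dataset, not the actual bucket contents):

```cypher
// Hypothetical dataset/tinysnb-cloud/copy.cypher: same schema as tinysnb,
// but loading from the public GCS bucket instead of local files.
COPY person FROM "gs://kuzu-test/tinysnb/vPerson.csv" (HEADER=true);
COPY knows FROM "gs://kuzu-test/tinysnb/eKnows.csv";
```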

@yuchenZhangTG (Author)

@ray6080
I see the copy tests have been moved to end2end. Is ASan still checked there?

I added a GCS test case to the end2end copy test. It seems that anonymous GCS access needs the username anonymous before the bucket name, e.g.:
COPY person FROM "gs://anonymous@kuzu-test/tinysnb/vPerson.csv" (HEADER=true);

@benjaminwinger benjaminwinger force-pushed the cloud-storage branch 6 times, most recently from fd67d73 to 001d7aa on July 12, 2023 22:22
@benjaminwinger (Collaborator)

Google-cloud-cpp is at least getting to the build stage now, but it's still failing for some reason (it was working for me locally). I also still need to make the test run only when the feature is enabled.

On Windows, the only way I could get cURL detected in all of the Kùzu build, the Arrow build, and the google-cloud-cpp build was to define CURL_INCLUDE_DIR and CURL_LIBRARY and pass those values to the Arrow build, which then had to be patched: while Arrow passes the corresponding OpenSSL values on to google-cloud-cpp, it does not pass the cURL ones. This patch should probably be upstreamed (the patch is written as a Python file so that it can fail if CMake applies it multiple times, without breaking the build).
For some reason, backslashes in the CURL_* paths are not being normalized to forward slashes and cause build errors. It works fine if you just use forward slashes in those paths.
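For example, the cache entries involved look roughly like this (paths are placeholders, not the actual CI paths; note the forward slashes):

```cmake
# Hedged sketch: point the Kuzu, Arrow, and google-cloud-cpp builds at the
# same cURL. Forward slashes avoid the backslash normalization problem.
set(CURL_INCLUDE_DIR "C:/deps/curl/include" CACHE PATH "")
set(CURL_LIBRARY "C:/deps/curl/lib/libcurl_imp.lib" CACHE FILEPATH "")
```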

@benjaminwinger (Collaborator) commented Jul 14, 2023

This also seems to cause the Arrow build to approach the maximum path length on Windows. I suspect that the Rust remote-fs feature will be difficult to build on Windows as a result.

But shortening the build directory paths a little helped, and it's succeeding now, with the exception of coverage. Edit: actually, it looks like there are a bunch of ASan tests failing here that do not fail on the master branch.

@ray6080 (Contributor) commented Sep 14, 2023

@yuchenZhangTG @benjaminwinger I'm proposing to close this PR, as it has been open for a while and we have now shifted to a new CSV reader instead of Arrow, so this PR is no longer aligned with the codebase. Happy to re-open it if either of you is interested in continuing this with the new CSV reader, though a new series of PRs for that sounds more reasonable.

@ray6080 (Contributor) commented Sep 15, 2023

Closing this for now, as it has been outdated for a while.

@ray6080 ray6080 closed this Sep 15, 2023