Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++/python] Enable optional global-order writes for sparse arrays #2564

Merged
merged 6 commits into from
May 20, 2024

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented May 15, 2024

Issue and/or context: #2054

The C++-ification of the SOMA API now allows us to utilize global writes when writing to sparse arrays, thus saving time by skipping the coordinate sorting step when the coordinates are already known to be ordered.

Changes:

  • Add sort_coords setting in the PlatformConfig, where by default, it is set to False and thus still uses TILEDB_UNORDERED
  • Add sort_coords parameter to SOMAArray::write and ManagedQuery::submit_write

@nguyenv nguyenv changed the title [c++/python] Allow global writes for sparse arrays [c++/python] Enable global writes for sparse arrays May 15, 2024
Copy link
Contributor

@eddelbuettel eddelbuettel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for the non-Python side of things

Copy link

codecov bot commented May 15, 2024

Codecov Report

Attention: Patch coverage is 90.47619% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 90.24%. Comparing base (34c4b8a) to head (e9b4374).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2564      +/-   ##
==========================================
+ Coverage   90.21%   90.24%   +0.02%     
==========================================
  Files          37       37              
  Lines        3975     4038      +63     
==========================================
+ Hits         3586     3644      +58     
- Misses        389      394       +5     
Flag Coverage Δ
python 90.24% <90.47%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 90.24% <90.47%> (+0.02%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@johnkerl johnkerl changed the title [c++/python] Enable global writes for sparse arrays [c++/python] Enable optional global-order writes for sparse arrays May 15, 2024
Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nguyenv this is a great step forward! Global-order writes are awesome.

But the spelling is a bit off. I went to give this two test drives: one with writing to local disk; the other with writing to a cloud URI. I didn't do the second yet but realized I have some feedback for you after doing the first.

Namely, I checked out this branch & put some prints in to make sure I was getting sort_coords when I asked for them.

Using a performance script we've been looking at this week
https://gist.github.com/johnkerl/d5002b994caccdfb059efba081f28bb5
I put in a "sort_coords": False line into the platform_config.

Then I noticed that sort_coords was not being used in the code, even though I'd included it in my platform_config in the above script.

Then I realized that I needed to now pass platform_config to write. So far so good.

However, our current nesting is like

platform_config =  {
    "tiledb": {
        "create": {
            "key": "value",
            ...
        },
    }
}

and then we use TileDBCreateOptions.from_platform_config(platform_config) to extract out the create_options = platform_config["tiledb"]["create"] subdict and turn it into more of a struct-accessible form so devs can do create_options.key: for example:

https://github.com/single-cell-data/TileDB-SOMA/blob/1.11.1/apis/python/src/tiledbsoma/_sparse_nd_array.py#L131

What confused me -- and is confusing -- is that I had to (a) put sort_coords into my create subkey, and (b) pass that to write.

What we should do is:

  • Keep TileDBCreateOptions for pulling create-time parameters out of platform_config["tiledb"]["create"]
  • Make a new TileDBWriteOptions for pulling write-time parameters out of platform_config["tiledb"]["write"]
  • Users need to be instructed how to set their platform_config["tiledb"]["write"]["sort_coords"] = False

Other that that, I like this! It's a step forward in that it's still 'off by default' and we can subsequently look into connecting this automatically for the times it's correct to automatically do so (CSR and CSC inputs) and not when it's not (pa.Table input to SparseNDArray's write).

@nguyenv
Copy link
Member Author

nguyenv commented May 15, 2024

Thanks for the feedback @johnkerl. I'll definitely make changes so that this goes into write rather than create, and it's more clear to the user how to set this up in the platform config.

@johnkerl
Copy link
Member

Also: I think we should consider the sort_coords to be an override on what we do

  • For CSR/CSC -- unless they explicitly say sort_coords=True then we should not sort the coords -- since we can guarantee ordering
  • For pa.Table/COO -- then we should sort the coords unless they assert to us the coords are pre-sorted

@nguyenv nguyenv requested a review from johnkerl May 17, 2024 16:44
@nguyenv
Copy link
Member Author

nguyenv commented May 20, 2024

Tested writing to cloud, and it is working. (I had to change the soma_init_buffer_bytes to something smaller, so that's why the write times look faster).

(soma-3.11) vivian@mangonada:~/tiledb-bugs$ python sparse-perf-degradation.py
BP.PY INPUT_PATH  norm4
BP.PY OUTPUT_PATH tiledb://vivian/s3://tiledb-vivian/gow
BP.PY COMPRESSION_LEVEL 3
BP.PY BUFFER_SIZE 2097152
BP.PY BATCH_OUT   None

BP.PY OPENING  norm4
BP.PY OPENED   norm4 SECONDS = 0.013
BP.PY CREATING tiledb://vivian/s3://tiledb-vivian/gow
BP.PY CREATED  tiledb://vivian/s3://tiledb-vivian/gow SECONDS = 4.385
BP.PY READING  norm4 batch 0
BP.PY READ     norm4 batch 0 SECONDS = 0.064
BP.PY WRITING  tiledb://vivian/s3://tiledb-vivian/gow batch 0
BP.PY WROTE    tiledb://vivian/s3://tiledb-vivian/gow batch 0 SECONDS = 1.823
BP.PY READ     norm4 batch 1 SECONDS = 0.012
BP.PY WRITING  tiledb://vivian/s3://tiledb-vivian/gow batch 1
BP.PY WROTE    tiledb://vivian/s3://tiledb-vivian/gow batch 1 SECONDS = 1.909
BP.PY READ     norm4 batch 2 SECONDS = 0.010
BP.PY WRITING  tiledb://vivian/s3://tiledb-vivian/gow batch 2
BP.PY WROTE    tiledb://vivian/s3://tiledb-vivian/gow batch 2 SECONDS = 1.891
BP.PY READ     norm4 batch 3 SECONDS = 0.011

Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!!!!

@nguyenv nguyenv merged commit 2c91d42 into main May 20, 2024
25 checks passed
@nguyenv nguyenv deleted the viviannguyen/enable-global-order-writes branch May 20, 2024 22:00
github-actions bot pushed a commit that referenced this pull request May 20, 2024
…2564)

* Add `TileDBWriteOptions`; Set defaults depending on input
* Add assertion checks as reminder to remove warnings in 1.13
* Use `submit_and_finalize` for gow to support on cloud
johnkerl pushed a commit that referenced this pull request May 21, 2024
…2564) (#2598)

* Add `TileDBWriteOptions`; Set defaults depending on input
* Add assertion checks as reminder to remove warnings in 1.13
* Use `submit_and_finalize` for gow to support on cloud

Co-authored-by: nguyenv <vivian@tiledb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants