
feat(model): Require size when writing to ProvenanceFileStorage #7292

Merged: 1 commit into main from provenance-file-storage-add-size on Jul 17, 2023

Conversation

mnonnenmacher (Member)

Add a `size` argument to the `putData` function of the `ProvenanceFileStorage` which represents the size of the input stream to be written.

This is a precondition for being able to use the PostgreSQL large object storage as a backend for the storage, which is a requirement in the ongoing ORT Server development.

The current implementations of `ProvenanceFileStorage` do not require the size of the input stream, but the size is easily available for all callers of the `putData` function.
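As a rough illustration of the change, here is a minimal sketch of what the extended interface could look like. The `putData` and `ProvenanceFileStorage` names come from this PR; the exact parameter names, the companion functions, and the use of `KnownProvenance` are assumptions for illustration only:

```kotlin
import java.io.InputStream

import org.ossreviewtoolkit.model.KnownProvenance

interface ProvenanceFileStorage {
    /** Return whether data for the given [provenance] is available. */
    fun hasData(provenance: KnownProvenance): Boolean

    /**
     * Store [data] of the given [size] for [provenance]. The size of the stream must now be
     * provided by the caller, so that backends which need the length upfront can be supported.
     */
    fun putData(provenance: KnownProvenance, data: InputStream, size: Long)

    /** Return the stored data for [provenance], or null if there is none. */
    fun getData(provenance: KnownProvenance): InputStream?
}
```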

@mnonnenmacher mnonnenmacher requested a review from a team as a code owner July 14, 2023 14:42
@@ -22,6 +22,7 @@ package org.ossreviewtoolkit.model.utils
import io.kotest.core.spec.style.WordSpec
import io.kotest.matchers.shouldBe

import java.io.ByteArrayInputStream
Member

How about adding, as "syntactic sugar", a `fun putData(provenance, bytes: ByteArray)` overload, to make the caller less verbose?
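A sketch of the suggested convenience overload, assuming the interface shape above; as an extension function it would keep the interface itself unchanged (names are hypothetical, this is not part of the actual change):

```kotlin
import java.io.ByteArrayInputStream

// Hypothetical "syntactic sugar" overload: wrap a ByteArray so that callers do not have to
// construct the stream and pass the size themselves.
fun ProvenanceFileStorage.putData(provenance: KnownProvenance, bytes: ByteArray) =
    putData(provenance, ByteArrayInputStream(bytes), bytes.size.toLong())
```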

Member Author

You mean just for the test code? Because I think the change in FileListResolver is not that verbose.

Member
@fviernau fviernau Jul 14, 2023

I meant for the file list resolver. Anyhow, I still do not understand why it is needed at all. Does the commit message intend to reference `int lo_read(PGconn *conn, int fd, char *buf, size_t len);`?

I'm asking because for `len` one should pass the size of the buffer, not the size of the whole stream.

Member Author

I don't know enough about the implementation; maybe it's also a requirement of the JDBC driver or similar. But after reading some docs I think it should be possible to remove it again later on. Therefore I have rephrased the commit message so that this is basically an ask from the people working on the ORT Server to introduce this to unblock us for now, pending later investigation of whether it can be removed again.

Member

Hm, is there really no pointer to any API docs which show that it's needed?

Member

I'll leave the decision whether to merge up to you guys, even though I would prefer to wait until there is a clear reason that it is really required, because all the API docs we have seen so far seem to provide a way without the length.

Member

TBH, I'm also reluctant towards this "dummy" change. @mnonnenmacher, could you maybe explain why it's easier to make this change here than to change the server to not require the length?

Member

To give you a bit of background information:
In ORT Server, we have introduced a generic storage API for storing all kinds of data. It can be backed by different implementations. Postgres is one of those; another one could be a file storage like Azure Storage Accounts. When designing the API we had the different possible implementations in mind, and some of them require the size of the data to be known beforehand. This is, for instance, needed by the putBlob operation from Azure, where requests require a correct Content-Length header.

For Postgres itself, the data size is not required for writing or reading blobs. However, the implementation in ORT Server uses the size information to distinguish between data that can be read into memory and large data that needs to be streamed to the client. If the size is below a configurable threshold, the data is read directly and sent as the response to the client. Otherwise, we need to work with temporary files to buffer the data. This is because Postgres blobs can only be accessed within a transaction, but when sending the HTTP response to the client, the transaction is already closed.
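A rough sketch of the size-based strategy described above, assuming a configurable threshold; the function names and response helpers are hypothetical stand-ins, not the actual ORT Server code:

```kotlin
import java.io.File
import java.io.InputStream

// Hypothetical response helpers; in ORT Server these would be the HTTP framework's response calls.
fun respondBytes(bytes: ByteArray) { /* send the bytes as the HTTP response body */ }
fun respondFile(file: File) { /* stream the file as the HTTP response body */ }

fun serveBlob(size: Long, inMemoryThreshold: Long, openBlob: () -> InputStream) {
    if (size <= inMemoryThreshold) {
        // Small data: read it fully while the database transaction is still open and respond
        // directly from memory.
        respondBytes(openBlob().use { it.readBytes() })
    } else {
        // Large data: buffer it to a temporary file inside the transaction, because the blob can
        // only be accessed while the transaction is open, then stream the file to the client
        // after the transaction has been closed.
        val tempFile = File.createTempFile("blob", null)
        openBlob().use { input -> tempFile.outputStream().use { output -> input.copyTo(output) } }
        respondFile(tempFile)
    }
}
```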

Member

Thanks for the details @oheger-bosch. So if, instead of Postgres, the Azure backend were mentioned in the commit message as an example that really requires the length (like you do above), and the information from the last paragraph that ORT Server might use different "modes" depending on the size of the data also went into the commit message, I'd be OK with this change.

Member Author

I have updated the commit message.

@codecov

codecov bot commented Jul 14, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (04849fd) 61.20% compared to head (a0de798) 61.20%.

❗ Current head a0de798 differs from pull request most recent head 5c879da. Consider uploading reports for the commit 5c879da to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #7292   +/-   ##
=========================================
  Coverage     61.20%   61.20%           
  Complexity     1969     1969           
=========================================
  Files           335      335           
  Lines         16487    16488    +1     
  Branches       2350     2350           
=========================================
+ Hits          10091    10092    +1     
  Misses         5419     5419           
  Partials        977      977           
Flag Coverage Δ
test 36.59% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...src/main/kotlin/utils/FileProvenanceFileStorage.kt 63.63% <ø> (ø)
...main/kotlin/utils/PostgresProvenanceFileStorage.kt 92.68% <ø> (ø)
model/src/main/kotlin/utils/FileArchiver.kt 85.71% <100.00%> (ø)
scanner/src/main/kotlin/utils/FileListResolver.kt 85.71% <100.00%> (+0.71%) ⬆️

☔ View full report in Codecov by Sentry.

@mnonnenmacher mnonnenmacher force-pushed the provenance-file-storage-add-size branch 2 times, most recently from 76c0f8e to a0de798 on July 17, 2023 08:00
Add a `size` argument to the `putData` function of the
`ProvenanceFileStorage` which represents the size of the input stream
to be written.

This is currently a requirement to implement the `ProvenanceFileStorage`
for the ORT Server using a generic storage interface it provides. This
interface requires the size because some potential storage backends
require this, for example the putBlob [1] method from the Azure blob
storage requires a Content-Length header to be set. The size is also
required by the PostgreSQL implementation of this storage interface
which uses it to decide if the data is directly streamed to the client
or if it needs to be buffered in a file. This is necessary because
PostgreSQL requires that blobs are read in a transaction, but the time
the transaction stays open should not depend on the time it takes to
transfer the data to the client.

The current implementations of `ProvenanceFileStorage` do not require
the size of the input stream, but the size is easily available for all
callers of the `putData` function. However, if, for example, Azure should
be implemented as a storage backend in the future, the size argument
would be required as explained above.

[1]: https://learn.microsoft.com/en-us/rest/api/storageservices/put-blob

Signed-off-by: Martin Nonnenmacher <martin.nonnenmacher@bosch.io>
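The commit message above cites the Azure putBlob Content-Length requirement. To illustrate generically why a streaming upload can require the length upfront, independent of any particular Azure SDK: with plain `java.net.HttpURLConnection`, a fixed-length streaming upload needs the exact size before any body data is sent, because it is emitted as the Content-Length header. This is only a generic sketch; the URL and error handling are placeholders:

```kotlin
import java.io.InputStream
import java.net.HttpURLConnection
import java.net.URI

fun uploadWithKnownLength(url: String, data: InputStream, size: Long) {
    val connection = URI(url).toURL().openConnection() as HttpURLConnection
    connection.requestMethod = "PUT"
    connection.doOutput = true

    // Must be set before connecting: the given size is sent as the Content-Length header, and
    // the request fails if the body written afterwards does not match it exactly.
    connection.setFixedLengthStreamingMode(size)

    connection.outputStream.use { output -> data.copyTo(output) }
    check(connection.responseCode in 200..299) { "Upload failed with HTTP ${connection.responseCode}." }
}
```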
@mnonnenmacher mnonnenmacher force-pushed the provenance-file-storage-add-size branch from a0de798 to 5c879da on July 17, 2023 08:16
@mnonnenmacher mnonnenmacher merged commit 119a58d into main Jul 17, 2023
17 of 19 checks passed
@mnonnenmacher mnonnenmacher deleted the provenance-file-storage-add-size branch July 17, 2023 09:48
@mnonnenmacher
Member Author

Merged as the failing GoModFunTest and GitWorkingTreeFunTest are unrelated to the change.

@sschuberth sschuberth added the labels "release notes" (Changes that should be mentioned in release notes) and "breaking API change" (Pull requests that break compatibility with previous API) on Jul 20, 2023