Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Using sqlalchemy-file (or similar) ? #2691

Open
VannTen opened this issue Aug 24, 2022 · 8 comments
Open

Question: Using sqlalchemy-file (or similar) ? #2691

VannTen opened this issue Aug 24, 2022 · 8 comments
Labels
kind/question Categorizes issue or PR as a support question. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance.

Comments

@VannTen
Copy link
Member

VannTen commented Aug 24, 2022

This is a question/discussion/maybe proposal on the Thoth storage model.

I think refuting it could help better understand the current storage model, and
hence document it (#2661).


Why do we not use something like sqlalchemy-file
-> ie, an extension to the sql-alchemy ORM to store files in "file storage"
(including S3) and reference them in the SQL database ?

The question arose from working on thoth-station/document-sync-job#37
and #2674. From what I can tell, we use ad-hoc structuration of the S3
documents, by prefix + by date, and we have a important number of metadata (->
thoth/storages/result_schema.py).

That looks like data which would be more efficiently handled in SQL (the date
stuff, particularly).

Unifying (aka, single source of truth) also looks like some features would
become simpler (such as #2657 with judicious cascading deletion).


I still don't grok fully the current storage model, so it's entirely possible
that I'm completely off-base here. I think articulating the reasons why would
really help #2661.

I didn't find a reasoning browsing through past issues and PR. If there is one,
please point me to it 👍 !

Thanks

/sig stack-guidance

EDIT: changed prospective "file field" project. Although it's really knew, it
reuses Apache libcloud and seems more suited to the purpose.

@sesheta sesheta added the sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. label Aug 24, 2022
@VannTen
Copy link
Member Author

VannTen commented Aug 26, 2022

https://github.com/jowilf/sqlalchemy-file (probably more relevant project)

@codificat
Copy link
Member

/kind question
/priority important-longterm

@sesheta sesheta added kind/question Categorizes issue or PR as a support question. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 29, 2022
@mayaCostantini
Copy link
Contributor

From what I can tell, we use ad-hoc structuration of the S3
documents, by prefix + by date, and we have a important number of metadata (->
thoth/storages/result_schema.py).
That looks like data which would be more efficiently handled in SQL (the date
stuff, particularly).

Is the idea here to implement a more generic file storage adapter to be able to eventually migrate documents from S3 to another object storage service?
If that is the case, why not, but I just have two objections to this generalization:

  • I don't think sqlalchemy-media would be a good candidate given that the package has not been updated since 2019, and it is unclear to me if the adoption of file storage solutions within SQLAlchemy by the community is strong enough for us to rely on.
  • That looks like data which would be more efficiently handled in SQL

Is there any evidence to support this? I don't know if referring to documents through SQL would drastically improve the efficiency of document storage and retrieval vs. the implementation time and effort it could require.

@VannTen
Copy link
Member Author

VannTen commented Aug 29, 2022

Is the idea here to implement a more generic file storage adapter to be able to eventually migrate documents from S3 to another object storage service?

Not exactly, at least, it's not the main benefits from my POV. Rather,
it's about closer integration of file storage in the SQL models.

From my understanding, the idea behind sqlalchemy-media is to basically
have something like this in SQL models:

class SomeModel(Base):
    
    document = FileField() # accessed like a FileObject
    date = DateField
    doc_type = EnumSomething

The ORM plugin handle the machinery (= storing a reference to the object storage
id and uploading/retrieving it).

If that is the case, why not, but I just have two objections to this generalization:

  • I don't think sqlalchemy-media would be a good candidate given that the package has not been updated since 2019, and it is unclear to me if the adoption of file storage solutions within SQLAlchemy by the community is strong enough for us to rely on.

Yeah, I agree. It was more meant as a illustration of the idea.

  • That looks like data which would be more efficiently handled in SQL

Is there any evidence to support this? I don't know if referring to documents through SQL would drastically improve the efficiency of document storage and retrieval vs. the implementation time and effort it could require.

I was mainly thinking about how we handle metadata.
For the dates examples, instead of iterating through date prefixes for
examples, we would just do the equivalent of SELECT FROM doc_table where creation_date >= date.
So it's more a "less code" efficiency than a "better perf" (sorry, that was not
clear).


Another random thing:

Do we have an average size for the documents (I'm currently searching through
the docs) ? I understand it's JSON, but are we around a few kB ? a few MB ?
Would the JSONB field type in postgres be of any relevance ?

@mayaCostantini
Copy link
Contributor

The ORM plugin handle the machinery (= storing a reference to the object storage
id and uploading/retrieving it).

Thanks for the example, this is more clear to me now.

So it's more a "less code" efficiency than a "better perf" (sorry, that was not
clear).

Ok, I see how that could be useful. However I think the current implementation of models might be clear enough and I am not sure if mixing up file storage handling and tables implementation is a good idea readability-wise.

@harshad16 wdyt?

@VannTen
Copy link
Member Author

VannTen commented Aug 29, 2022 via email

@mayaCostantini
Copy link
Contributor

Another random thing:
Do we have an average size for the documents (I'm currently searching through
the docs) ? I understand it's JSON, but are we around a few kB ? a few MB ?
Would the JSONB field type in postgres be of any relevance ?

It would be useful (and not too time-consuming I guess) to have an estimation for that, but from memory, I know that some documents take up a lot of memory and it would be inconvenient to store them directly in the database via JSONB. In this case, I don't think we should start storing any document directly in the database.

@VannTen
Copy link
Member Author

VannTen commented Aug 29, 2022 via email

@VannTen VannTen changed the title Question: Using sqlalchemy-media (or similar) ? Question: Using sqlalchemy-file (or similar) ? Sep 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/question Categorizes issue or PR as a support question. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance.
Projects
Status: 🆕 New
Development

No branches or pull requests

4 participants