Question: Using sqlalchemy-file (or similar) ? #2691

VannTen · 2022-08-24T13:18:57Z

This is a question/discussion/maybe proposal on the Thoth storage model.

I think refuting it could help better understand the current storage model, and
hence document it (#2661).

Why do we not use something like sqlalchemy-file, i.e. an extension to the
SQLAlchemy ORM that stores files in "file storage" (including S3) and
references them in the SQL database?

The question arose from working on thoth-station/document-sync-job#37
and #2674. From what I can tell, we use an ad-hoc structure for the S3
documents, by prefix + by date, and we have a significant amount of metadata (->
thoth/storages/result_schema.py).

That looks like data which would be more efficiently handled in SQL (the date
stuff, particularly).

Unifying (i.e. having a single source of truth) also looks like it would make
some features simpler (such as #2657, with judicious cascading deletion).

I still don't fully grok the current storage model, so it's entirely possible
that I'm completely off-base here. I think articulating the reasons why would
really help #2661.

I didn't find the reasoning while browsing through past issues and PRs. If there
is one, please point me to it 👍 !

Thanks

/sig stack-guidance

EDIT: changed the prospective "file field" project. Although it's really new, it
reuses Apache Libcloud and seems better suited to the purpose.

VannTen · 2022-08-26T07:28:35Z

https://github.com/jowilf/sqlalchemy-file (probably more relevant project)

codificat · 2022-08-29T11:25:19Z

/kind question
/priority important-longterm

mayaCostantini · 2022-08-29T11:44:15Z

> From what I can tell, we use an ad-hoc structure for the S3
> documents, by prefix + by date, and we have a significant amount of metadata (->
> thoth/storages/result_schema.py).
> That looks like data which would be more efficiently handled in SQL (the date
> stuff, particularly).

Is the idea here to implement a more generic file storage adapter to be able to eventually migrate documents from S3 to another object storage service?
If that is the case, why not, but I just have two objections to this generalization:

1. I don't think sqlalchemy-media would be a good candidate given that the package has not been updated since 2019, and it is unclear to me if the adoption of file storage solutions within SQLAlchemy by the community is strong enough for us to rely on.
2. > That looks like data which would be more efficiently handled in SQL

   Is there any evidence to support this? I don't know if referring to documents through SQL would drastically improve the efficiency of document storage and retrieval vs. the implementation time and effort it could require.

VannTen · 2022-08-29T12:58:58Z

> Is the idea here to implement a more generic file storage adapter to be able to eventually migrate documents from S3 to another object storage service?

Not exactly; at least, it's not the main benefit from my POV. Rather,
it's about closer integration of file storage into the SQL models.

From my understanding, the idea behind sqlalchemy-media is to basically
have something like this in SQL models:

class SomeModel(Base):
    # Illustrative pseudocode: FileField, DateField and EnumSomething are placeholder types.
    document = FileField()  # accessed like a file object
    date = DateField()
    doc_type = EnumSomething()

The ORM plugin handles the machinery (= storing a reference to the object storage
id and uploading/retrieving it).
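
To make the idea a bit more concrete, here is a minimal sketch of the pattern using only the Python standard library. It does not use sqlalchemy-file or sqlalchemy-media themselves: the `object_store` dict stands in for S3/Ceph, and the `document` table, its columns, and the `store_document` / `load_document` helpers are hypothetical names made up for illustration.

```python
import sqlite3
import uuid
from datetime import date

# Stand-in for an object store such as S3/Ceph: key -> raw bytes (assumption).
object_store: dict[str, bytes] = {}

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE document (
        id INTEGER PRIMARY KEY,
        object_key TEXT NOT NULL UNIQUE,  -- reference into the object store
        doc_type TEXT NOT NULL,
        creation_date TEXT NOT NULL       -- ISO 8601 date string
    )
    """
)


def store_document(doc_type: str, created: date, content: bytes) -> int:
    """Upload the content, then record its key and metadata in SQL."""
    key = uuid.uuid4().hex
    object_store[key] = content
    cur = conn.execute(
        "INSERT INTO document (object_key, doc_type, creation_date) VALUES (?, ?, ?)",
        (key, doc_type, created.isoformat()),
    )
    return cur.lastrowid


def load_document(doc_id: int) -> bytes:
    """Resolve the SQL row to its object key and fetch the content."""
    row = conn.execute(
        "SELECT object_key FROM document WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    return object_store[row[0]]
```

In this sketch the SQL row is the only place that knows where the content lives, which is the "single source of truth" aspect; a real ORM plugin would hide the two helper functions behind the column type.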

> If that is the case, why not, but I just have two objections to this generalization:
>
> I don't think sqlalchemy-media would be a good candidate given that the package has not been updated since 2019, and it is unclear to me if the adoption of file storage solutions within SQLAlchemy by the community is strong enough for us to rely on.

Yeah, I agree. It was more meant as an illustration of the idea.

> > That looks like data which would be more efficiently handled in SQL
>
> Is there any evidence to support this? I don't know if referring to documents through SQL would drastically improve the efficiency of document storage and retrieval vs. the implementation time and effort it could require.

I was mainly thinking about how we handle metadata.
For the date example, instead of iterating through date prefixes, we would
just do the equivalent of SELECT * FROM doc_table WHERE creation_date >= date.
So it's more a "less code" efficiency than a "better perf" (sorry, that was not
clear).
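
A rough sketch of the difference, using only the standard library. It assumes object keys laid out as `<prefix>/<date>/<name>`, which is a simplification and not necessarily the actual Thoth layout; `doc_table` and `creation_date` come from the query above, while the `object_key` and `doc_type` columns, the sample keys, and both function names are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

# Sample keys following the assumed "<prefix>/<date>/<name>" layout.
keys = ["demo/2022-08-27/a", "demo/2022-08-28/b", "demo/2022-08-29/c"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doc_table (object_key TEXT, doc_type TEXT, creation_date TEXT)"
)
conn.executemany(
    "INSERT INTO doc_table VALUES (?, ?, ?)",
    [(k, k.split("/")[0], k.split("/")[1]) for k in keys],
)


def keys_by_prefix(all_keys, doc_type, start, end):
    """Object-store style: build one prefix per day and match keys against it."""
    found = []
    day = start
    while day <= end:
        prefix = f"{doc_type}/{day.isoformat()}/"
        found.extend(k for k in all_keys if k.startswith(prefix))
        day += timedelta(days=1)
    return found


def keys_by_sql(connection, doc_type, start):
    """SQL style: one query; ISO 8601 date strings compare chronologically."""
    rows = connection.execute(
        "SELECT object_key FROM doc_table "
        "WHERE doc_type = ? AND creation_date >= ? "
        "ORDER BY creation_date",
        (doc_type, start.isoformat()),
    )
    return [row[0] for row in rows]
```

Both functions return the same keys for the same range; the SQL variant simply moves the date logic out of application code, which is the "less code" point rather than a performance claim.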

Another random thing:

Do we have an average size for the documents (I'm currently searching through
the docs) ? I understand it's JSON, but are we around a few kB ? a few MB ?
Would the JSONB field type in postgres be of any relevance ?

mayaCostantini · 2022-08-29T13:22:45Z

> The ORM plugin handles the machinery (= storing a reference to the object storage
> id and uploading/retrieving it).

Thanks for the example, this is clearer to me now.

> So it's more a "less code" efficiency than a "better perf" (sorry, that was not
> clear).

Ok, I see how that could be useful. However, I think the current implementation of models might be clear enough, and I am not sure if mixing up file storage handling and tables implementation is a good idea readability-wise.

@harshad16 wdyt?

VannTen · 2022-08-29T13:30:20Z

> I am not sure if mixing up file storage handling and tables implementation is a good idea readability-wise.

If we implemented this, storages should not do any file storage handling at all; the plugin should handle it. If it (= handling file storage) bleeds into the storages code, we would end up with the same thing as now (just done differently ^), with more dependencies, so yeah, that would be useless.

mayaCostantini · 2022-08-29T13:31:16Z

> Another random thing:
> Do we have an average size for the documents (I'm currently searching through
> the docs) ? I understand it's JSON, but are we around a few kB ? a few MB ?
> Would the JSONB field type in postgres be of any relevance ?

It would be useful (and not too time-consuming I guess) to have an estimation for that, but from memory, I know that some documents take up a lot of memory and it would be inconvenient to store them directly in the database via JSONB. In this case, I don't think we should start storing any document directly in the database.

VannTen · 2022-08-29T13:37:09Z

> It would be useful (and not too time-consuming I guess) to have an estimation for that

I'll open an issue in metrics-exporter (unless maybe it's already there) to see if we can come up with something (probably).

sesheta added the sig/stack-guidance Categorizes an issue or PR as relevant to SIG Stack Guidance. label Aug 24, 2022

sesheta added kind/question Categorizes issue or PR as a support question. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Aug 29, 2022

This was referenced Aug 29, 2022

Metrics on document sizes thoth-station/metrics-exporter#869

Open

Developper documentation relation postgres <--> Ceph #2658

Open

VannTen changed the title ~~Question: Using sqlalchemy-media (or similar) ?~~ Question: Using sqlalchemy-file (or similar) ? Sep 5, 2022

VannTen mentioned this issue Jan 30, 2023

Have a single source of truth model for storages #2767

Open
