Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't lowercase blob extensions #3227

Closed

Conversation

dotlambda
Copy link
Collaborator

Some programs require specific casing, e.g. .Rmd.

fixes deltachat/deltachat-desktop#2685

cc @dumblob

Some programs require specific casing, e.g. `.Rmd`.
Copy link
Member

@r10s r10s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have to check all calls to suffix() and sanitise_name() and check if all subsequent comparisons are case-insensitive (and take care if the suffix is stored somewhere else and also check these code parts).

also this pr removes some of the "sanitize" aspects: extensions are not really sanitized any longer, things as image.JPG are "wronger" than image.jpg ... while, in general, it seems to be good to keep the original file name - the function is as it is esp. for received files, that are easily messed up completely, all upper case etc by some mua, bad characters etc. so, keeping the original name is not always doable anyway.

so, all in all - not sure about this pr. maybe we should have a positive list of extensions that really use case and skip the lowercasing only for them.

cc @flub, who did the original blob-implementation with lowercasing extensions on purpose: are there any other reasons we've overseen? not that we mess up other things by fixing things for some rarely used extension.

@@ -298,9 +298,6 @@ impl<'a> BlobObject<'a> {
}

/// Returns the extension of the blob.
///
/// If a blob's filename has an extension, it is always guaranteed
/// to be lowercase.
pub fn suffix(&self) -> Option<&str> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is at least one place i know of that relies on the old definition of the suffix to be lowercase: there is a comparison against the webxdc extension

@hpk42
Copy link
Contributor

hpk42 commented Apr 17, 2022

haven't looked at the involved code parts for longer but conceptually i'd expect that we keep and re-use the "display" filename on receiving attachments, and when sending them out, use those original displaynames. That we normalize the names for use as keys in databases, or for determining filenames is an orthogonal thing. For the purposes for @dotlambda's PR at hand here, i agree with @r10s that making ".R" and ".Rnd" into exceptions when lower-casing might be an option to remedy the immediate issue at hand.

@dumblob
Copy link

dumblob commented Apr 17, 2022

i agree with @r10s that making ".R" and ".Rnd" into exceptions when lower-casing might be an option to remedy the immediate issue at hand.

Let me note that now that I (and some of my friends) know about this issue in DC, we actually do not need any urgent fix. So I'd prefer if this got solved properly without workarounds despite the fix taking (much) more time to implement.

@dumblob
Copy link

dumblob commented Nov 7, 2022

Friendly ping if there is anything new going on here 😉.

@flub
Copy link
Member

flub commented Nov 9, 2022

@r10s how are attachments even exposed on the C API? I guess this is really where this matters, internally it does not so much. I had a quick look but couldn't find this in dc_msg_t

@flub
Copy link
Member

flub commented Nov 9, 2022

Personally I could be found for somehow entirely preserving original file names. Most MUAs behave this way, and OSes have always had to deal with whatever crazy filenames come out of this. Most people will have files named .JPG and .JPEG somewhere on their computer or device. I'm not sure deltachat needs to have an opinion on what is valid. It can just pass along some bytes.

That of course is slightly more work, but there is already the context of potentially removing the blobdir in favour of storing blobs in some other storage mechanism.

@r10s
Copy link
Member

r10s commented Nov 9, 2022

how are attachments even exposed on the C API?

dc_msg_get_file(), dc_chat_get_profile_image(), dc_get_config(selfavatar) are the ones for reading blob files.

EDIT: leaving the name completely as is, however, would only be doable if we store things not directly in the file system; sth. that may come along with iroh, however, this is a bit farer away and needs some more things to consider. so, maybe for now, chose the most pragmatic approach that does not make things more difficult in the future.

so, yes, maybe not lowercasing as suggesting in this pr is okay, however, more things need adaptions then and required no-case-comparison. some of these places are mentioned above, at least these should be targeted before merging - and other potential places should be checked.

for using blobs via ffi or jsonrpc: i think that this is not a huge issue, i agree that most instances processing the files futher will deal with the weird name variants.

@dumblob
Copy link

dumblob commented Nov 10, 2022

This is probably not the best place to ask - but why let this functionality depend on IPFS/iroh (now or in the future)? Or have I misunderstood?

I mean, IPFS protocol is known to be bandwidth-hungry (a significant issue for mobile world), rather high-latency and actually have other downsides (NAT traversal is still not mature as we speak etc.) compared to e.g. faster and generally better Hypercore (which shall also have slightly nicer API).

@hpk42
Copy link
Contributor

hpk42 commented Dec 4, 2022

This is probably not the best place to ask - but why let this functionality depend on IPFS/iroh (now or in the future)? Or have I misunderstood?

This issue is not about IPFS, so it's indeed better discussed elsewhere. At this point, there are only experiments like #3489 to introduce parts of libp2p and IPFS for multi-device setup. Iroh is a Rust implementation that performs a lot better. As far as i know there is no fully working Hypercore Rust implementation so it's not an option (feel free to investigate). However, it's really better to discuss this out of this PR -- on community channels. Hope that clarifies a bit.

@hpk42
Copy link
Contributor

hpk42 commented Dec 4, 2022

EDIT: leaving the name completely as is, however, would only be doable if we store things not directly in the file system;

Managing blob metadata in the database would help keeping original filenames. The filenames of blob data could stay exactly as they are now, as internal sanitized names. By itself, it's currently probably not a high prio refactoring. However, keeping blob metadata would also help with deduplication. On my chat account setup, deduplication would single-handedly save 11% of storage. Just by having multiple rows in a hypothetical blobmeta table point to the (however named) blob-content file.

@dumblob dumblob mentioned this pull request Dec 5, 2022
5 tasks
@hpk42
Copy link
Contributor

hpk42 commented Dec 6, 2022

k, so i created this issue #3817 and close this PR.

@hpk42 hpk42 closed this Dec 6, 2022
@flub
Copy link
Member

flub commented Dec 8, 2022

@dotlambda see also my comment if you like to address this in a shorter term. I indeed agree this doesn't need to wait on a storage rewrite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Attachment extension is lowercased for no reason
5 participants