Clarify that chunker sizes are in bytes #5923

Merged: 1 commit, Jan 21, 2019
14 changes: 9 additions & 5 deletions core/commands/add.go
@@ -78,12 +78,16 @@ You can now refer to the added file in a gateway, like so:

 The chunker option, '-s', specifies the chunking strategy that dictates
 how to break files into blocks. Blocks with same content can
-be deduplicated. The default is a fixed block size of
+be deduplicated. Different chunking strategies will produce different
Member:
Nit: We might want to move the pros/cons of chunking to a new paragraph (this is pretty choppy). However, your change is still an improvement so we can mess with that later if you'd like.

Contributor Author:
This sentence is one of the most important things to know about this option and chunking. I moved it up so people would read it early before they get distracted by all the details.

Member:
My complaint is that it's "fact, fact, fact..." with no relationship between the facts. But it didn't flow all that well before, so we can fix this later.

+hashes for the same file. The default is a fixed block size of
 256 * 1024 bytes, 'size-262144'. Alternatively, you can use the
-rabin chunker for content defined chunking by specifying
-rabin-[min]-[avg]-[max] (where min/avg/max refer to the resulting
-chunk sizes). Using other chunking strategies will produce
-different hashes for the same file.
+Rabin fingerprint chunker for content defined chunking by specifying
+rabin-[min]-[avg]-[max] (where min/avg/max refer to the desired
+chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.
+
+The following examples use very small byte sizes to demonstrate the
Member:
1k isn't that small and users definitely shouldn't use 2048*1024 (2MiB) chunks. Let's just explain the example (e.g., "For example, to chunk a file into 1KiB chunks...").

Contributor Author:
I'm never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it's 256 KiB–512 KiB–1 MiB. This was what this patch was intended to help clarify.

1 KiB is ridiculously small in this context. You don't ever want to go through DHT and peer discovery for a file that is split into a thousand 1 KiB chunks. The protocol overhead would be enormous. Keep in mind that you're also storing each individual chunk in a file that takes up a 256 KiB block on most file systems, so the disk packing wastage would also be enormous.
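As a rough sketch of the block-count gap described in the comment above (the 1 MiB file size is an assumption chosen purely for illustration, not a figure from the thread):

```go
package main

import "fmt"

func main() {
	// Hypothetical 1 MiB file, used only to illustrate the comment above.
	const fileSize = 1 << 20

	// Compare a 1 KiB chunk size against the 262144-byte (256 KiB) default.
	for _, chunkSize := range []int{1 << 10, 256 << 10} {
		blocks := (fileSize + chunkSize - 1) / chunkSize // ceiling division
		fmt.Printf("chunk size %6d bytes -> %4d blocks\n", chunkSize, blocks)
	}
}
```

At 1 KiB that hypothetical file becomes 1024 blocks to announce, fetch, and store, versus 4 blocks at the default size, which is the overhead being argued about.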

Member:
> You don't ever want to go through DHT and peer discovery for a file that is split into a thousand 1 KiB chunks.

You won't have to do that. You'll go through the DHT for the root node (which will likely be smaller than 1KiB as it doesn't contain any actual data).

> Keep in mind that you're also storing each individual chunk in a file that takes up a 256 KiB block on most file systems, so the disk packing wastage would also be enormous.

Most filesystems have 4KiB blocks. Also note that we're moving towards a datastore that doesn't store each chunk in a separate file. However, I do agree that 1KiB is small.

> I'm never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it's 256 KiB–512 KiB–1 MiB. This was what this patch was intended to help clarify.

Ah, yeah, you're right. We might want to just bump those sizes up instead.
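For reference, the Rabin invocation under discussion, written in the same style as the help text's examples (the file name is only illustrative), would be:

  > ipfs add --chunker=rabin-262144-524288-1048576 ipfs-logo.svg

Here 262144, 524288, and 1048576 bytes are 256 KiB, 512 KiB, and 1 MiB respectively, which is the bytes-versus-kilobytes point the patch is trying to make explicit.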

Member:
Otherwise, I'd be fine merging this as-is.

Contributor (@hinshun, Jan 30, 2019):
> Also note that we're moving towards a datastore that doesn't store each chunk in a separate file.

@Stebalien Where can I find more information about that?

+properties of the different chunkers on a small file. You'll likely
+want to use a 1024 times larger chunk sizes for most files.

 > ipfs add --chunker=size-2048 ipfs-logo.svg
 added QmafrLBfzRLV4XSH1XcaMMeaXEUhDJjmtDfsYU95TrWG87 ipfs-logo.svg
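To make the byte arithmetic behind these size names concrete, a minimal sketch (the constants simply restate figures already given in the help text and the discussion above; none of this is part of the patch):

```go
package main

import "fmt"

func main() {
	fmt.Println(256 * 1024)  // 262144 bytes: the 'size-262144' default chunk size
	fmt.Println(2048 * 1024) // 2097152 bytes (2 MiB): "1024 times larger" than the size-2048 demo
}
```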