Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that chunker sizes are in bytes #5923

Merged
merged 1 commit into from
Jan 21, 2019
Merged

Conversation

da2x
Copy link
Contributor

@da2x da2x commented Jan 13, 2019

Clarify that the Rabin fingerprint chunker is in bytes, and recommend using larger chunk sizes than shown in the provided examples. People are getting confused over chunker sizes and incorrectly assuming they're in kiB and not bytes because of the sizes used in the examples.

Clarify that the Rabin fingerprint chunker is in bytes, and recommend
using larger chunk sizes than shown in the provided examples.

People are getting confused over chunker sizes and incorrectly assuming
they're in kiB and not bytes.

License: MIT
Signed-off-by: Daniel Aleksandersen <code@daniel.priv.no>
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

rabin-[min]-[avg]-[max] (where min/avg/max refer to the desired
chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.
The following examples use very small byte sizes to demonstrate the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1k isn't that small and users definitely shouldn't use 2048*1024 (2MiB) chunks. Let's just explain the example (e.g., "For example, to chunk a file into 1KiB chunks...").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it’s 128 kiB–256 kiB–512 kiB. This was what this patch was intended to help clarify.

1 kiB is ridiculously small in this context. You don’t ever want to go through DHT and peer discovery for a file that is split into a thoiusand 1 kiB chunks. The protocol overhead would be enormous. Keep in mind that you’re also storing each individual chunk in a file that takes up a 256 kiB block on most file systems, so the disk packing wastage would also be enormous.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don’t ever want to go through DHT and peer discovery for a file that is split into a thoiusand 1 kiB chunks.

You won't have to do that. You'll go through the DHT for the root node (which will likely be smaller than 1KiB as it doesn't contain any actual data).

Keep in mind that you’re also storing each individual chunk in a file that takes up a 256 kiB block on most file systems, so the disk packing wastage would also be enormous.

Most filesystem have 4KiB blocks. Also note that we're moving towards a datastore that doesn't store each chunk in a separate file. However, I do agree that a 1KiB is small.

I’m never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it’s 128 kiB–256 kiB–512 kiB. This was what this patch was intended to help clarify.

Ah, yeah, you're right. We might want to just bump those sizes up instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, I'd be fine merging this as-is.

Copy link
Contributor

@hinshun hinshun Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that we're moving towards a datastore that doesn't store each chunk in a separate file.

@Stebalien Where can I find more information about that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -78,12 +78,16 @@ You can now refer to the added file in a gateway, like so:
The chunker option, '-s', specifies the chunking strategy that dictates
how to break files into blocks. Blocks with same content can
be deduplicated. The default is a fixed block size of
be deduplicated. Different chunking strategies will produce different
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We might want to move the pros/cons of chunking to a new paragraph (this is pretty choppy). However, your change is still an improvement so we can mess with that later if you'd like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is one of the most important things to know about this option and chunking. I moved it up so people would read it early before they get distracted by all the details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My complaint is that it's "fact, fact, fact..." with no relationship between the facts. But didn't flow all that well before so we can fix this later.

@@ -78,12 +78,16 @@ You can now refer to the added file in a gateway, like so:
The chunker option, '-s', specifies the chunking strategy that dictates
how to break files into blocks. Blocks with same content can
be deduplicated. The default is a fixed block size of
be deduplicated. Different chunking strategies will produce different
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My complaint is that it's "fact, fact, fact..." with no relationship between the facts. But didn't flow all that well before so we can fix this later.

rabin-[min]-[avg]-[max] (where min/avg/max refer to the desired
chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.
The following examples use very small byte sizes to demonstrate the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don’t ever want to go through DHT and peer discovery for a file that is split into a thoiusand 1 kiB chunks.

You won't have to do that. You'll go through the DHT for the root node (which will likely be smaller than 1KiB as it doesn't contain any actual data).

Keep in mind that you’re also storing each individual chunk in a file that takes up a 256 kiB block on most file systems, so the disk packing wastage would also be enormous.

Most filesystem have 4KiB blocks. Also note that we're moving towards a datastore that doesn't store each chunk in a separate file. However, I do agree that a 1KiB is small.

I’m never suggesting using 2 MiB chunks. rabin-262144-524288-1048576 is in bytes and not kilobytes, so it’s 128 kiB–256 kiB–512 kiB. This was what this patch was intended to help clarify.

Ah, yeah, you're right. We might want to just bump those sizes up instead.

rabin-[min]-[avg]-[max] (where min/avg/max refer to the desired
chunk sizes in bytes), e.g. 'rabin-262144-524288-1048576'.
The following examples use very small byte sizes to demonstrate the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, I'd be fine merging this as-is.

@Stebalien Stebalien merged commit 5b6f41a into ipfs:master Jan 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants