Compression level revisit #808

Open

rhpvorderman opened this issue Sep 10, 2024 · 2 comments

@rhpvorderman (Collaborator)

Things have changed since #425:

  • ISA-L and zlib-ng now have fully functional Python bindings.
  • xopen integrates these bindings and automatically chooses an appropriate backend for compression (see the sketch after this list).
  • Both zlib-ng and ISA-L have a different compression-ratio-versus-time tradeoff than the original zlib.
  • Short-read adapter alignment has gotten slightly faster, so compression now makes up a larger share of the workload.
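
For context, a minimal sketch of what that looks like from Python (not cutadapt's actual code; compresslevel and threads are xopen keyword arguments, the file names are just placeholders):

```python
from xopen import xopen

# Recompress a FASTQ file at level 1; xopen selects the fastest available
# backend (ISA-L, zlib-ng, or plain zlib) for both reading and writing.
with xopen("in.fastq.gz", "rb") as src, \
        xopen("out.fastq.gz", "wb", compresslevel=1, threads=4) as dst:
    for chunk in iter(lambda: src.read(128 * 1024), b""):
        dst.write(chunk)
```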

Running the following command:

/usr/bin/time cutadapt --compression-level X -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ramdisk/out_r1.fastq.gz -p ramdisk/out_r2.fastq.gz ~/test/5millionreads_R1.fastq.gz ~/test/5millionreads_R2.fastq.gz && wc -c ramdisk/*.fastq.gz

Compression level                     runtime (s)   filesize (MiB)
5 (default)                           78.4          693
4                                     69.1          710
3                                     55.5          740
2                                     36.6          781
1                                     36.2          781
0 (no compression in gzip container)  31.8          3405
None (no gzip)                        31.0          3405

Relative to compression level 1

Compression level                     runtime   filesize
5 (default)                           2.17      0.89
4                                     1.91      0.91
3                                     1.53      0.95
2                                     1.01      1.00
1                                     1.00      1.00
0 (no compression in gzip container)  0.88      4.36
None (no gzip)                        0.86      4.36

Current defaults:

  • Cutadapt: 5
  • HTSlib: 5
  • GATK: 2 (also uses ISA-L)
  • dnaio: 1
@marcelm (Owner) commented Sep 13, 2024

I’ve recently dealt with an issue in strobealign that made me a bit more sensitive to the relative overhead introduced by compressed files. It turned out that merely decompressing (not even compressing) the input FASTQ was preventing us from using more than ~20 threads at a time. Someone contributed a PR that switches to ISA-L for decompression and moves the decompression into a separate thread. This now allows us to saturate 128 cores.
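
Roughly what that amounts to, sketched in Python for illustration (strobealign itself is C++, so this is not the actual PR; it assumes python-isal is installed): decompress with ISA-L's igzip in a dedicated thread and hand chunks to the consumer through a bounded queue.

```python
import queue
import threading

from isal import igzip  # python-isal, drop-in replacement for the gzip module


def _reader(path, q):
    # Runs in its own thread, so ISA-L decompression does not block the consumer.
    with igzip.open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(128 * 1024), b""):
            q.put(chunk)
    q.put(None)  # sentinel: end of stream


def consume(path):
    q = queue.Queue(maxsize=8)  # bounded, so the reader cannot run far ahead
    t = threading.Thread(target=_reader, args=(path, q), daemon=True)
    t.start()
    total = 0
    while (chunk := q.get()) is not None:
        total += len(chunk)  # the actual work (alignment) would happen here
    t.join()
    return total
```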

So I’m inclined to agree that the default compression level can be reduced further. What’s your suggestion?

(My view is, or maybe was, still a bit colored by the disk space quota limits I hit regularly. I guess I kind of want to help other people avoid those. But then I also see people storing totally uncompressed FASTQ and even SAM files ...)

@rhpvorderman (Collaborator, Author)

My gut feeling is to use about 10% of the compute time for compression and to compress as well as possible within that budget.
Using less than 10% of the compute time hardly makes a difference in the overall runtime; using more seems wasteful to me.
ISA-L's zlib compression manages that at around ~12% of the compute time while still producing quite small files, so it sort of hits the sweet spot for me. I always use the -Z flag. But I am probably one of the most biased guys on the internet when it comes to compression, so don't take my word for it ;-).
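
As a back-of-the-envelope check against the table above, treating the "None (no gzip)" run as a compression-free baseline (which also folds in plain write overhead, so these are only rough figures):

```python
# Fraction of wall time spent on compression, per level, from the table above.
baseline = 31.0  # seconds for the "None (no gzip)" run
runtimes = {5: 78.4, 4: 69.1, 3: 55.5, 2: 36.6, 1: 36.2}
for level, seconds in runtimes.items():
    share = (seconds - baseline) / seconds
    print(f"level {level}: {share:.0%} of wall time spent compressing")
# Levels 1-2 land around 14-15%; level 5 spends roughly 60% of the run compressing.
```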

The problem with gzip is that decompression can hardly be multithreaded. Other formats are a bit better at this. On the other hand, 1 GB/s decompression is already quite fast.

> Someone contributed a PR that switches to ISA-L for decompression and moves the decompression into a separate thread. This now allows us to saturate 128 cores.

Nice, for paired-end data that gives you a 2 GB/s input stream, right? That's a lot of data to run local alignment on. Do you already use any vectorized libraries for the Smith-Waterman?

> (My view is, or maybe was, still a bit colored by the disk space quota limits I hit regularly. I guess I kind of want to help other people avoid those. But then I also see people storing totally uncompressed FASTQ and even SAM files ...)

I can relate. Running out of disk space happens frequently here at our institute too. But the 10% extra compression of gzip level 5 compared to level 1 just isn't cutting it in that case. If I need to cut a whole WGS run into 4 batches to make sure I don't run into disk space issues, 10% is not helpful. 50% better compression (files that are 66% of the size) helps a lot, because then I can run just 3 batches. In the 4-batch case, I'd rather lose 10% of extra disk space if it means my jobs finish a lot faster, because then I can finish the project sooner. I concede that this viewpoint is very much coloured by my use case.
