Compression level revisit #808
I’ve recently dealt with an issue in strobealign that made me a bit more sensitive to the relative overhead introduced by using compressed files. It turned out that decompressing (not even compressing) the input FASTQ was preventing us from using more than ~20 threads at a time. Someone contributed a PR that switches to using ISA-L for decompression and decompressing in a separate thread. This now allows us to saturate 128 cores. So I’m inclined to agree the default compression level can be reduced further. What’s your suggestion? (My view is or maybe was still a bit colored by the disk space quota limits I hit regularly. I guess I kind of want to help other people avoid those. But then I also see people storing totally uncompressed FASTQ and even SAM files ...)
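For illustration, here is a minimal sketch of the "decompress in a separate thread" idea, using Python's stdlib `gzip` as a stand-in for ISA-L (the function name, chunk size, and queue depth are made up for this sketch, not taken from the strobealign PR):

```python
import gzip
import queue
import threading

def threaded_reader(path, chunk_size=1 << 20, maxsize=4):
    """Yield decompressed chunks produced by a background thread.

    The bounded queue applies backpressure: the reader thread blocks
    once it is `maxsize` chunks ahead of the consumer, so decompression
    overlaps with downstream work instead of stalling it.
    """
    q = queue.Queue(maxsize=maxsize)

    def worker():
        with gzip.open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                q.put(chunk)
        q.put(None)  # sentinel: end of stream

    threading.Thread(target=worker, daemon=True).start()
    while (chunk := q.get()) is not None:
        yield chunk
```

The consumer just iterates over the generator; the GIL is released inside zlib's C code, so the two threads genuinely overlap.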
My gut feeling is to use about 10% of the compute time for the compression and compress as well as possible. The problem with gzip is that the decompression can hardly be multithreaded. Other formats are a bit better at this. On the other hand, 1 GB/s decompression is quite fast already.
Nice, for paired-end data that gives you a 2 GB/s input stream, right? That's a lot of data to run local alignment on. Do you use any vectorized libraries for the Smith-Waterman already?
I can relate. Running out of disk space happens frequently here at our institute too. But the 10% extra compression of gzip level 5 compared to level 1 is just not cutting it in that case. If I need to cut the whole WGS run into 4 batches to make sure I don't run into disk space issues, 10% is not helpful. 50% better compression (files that are 66% the size) helps a lot, because then I can run just 3 batches. In the case of 4 batches, I'd rather lose 10% extra disk space if it means my jobs finish a lot faster. It means I can finish the project faster. I concede that this viewpoint is very much coloured by my use case.
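The batch arithmetic above can be made explicit with a back-of-the-envelope sketch (the 400 GB run and 100 GB quota are invented numbers chosen to reproduce the 4-vs-3 batch scenario from the comment):

```python
import math

def batches_needed(total_gb, quota_gb, ratio):
    """Number of batches so each batch's compressed output fits the quota.

    `total_gb` is the compressed output size at the level-1 baseline;
    `ratio` is the compressed size relative to that baseline.
    """
    return math.ceil(total_gb * ratio / quota_gb)

# A 10% size reduction (ratio 0.90) still needs 4 batches,
# while files at 66% of the baseline size fit in 3.
print(batches_needed(400, 100, 1.00))  # 4
print(batches_needed(400, 100, 0.90))  # 4
print(batches_needed(400, 100, 0.66))  # 3
```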
Things have changed since #425:
Running the following command:
/usr/bin/time cutadapt --compression-level X -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ramdisk/out_r1.fastq.gz -p ramdisk/out_r2.fastq.gz ~/test/5millionreads_R1.fastq.gz ~/test/5millionreads_R2.fastq.gz && wc -c ramdisk/*.fastq.gz
Relative to compression level 1
Current defaults:
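The kind of size/time trade-off the benchmark above measures can be sketched with zlib on synthetic reads — this only illustrates the shape of the measurement and stands in for the real cutadapt run, so the numbers will differ from the table:

```python
import random
import time
import zlib

# Generate ~1 MB of synthetic FASTQ-like records with random sequence
# and quality strings, so the compressor has realistic entropy to chew on.
random.seed(0)

def fake_read(n=100):
    seq = "".join(random.choice("ACGT") for _ in range(n))
    qual = "".join(random.choice("#5:FGHI") for _ in range(n))
    return f"@read\n{seq}\n+\n{qual}\n".encode()

data = b"".join(fake_read() for _ in range(5000))

baseline = len(zlib.compress(data, 1))
for level in (1, 2, 5, 6):
    t0 = time.perf_counter()
    size = len(zlib.compress(data, level))
    elapsed = time.perf_counter() - t0
    print(f"level {level}: {size / baseline:.2%} of level-1 size, {elapsed:.3f}s")
```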