Buffer compression writers #78

Merged: @rhpvorderman merged 4 commits into main from buffercompressionwriter on Oct 11, 2021

Conversation

@rhpvorderman (Collaborator) commented on Oct 5, 2021

Hi, I noticed some weird performance characteristics with single-threaded compressed writing while developing fastq filter.

These characteristics can easily be reproduced with the following script:

#! /usr/bin/env python3

import sys

import xopen

if __name__ == "__main__":
    threads = sys.argv[1]   # number of threads to pass to xopen
    in_file = sys.argv[2]   # uncompressed input file (e.g. a FASTQ file)
    out_file = sys.argv[3]  # output path; a .gz suffix selects gzip output
    with open(in_file, "rb") as in_file_h:
        with xopen.xopen(out_file, mode="wb", threads=int(threads),
                         compresslevel=1) as out_file_h:
            for line in in_file_h:
                out_file_h.write(line)  # one small write per line

This will default to using IGzipFile when threads=0 and use a PipedPythonIsalWriter when threads=1.
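
For instance, the returned writer type can be checked like this (a quick illustration, not part of the script above; it writes a small test file to the working directory):

import xopen

# Inspect which writer object xopen returns for each thread count
# (class names as described above; exact types depend on the installed
# xopen and python-isal versions).
for threads in (0, 1):
    with xopen.xopen("typecheck.fastq.gz", mode="wb", threads=threads,
                     compresslevel=1) as f:
        print(threads, type(f))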

Benchmarks:
No compression for comparison:

Benchmark #1: python xopen_write_lines.py 0  ~/test/big2.fastq ramdisk/out.fastq
  Time (mean ± σ):      2.816 s ±  0.129 s    [User: 2.075 s, System: 0.740 s]
  Range (min … max):    2.698 s …  3.148 s    10 runs

Using a piped python-isal writer:

Benchmark #1: python xopen_write_lines.py 1  ~/test/big2.fastq ramdisk/out.fastq.gz
  Time (mean ± σ):      4.965 s ±  0.090 s    [User: 8.599 s, System: 0.846 s]
  Range (min … max):    4.873 s …  5.186 s    10 runs

User time plus system time is roughly 9.4 seconds, about 6.6 seconds more than the ~2.8 seconds of CPU time it takes to write the uncompressed file. That is fair for a 1.6 GB file.

But now single-threaded (threads=0):

Benchmark #1: python xopen_write_lines.py 0  ~/test/big2.fastq ramdisk/out.fastq.gz
  Time (mean ± σ):     33.275 s ±  0.483 s    [User: 32.901 s, System: 0.369 s]
  Range (min … max):   32.671 s … 34.153 s    10 runs

That is a tremendous increase. It is due to GzipFile.write() being very expensive per call: besides compressing the data, every call also updates the uncompressed length and the CRC32 checksum, so there is a lot of overhead in Python calls.
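
Paraphrased from CPython's gzip module, every GzipFile.write() call does roughly the following (a simplified sketch; argument checking and memoryview handling omitted):

import zlib

# Simplified paraphrase of gzip.GzipFile.write(): every call compresses the
# chunk, updates the uncompressed size and the CRC32 checksum, and advances
# the offset, all in Python-level code.
def write(self, data):
    length = len(data)
    if length > 0:
        self.fileobj.write(self.compress.compress(data))
        self.size += length
        self.crc = zlib.crc32(data, self.crc)
        self.offset += length
    return length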

By wrapping the GzipFile in an io.BufferedWriter, the small writes are collected in a buffer and GzipFile.write() is only called once per 8 KB. That makes a huge difference:

Benchmark #1: python xopen_write_lines.py 0  ~/test/big2.fastq ramdisk/out.fastq.gz
  Time (mean ± σ):      8.822 s ±  0.235 s    [User: 8.485 s, System: 0.335 s]
  Range (min … max):    8.531 s …  9.168 s    10 runs

Note that this time the single-threaded solution uses less CPU overall than the piped writer (but more wall-clock time). This is the expected behavior: multithreading generally adds some overhead compared to a single thread.
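
For reference, here is a minimal sketch of the wrapping idea using only the standard library (not the exact xopen change; file names, records, and the 8 KB buffer size are placeholders):

import gzip
import io

# Wrap the gzip writer in an io.BufferedWriter so that the expensive
# GzipFile.write() runs once per filled buffer instead of once per record.
gzip_file = gzip.open("out.fastq.gz", mode="wb", compresslevel=1)
buffered = io.BufferedWriter(gzip_file, buffer_size=8 * 1024)  # 8 KB buffer

for record in (b"@read1\n", b"ACGT\n", b"+\n", b"IIII\n"):
    buffered.write(record)  # cheap: usually just appends to the buffer

buffered.close()  # flushes the buffer and closes the underlying GzipFile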

The change is not very invasive and should increase performance across a range of applications that write small records, most notably cutadapt, because FASTQ records are generally around 335 bytes with standard Illumina 150 bp reads.

EDIT: I noticed that with cutadapt 3.5 single-threaded xopen mode is impossible to reach: xopen(..., threads=0) can never be called with the current logic (it passes threads=1 when cores == 1, and when cores == 0 it uses all available cores, so xopen is called with threads=4). So cutadapt won't notice this change. I made a PR rectifying that. (Single-thread FTW!)

Commit by @rhpvorderman: This makes writing in small units such as lines or FASTQ records much faster.

@marcelm (Collaborator) left a comment:


Thanks! I had some time to look at this (but I did not re-do your measurements).

This is a bit strange because older xopen versions actually wrapped GzipFile objects in an io.BufferedWriter, and I remember it was quite important to have that for Python 2.7. I removed this in ce7af05 and 8fba6c6. I don't know anymore why I made the latter commit, but I tend to think it was because repr(gzip.open("file.gz", "w")) results in "<gzip _io.BufferedWriter name='file.gz' 0x7f31d6c2ea60>", which led me to think that there was no need to wrap this in another BufferedWriter anymore. But that is wrong, as I see now: the BufferedWriter in that repr refers to the "inner" GzipFile.fileobj holding the raw compressed data.

Good that you caught this! Not merging right now because I have a comment.

What is your thought regarding the Python bug you opened? Maybe that’s too easy, but shouldn’t gzip.open(..., "wb") do exactly what you do here, that is, wrap GzipFile in an io.BufferedWriter? It would be quite symmetric to what regular open("file", "wb") does – that already returns an io.BufferedWriter.
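
For illustration, the asymmetry is easy to see (standard-library behavior at the time of writing; a quick check, not part of the proposed change):

import gzip
import io

# open() in binary write mode already returns a buffered writer ...
with open("plain.bin", "wb") as f:
    print(type(f))                           # <class '_io.BufferedWriter'>

# ... while gzip.open() returns the GzipFile itself, so every write()
# call goes straight to the expensive GzipFile.write().
with gzip.open("plain.bin.gz", "wb") as g:
    print(type(g))                           # <class 'gzip.GzipFile'>
    print(isinstance(g, io.BufferedWriter))  # False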

(Review comment on src/xopen/__init__.py; outdated, resolved.)
@rhpvorderman (author):

> What is your thought regarding the Python bug you opened? Maybe that’s too easy, but shouldn’t gzip.open(..., "wb") do exactly what you do here, that is, wrap GzipFile in an io.BufferedWriter? It would be quite symmetric to what regular open("file", "wb") does – that already returns an io.BufferedWriter.

Yes, I considered that, but I didn't do it because it would create a massive backwards-compatibility problem: gzip.open is expected to return a gzip.GzipFile, and a lot of code may have been built on that assumption. So unfortunately that is a no-go.
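
For example, hypothetical downstream code like the following relies on the current return type and would break if gzip.open() started returning a BufferedWriter wrapper:

import gzip

# Hypothetical user code illustrating the compatibility concern: it assumes
# gzip.open() returns a GzipFile and touches GzipFile-specific attributes.
f = gzip.open("data.gz", "wb")
assert isinstance(f, gzip.GzipFile)  # fails if a BufferedWriter comes back
raw_fileobj = f.fileobj              # attribute that only GzipFile provides
f.close()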

@rhpvorderman (author):

Done. Thanks for the review!

(Review comment on src/xopen/__init__.py; outdated, resolved.)
@rhpvorderman merged commit d5b2f86 into main on Oct 11, 2021
@rhpvorderman deleted the buffercompressionwriter branch on October 11, 2021 at 12:53
@rhpvorderman (author):

A fix is now also available upstream: python/cpython#101251
