Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request - Is there potential to support bgzip? #92

Open
pettyalex opened this issue Feb 25, 2022 · 9 comments
Open

Feature request - Is there potential to support bgzip? #92

pettyalex opened this issue Feb 25, 2022 · 9 comments

Comments

@pettyalex
Copy link

pettyalex commented Feb 25, 2022

This is a clever solution to get the decompression / compression into a separate process, very helpful little tool. This pattern would work well for a couple additional compression tools that my group uses. I'm going to take a look at adding support for these formats, which should be pretty straightforward.

BGZF, or "blocked gzip" is a format that's used pretty widely in bioinformatics, it's basically a lot of gzipped files concatenated together, with some extra info in the headers and an index in a separate file saying where to seek. It's decompressible by normal gzip, so we actually see bgzf files as .gz more often than .bgz. It'd be really great to be able to compress bgz files with xopen as well. The blocked gzip reference implementation is distributed with htslib as a binary called bgzip, and is available both from conda and most linux distros native packages (tabix on Ubuntu, for example).

Also, it'd be great to see this support zstd as well, which is just an excellent general purpose compression tool that I expect to rapidly grow in usage in the next few years.

Edit: To be clear, both of these tools are already usable from Python, there's a bgzip implementation here, and zstd has excellent Python bindings available, but getting the compression into another process like xopen does makes for much better performance.

@marcelm
Copy link
Collaborator

marcelm commented Feb 25, 2022

There’s a draft pull request for zstd support in #91, please also see the discussion over there.

Regarding bgzip support, I think compression support is fine.

Regarding decompression (which I realize you didn’t ask for), I think the support we already have for gzip is sufficient (we have tests that ensure we can decompress concatenated gzip files). I would argue that exposing the seeking capabilities of bgzip wouldn’t be in scope for xopen because the idea is to provide an interface that lets one abstract away the used compression algorithm.

@rhpvorderman
Copy link
Collaborator

rhpvorderman commented Feb 25, 2022

I would like to add to that, that BGZF compression is only advantageous when the input data is aligned properly with the BGZF blocks. In the case that you not care for that alignment, it is faster to compress with normal gzip compression. It is impossible for xopen to know how the input data should be aligned. (The interface handles pure bytes objects, not HTS records). EDIT: 2024-02-16, I have since found out that this statement is false. The data does not need to be properly aligned and can simply be streamed. It is still advantegeous as it can still be indexed.

If you know beforehand that you do want BGZIP, there is also no need for xopen's dynamic choosing of compression algorithm. So you might be better of using subprocess and bgzip that way.

As regards to:

Python bindings available, but getting the compression into another process like xopen does makes for much better performance.

Pysam has a threads option when opening its files. That will already probably give more performance, since it does not have to manage open processes like xopen does. It also has the advantage of knowing exactly how the output blocks should be aligned, so the resulting compressed files can be indexed.

@pettyalex
Copy link
Author

For our specific usage case that I want this right now, it doesn't seem like Pysam has the interface for what we're wanting to do: bgzip output to create a .vcf.bgz file and associated tabix index, because downstream tools are expecting bgzipped vcfs with indexes.

Pysam has a threads option when opening its files. That will already probably give more performance, since it does not have to manage open processes like xopen does. It also has the advantage of knowing exactly how the output blocks should be aligned, so the resulting compressed files can be indexed.

I'm going to do some quick performance testing to see how using threads inside of the main Python process does vs opening another process and piping uncompressed data to it like xopen does. My gut would actually say that xopen's approach is faster, just because in the 1st approach, the process will either be copying data in from Python or compressing it in the pysam threadpool, but not both at the same time because of the GIL, while xopen allows Python to send data via pipe without blocking unless the pipe fills up.

I'm taking zstd support discussion to that open PR.

@rhpvorderman rhpvorderman changed the title Feature request - Is there potential to support bgzip and/or zstd? Feature request - Is there potential to support bgzip? Feb 25, 2022
@rhpvorderman
Copy link
Collaborator

Have you tried using PipedCompressionWriter with program_args = ["bgzip", "-i"]. Wouldn't that already solve your problem?

@pettyalex
Copy link
Author

That looks like it would, I'll validate this.

Would the xopen project be willing to make that support native, so that if you open a .bgz for writing, or a .gz and specify a bgzf or bgz format it will call bgzip?

@rhpvorderman
Copy link
Collaborator

For xz, bz2 and `gz`` xopen can fall back on modules from Python's standard library if command line applications are not present. There is no such thing for bgzip. Which means xopen can't guarantee that it can open bgzip files. Unless we would add a dependency that can read bgzip from python. At that point I think it is already quite a lot of work to get some support going and we can only support the features that bgzip has in common with the other formats. (so indexing can not be supported). The reason we can not do this, is that it would make the API more complicated with some features that only apply to bgzip. That would be a loss for the project, in my opinion.

Also the whole problem xopen tries to solve is that you don't know what output the user expects: stdout? gzip? uncompressed? Xopen infers that for you. If you already know what format you want (indexed bgzip in your case), then it is much less overhead to simply let the program output that format, rather than rely on xopen.

@marcelm What do you think?

@marcelm
Copy link
Collaborator

marcelm commented Feb 28, 2022

@pettyalex Is it correct that you want an indexed bgz file, that is, that you would want to run bgzip with -i?

For me, the question is whether we can add support for bgzip without adding another parameter to the xopen() function that only applies to bgzip. As Ruben said, we shouldn’t make the API more complicated.

The easiest case is that we just never create an index (run bgzip without -i).

Always generating an index is the other option, but I wonder whether we wouldn’t get requests to enable setting the index name (bgzip option -I).

I’m not too worried about bgzip not being available in the standard library. If bgzip isn’t installed, we just print an error and it’s the responsibility of whoever uses xopen to ensure that bgzip is in their list of dependencies.

Whatever we end up doing, wo should at least document the PipedCompressionWriter and PipedCompressionReader classes better.

@pettyalex
Copy link
Author

Is it correct that you want an indexed bgz file, that is, that you would want to run bgzip with -i?

Yes, absolutely. For our workflows, we need indexes everywhere that we'd be using bgzip, as bgzip itself performs worse both speed and compression-wise vs other options. The index is its only advantage. The default index name would also be appropriate in every situation we'd want to use this with.

@ghuls
Copy link

ghuls commented Jan 10, 2023

Yes, absolutely. For our workflows, we need indexes everywhere that we'd be using bgzip, as bgzip itself performs worse both speed and compression-wise vs other options. The index is its only advantage. The default index name would also be appropriate in every situation we'd want to use this with.

Do you use with the threading option? bgzip -@ 4 .... Also if you compile bgzip, you need to make sure you have libdeflate installed as it is much faster than standard zlib.
Some old zlib benchmarks by HTSlib (bgzip people): http://www.htslib.org/benchmarks/zlib.html

BGZF can be used without index. BCF2FASTQ uses it to compress FASTQ files, where multiple chunks can be compressed at the same time in different threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants