pyvips.Image.new_from_array from np.mmap vs hdf5 #492

Open · rgluskin opened this issue Aug 8, 2024 · 7 comments

rgluskin commented Aug 8, 2024

Hi.
I'm performing manipulations on WSIs and saving intermediate results in HDF5.
The end results are being saved to a new WSI using pyvips.Image.new_from_array.

However, when running on very large slides I get OOM exceptions.

I did a quick experiment and saw that saving the intermediate results in an np.memmap instead doesn't cause this issue.
From the pyvips code it seems that the difference stems from the __array_interface__ vs __array__ attributes (a memmap has both, but an HDF5 dataset only has the latter). However, the fields that are actually accessed, such as dtype, are present on the HDF5 dataset anyway.
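
For illustration (the file names here are just placeholders), the attribute difference can be checked like this:

import h5py
import numpy as np

arr = np.memmap("example.dat", dtype="uint8", mode="w+", shape=(16, 16, 4))
print(hasattr(arr, "__array_interface__"))   # True: pyvips can wrap the buffer directly

with h5py.File("example.h5", "w") as f:
    dset = f.create_dataset("value_map", (16, 16, 4), dtype="uint8")
    print(hasattr(dset, "__array_interface__"))   # False: pyvips falls back to obj.__array__()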

What would be your recommendation?
Should I rewrite my whole code to use np.memmap?
Or is there a more elegant approach?

Thanks in advance.

jcupitt (Member) commented Aug 8, 2024

Hi @rgluskin,

Could you share some sample code that shows the problem? How are you saving in HDF5? Is this with write_to_file('xxx.mat'), or some numpy feature?

As far as I know, HDF5 is not a good format for large images; it needs huge amounts of memory. I would use (maybe) uncompressed pyramidal tiled TIFF, or perhaps jpeg-compressed, depending on the image size.
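
For instance, a sketch of what writing an intermediate that way might look like with pyvips (the black image is just a stand-in for a real intermediate result):

import pyvips

image = pyvips.Image.black(1024, 1024, bands=3)   # stand-in for a real intermediate result
# uncompressed tiled pyramidal TIFF
image.tiffsave("intermediate.tif", tile=True, pyramid=True, bigtiff=True)
# or jpeg-compressed, which is much smaller on disk
image.tiffsave("intermediate_jpeg.tif", tile=True, pyramid=True, bigtiff=True, compression="jpeg", Q=85)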

Are you sure you need to save intermediates? If you stick to pyvips, you can do most processing with no need for intermediates. What operation is forcing you to save?

rgluskin (Author) commented Aug 8, 2024

I'm running various processing steps, including classical vision and deep learning pipelines.
HDF5 is used to store the results on disk, not in memory.
This is the gist of the code; I think it's enough to reproduce the issue:

with h5py.File(hdf5_fname, "w") as heatmap_file:
    heatmap_file.create_dataset(
        "value_map",
        (self.height, self.width, n_channels),
        dtype=dtype,
        chunks=self.HDF5_CHUNK_SIZE + (n_channels,),
        compression="gzip",
        fillvalue=0.0,
    )
self.heatmap_file = h5py.File(hdf5_fname, "r+")
self.value_map = self.heatmap_file["value_map"]
...
...

full_scale_image = self.value_map
...
...
# Tif resolution is pixels/cm, openslide resolution is um/pixel. VIPS interprets the parameter as pixels/mm
vips_res = 1e3 / resolution_umpp
pyvips.Image.new_from_array(full_scale_image, interpretation="rgb").write_to_file(
    file_path, pyramid=True, tile=True, bigtiff=True, xres=vips_res, yres=vips_res, compression="lzw"
)

Disk size is not an issue; the problem is only having to load the whole image into RAM (when return cls.new_from_array(obj.__array__()) is executed in vimage.py).

jcupitt (Member) commented Aug 8, 2024

> HDF5 is used to store the results on disk, not in memory.

I think loading HDF5 from disc will need a lot of memory, won't it? And it'll use amazing amounts of disc space -- your WSI scans will probably have been through jpeg already, so there's little chance of quality loss, I'd think.

You need to make a complete test program I can run that shows the problem. Otherwise I'll waste 30 minutes making a test myself, do something slightly different from your code, and not hit the same issue. Or that's what usually happens to me :(

rgluskin (Author) commented Aug 8, 2024

Sorry, here's the full code to reproduce:

import h5py
import pyvips

hdf5_fname = "temp.hdf5"
file_path = "temp.tiff"
resolution_umpp = 0.25
height = 100000
width = 192000
n_channels = 4
dtype = 'uint8'
HDF5_CHUNK_SIZE = (128, 128)

# create a large, chunked, gzip-compressed HDF5 dataset on disk
with h5py.File(hdf5_fname, "w") as heatmap_file:
    heatmap_file.create_dataset(
        "value_map",
        (height, width, n_channels),
        dtype=dtype,
        chunks=HDF5_CHUNK_SIZE + (n_channels,),
        compression="gzip",
        fillvalue=0.0,
    )
# re-open the file and hand the h5py Dataset straight to pyvips
heatmap_file = h5py.File(hdf5_fname, "r+")
value_map = heatmap_file["value_map"]

full_scale_image = value_map

# Tif resolution is pixels/cm, openslide resolution is um/pixel. VIPS interprets the parameter as pixels/mm
vips_res = 1e3 / resolution_umpp
pyvips.Image.new_from_array(full_scale_image, interpretation="rgb").write_to_file(
    file_path, pyramid=True, tile=True, bigtiff=True, xres=vips_res, yres=vips_res, compression="lzw"
)

rgluskin (Author) commented Aug 8, 2024

And this snippet, which uses np.memmap instead, works:

import h5py
import numpy as np
import pyvips

hdf5_fname = "temp.hdf5"
file_path = "temp.tiff"
resolution_umpp = 0.25
height = 100000
width = 192000
n_channels = 4
dtype = 'uint8'

# a raw memmap exposes __array_interface__, so pyvips can wrap it without copying
value_map = np.memmap(hdf5_fname, dtype=dtype, mode='w+', shape=(height, width, n_channels))


full_scale_image = value_map

# Tif resolution is pixels/cm, openslide resolution is um/pixel. VIPS interprets the parameter as pixels/mm
vips_res = 1e3 / resolution_umpp
pyvips.Image.new_from_array(full_scale_image, interpretation="rgb").write_to_file(
    file_path, pyramid=True, tile=True, bigtiff=True, xres=vips_res, yres=vips_res, compression="lzw"
)

My default course of action would be to refactor all such HDF5 usages to use np.memmap instead,
unless you see a more elegant solution (e.g. somehow recognizing that shape and dtype are present in the HDF5 dataset and reading the data from there lazily).
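
If it does come to that refactor, one possible bridge is a small helper (hdf5_to_memmap is a hypothetical name, not part of pyvips or h5py) that copies the HDF5 dataset into a raw np.memmap strip by strip, so the full image never has to sit in RAM at once:

import h5py
import numpy as np

def hdf5_to_memmap(dset, raw_fname, strip=1024):
    # stream the dataset into a raw memmap file one strip of rows at a time,
    # so only `strip` rows are ever held in RAM
    out = np.memmap(raw_fname, dtype=dset.dtype, mode="w+", shape=dset.shape)
    for y in range(0, dset.shape[0], strip):
        out[y:y + strip] = dset[y:y + strip]
    out.flush()
    return out

with h5py.File("temp.hdf5", "r") as f:
    value_map = hdf5_to_memmap(f["value_map"], "value_map.raw")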

jcupitt (Member) commented Aug 8, 2024

Great! Thanks for that. I tried:

$ VIPS_PROGRESS=1 ./rgluskin.py 
mapping ...
making vips image ...
writing ...
rgluskin.py temp-2: 192000 x 100000 pixels, 32 threads, 192000 x 1 tiles, 640 lines in buffer
rgluskin.py temp-2: 24% complete
...

And watched it run in top. It allocated 78 GB of VIRT at the start, then as the save executed, RES slowly crept up until, when the save ticker reached 100%, it was equal to VIRT.

I think this is probably unavoidable with HDF5 files opened via numpy. Does it have to be HDF5? TIFF (for example) should work well, e.g. with your test file in TIFF format I see:

$ VIPS_PROGRESS=1 /usr/bin/time -f %M:%e  vips copy temp.tiff x.tif[tile,pyramid,compression=jpeg]
vips temp-6: 192000 x 100000 pixels, 32 threads, 128 x 128 tiles, 640 lines in buffer
vips temp-6: done in 151s          
memory: high-water mark 1.02 GB
1368380:151.39

1.3 GB of peak memory use. Q85 jpeg should be no worse than LZW (assuming your WSI scanner outputs something like SVS), and much faster.
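
For reference, roughly the same conversion from Python would presumably look like this (parameters taken from the CLI line above):

import pyvips

# sequential access streams the source TIFF instead of loading it whole
img = pyvips.Image.new_from_file("temp.tiff", access="sequential")
img.write_to_file("x.tif", tile=True, pyramid=True, compression="jpeg", Q=85)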

rgluskin (Author) commented Aug 8, 2024 via email
