Map shared memory region as read-only in receiver #248

phil-opp · 2023-04-13T09:55:18Z

This ensures that the data is not modified, even if shared with Python as zero-copy arrow array. Modifying the data could result in undefined behavior since there might be other subscribers that read the data at the same time. With this PR, a SEGFAULT will occur when attempting to modify shared memory data.

~~Blocked on elast0ny/shared_memory#100.~~ I created a (temporary) fork of the crate with the name shared_memory_extended.

This ensures that the data is not modified, even if shared with Python as zero-copy arrow array. Modifying the data could result in undefined behavior since there might be other subscribers that read the data at the same time.

The maintainer of the `shared_memory` crate did not react to our PR, so I decided to create a (temporary) fork.

The shared memory image must not be modified because it might be sent to other receivers too. Trying to modify it will result in a SEGFAULT now that we map the shared memory region as read-only.

phil-opp · 2023-06-21T10:43:59Z

@haixuanTao I needed to add a .copy() call to the Python operator example because the cv2.rectangle and cv2.putText calls apparently try modify the image in-place. Otherwise a SEGFAULT occurs.

This seems like a common footgun. Is there any way that we can prevent modification of shared memory data through our Python API? Arrow arrays are normally read-only, but it seems like this property either lost after the numpy conversion or ignored by cv2.

haixuanTao · 2023-06-26T02:04:22Z

Hi philipp, thanks for raising this issue.

It seems to be an issue within the to_numpy methods of pyarrow, imo.

As a first inspection, within the Cython code of pyarrow the function pyarrow.array.to_numpy is defined as follows:

    def to_numpy(self, zero_copy_only=True, writable=False):
        """
        Return a NumPy view or copy of this array (experimental).


        By default, tries to return a view of this array. This is only
        supported for primitive arrays with the same memory layout as NumPy
        (i.e. integers, floating point, ..) and without any nulls.


        For the extension arrays, this method simply delegates to the
        underlying storage array.


        Parameters
        ----------
        zero_copy_only : bool, default True
            If True, an exception will be raised if the conversion to a numpy
            array would require copying the underlying data (e.g. in presence
            of nulls, or for non-primitive types).
        writable : bool, default False
            For numpy arrays created with zero copy (view on the Arrow data),
            the resulting array is not writable (Arrow data is immutable).
            By setting this to True, a copy of the array is made to ensure
            it is writable.


        Returns
        -------
        array : numpy.ndarray
        """
        cdef:
            PyObject* out
            PandasOptions c_options
            object values


        if zero_copy_only and writable:
            raise ValueError(
                "Cannot return a writable array if asking for zero-copy")


        # If there are nulls and the array is a DictionaryArray
        # decoding the dictionary will make sure nulls are correctly handled.
        # Decoding a dictionary does imply a copy by the way,
        # so it can't be done if the user requested a zero_copy.
        c_options.decode_dictionaries = not zero_copy_only
        c_options.zero_copy_only = zero_copy_only


        with nogil:
            check_status(ConvertArrayToPandas(c_options, self.sp_array,
                                              self, &out))


        # wrap_array_output uses pandas to convert to Categorical, here
        # always convert to numpy array without pandas dependency
        array = PyObject_to_object(out)


        if isinstance(array, dict):
            array = np.take(array['dictionary'], array['indices'])


        if writable and not array.flags.writeable:
            # if the conversion already needed to a copy, writeable is True
            array = array.copy()
        return array

https://github.com/apache/arrow/blob/3b06a4c9f3f8c646af09df7b4138eae50fa08a0d/python/pyarrow/array.pxi#L1486C1-L1543C21

Which by default consider that the array is not writable, but i'm not sure if the information is passed on to numpy, where the bug is raised.

I'm going to investigate more.

haixuanTao · 2023-06-26T02:32:47Z

So it seems that the numpy array created by pyarrow.array.to_numpy is rightly marked as not writable:

arrow_array.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False

And assigning data into it doesn't work:

>>> arrow_array[0]=1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only

But, it seems that cv2 doesn't take into account this flag. See the following example:

import pyarrow as pa
import numpy as np

array = pa.array(np.zeros((100))).to_numpy().reshape((10,10))

assert array.flags.writeable == False, "Array should be not writable"

cv2.rectangle(array,(0,0),(5,5),(255),2)
print(array)

array([[255., 255., 255., 255., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255., 255., 255., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255.,   0.,   0., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255.,   0.,   0., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255., 255., 255., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255., 255., 255., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255., 255., 255., 255., 255.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])

I'll raise the issue with the cv2 team

haixuanTao · 2023-06-26T02:41:53Z

I have opened the issue: opencv/opencv-python#859

phil-opp · 2023-07-21T10:29:07Z

The upstream issue was fixed by opencv/opencv#24026. So I guess we just need to wait for a new release now?

haixuanTao · 2023-07-21T11:10:13Z

you mean a opencv 4.9 release right?

phil-opp · 2023-07-21T11:16:41Z

Yes, and a corresponding opencv-python release.

The question is whether we want to merge this PR anyway without waiting.A SEGFAULT is better than silent data corruption in my opinion...

haixuanTao · 2023-07-21T11:37:04Z

I'm ok, with merging this before the next opencv release that seems to not happen in the short term: https://github.com/opencv/opencv/milestone/54 .

I think that ultimately, we need to provide the right documentation. But the opencv dependency isn't really a direct dependency of dora-rs. It's a "user" dependency. So we should not delay our planned feature.

I will have to adapt code on dora-drives.

Base automatically changed from arrow-fixes to main April 18, 2023 03:56

phil-opp force-pushed the shmem-read-only branch from 89f4d4b to 43091bd Compare June 14, 2023 12:11

phil-opp added 2 commits June 21, 2023 09:53

Map shared memory region as read-only in receiver

f4498a4

This ensures that the data is not modified, even if shared with Python as zero-copy arrow array. Modifying the data could result in undefined behavior since there might be other subscribers that read the data at the same time.

Patch shared-memory dependency until PR is merged upstream

5ba0366

phil-opp force-pushed the shmem-read-only branch from 43091bd to 5ba0366 Compare June 21, 2023 07:53

phil-opp added 2 commits June 21, 2023 10:34

Switch to forked shared_memory_extended crate

9795736

The maintainer of the `shared_memory` crate did not react to our PR, so I decided to create a (temporary) fork.

Don't try to modify image in-place in python operator dataflow example

aee0512

The shared memory image must not be modified because it might be sent to other receivers too. Trying to modify it will result in a SEGFAULT now that we map the shared memory region as read-only.

phil-opp marked this pull request as ready for review June 21, 2023 10:39

phil-opp requested a review from haixuanTao June 21, 2023 10:44

haixuanTao approved these changes Jul 21, 2023

View reviewed changes

haixuanTao merged commit 53198b8 into main Jul 21, 2023
30 checks passed

haixuanTao deleted the shmem-read-only branch July 21, 2023 12:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Map shared memory region as read-only in receiver #248

Map shared memory region as read-only in receiver #248

phil-opp commented Apr 13, 2023 •

edited

Loading

phil-opp commented Jun 21, 2023

haixuanTao commented Jun 26, 2023 •

edited

Loading

haixuanTao commented Jun 26, 2023

haixuanTao commented Jun 26, 2023

phil-opp commented Jul 21, 2023

haixuanTao commented Jul 21, 2023

phil-opp commented Jul 21, 2023

haixuanTao commented Jul 21, 2023 •

edited

Loading

Map shared memory region as read-only in receiver #248

Map shared memory region as read-only in receiver #248

Conversation

phil-opp commented Apr 13, 2023 • edited Loading

phil-opp commented Jun 21, 2023

haixuanTao commented Jun 26, 2023 • edited Loading

haixuanTao commented Jun 26, 2023

haixuanTao commented Jun 26, 2023

phil-opp commented Jul 21, 2023

haixuanTao commented Jul 21, 2023

phil-opp commented Jul 21, 2023

haixuanTao commented Jul 21, 2023 • edited Loading

phil-opp commented Apr 13, 2023 •

edited

Loading

haixuanTao commented Jun 26, 2023 •

edited

Loading

haixuanTao commented Jul 21, 2023 •

edited

Loading