Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map shared memory region as read-only in receiver #248

Merged
merged 4 commits into from
Jul 21, 2023
Merged

Conversation

phil-opp
Copy link
Collaborator

@phil-opp phil-opp commented Apr 13, 2023

This ensures that the data is not modified, even if shared with Python as zero-copy arrow array. Modifying the data could result in undefined behavior since there might be other subscribers that read the data at the same time. With this PR, a SEGFAULT will occur when attempting to modify shared memory data.

Blocked on elast0ny/shared_memory#100. I created a (temporary) fork of the crate with the name shared_memory_extended.

Base automatically changed from arrow-fixes to main April 18, 2023 03:56
This ensures that the data is not modified, even if shared with Python as zero-copy arrow array. Modifying the data could result in undefined behavior since there might be other subscribers that read the data at the same time.
The maintainer of the `shared_memory` crate did not react to our PR, so I decided to create a (temporary) fork.
The shared memory image must not be modified because it might be sent to other receivers too. Trying to modify it will result in a SEGFAULT now that we map the shared memory region as read-only.
@phil-opp phil-opp marked this pull request as ready for review June 21, 2023 10:39
@phil-opp
Copy link
Collaborator Author

@haixuanTao I needed to add a .copy() call to the Python operator example because the cv2.rectangle and cv2.putText calls apparently try modify the image in-place. Otherwise a SEGFAULT occurs.

This seems like a common footgun. Is there any way that we can prevent modification of shared memory data through our Python API? Arrow arrays are normally read-only, but it seems like this property either lost after the numpy conversion or ignored by cv2.

@phil-opp phil-opp requested a review from haixuanTao June 21, 2023 10:44
@haixuanTao
Copy link
Collaborator

haixuanTao commented Jun 26, 2023

Hi philipp, thanks for raising this issue.

It seems to be an issue within the to_numpy methods of pyarrow, imo.

As a first inspection, within the Cython code of pyarrow the function pyarrow.array.to_numpy is defined as follows:

    def to_numpy(self, zero_copy_only=True, writable=False):
        """
        Return a NumPy view or copy of this array (experimental).


        By default, tries to return a view of this array. This is only
        supported for primitive arrays with the same memory layout as NumPy
        (i.e. integers, floating point, ..) and without any nulls.


        For the extension arrays, this method simply delegates to the
        underlying storage array.


        Parameters
        ----------
        zero_copy_only : bool, default True
            If True, an exception will be raised if the conversion to a numpy
            array would require copying the underlying data (e.g. in presence
            of nulls, or for non-primitive types).
        writable : bool, default False
            For numpy arrays created with zero copy (view on the Arrow data),
            the resulting array is not writable (Arrow data is immutable).
            By setting this to True, a copy of the array is made to ensure
            it is writable.


        Returns
        -------
        array : numpy.ndarray
        """
        cdef:
            PyObject* out
            PandasOptions c_options
            object values


        if zero_copy_only and writable:
            raise ValueError(
                "Cannot return a writable array if asking for zero-copy")


        # If there are nulls and the array is a DictionaryArray
        # decoding the dictionary will make sure nulls are correctly handled.
        # Decoding a dictionary does imply a copy by the way,
        # so it can't be done if the user requested a zero_copy.
        c_options.decode_dictionaries = not zero_copy_only
        c_options.zero_copy_only = zero_copy_only


        with nogil:
            check_status(ConvertArrayToPandas(c_options, self.sp_array,
                                              self, &out))


        # wrap_array_output uses pandas to convert to Categorical, here
        # always convert to numpy array without pandas dependency
        array = PyObject_to_object(out)


        if isinstance(array, dict):
            array = np.take(array['dictionary'], array['indices'])


        if writable and not array.flags.writeable:
            # if the conversion already needed to a copy, writeable is True
            array = array.copy()
        return array

https://github.com/apache/arrow/blob/3b06a4c9f3f8c646af09df7b4138eae50fa08a0d/python/pyarrow/array.pxi#L1486C1-L1543C21

Which by default consider that the array is not writable, but i'm not sure if the information is passed on to numpy, where the bug is raised.

I'm going to investigate more.

@haixuanTao
Copy link
Collaborator

So it seems that the numpy array created by pyarrow.array.to_numpy is rightly marked as not writable:

arrow_array.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : False
  ALIGNED : True
  WRITEBACKIFCOPY : False

And assigning data into it doesn't work:

>>> arrow_array[0]=1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only

But, it seems that cv2 doesn't take into account this flag. See the following example:

import pyarrow as pa
import numpy as np

array = pa.array(np.zeros((100))).to_numpy().reshape((10,10))

assert array.flags.writeable == False, "Array should be not writable"

cv2.rectangle(array,(0,0),(5,5),(255),2)
print(array)
array([[255., 255., 255., 255., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255., 255., 255., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255.,   0.,   0., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255.,   0.,   0., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255., 255., 255., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255., 255., 255., 255., 255., 255.,   0.,   0.,   0.],
       [255., 255., 255., 255., 255., 255.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.]])

I'll raise the issue with the cv2 team

@haixuanTao
Copy link
Collaborator

I have opened the issue: opencv/opencv-python#859

@phil-opp
Copy link
Collaborator Author

The upstream issue was fixed by opencv/opencv#24026. So I guess we just need to wait for a new release now?

@haixuanTao
Copy link
Collaborator

you mean a opencv 4.9 release right?

@phil-opp
Copy link
Collaborator Author

Yes, and a corresponding opencv-python release.

The question is whether we want to merge this PR anyway without waiting.A SEGFAULT is better than silent data corruption in my opinion...

@haixuanTao
Copy link
Collaborator

haixuanTao commented Jul 21, 2023

I'm ok, with merging this before the next opencv release that seems to not happen in the short term: https://github.com/opencv/opencv/milestone/54 .

I think that ultimately, we need to provide the right documentation. But the opencv dependency isn't really a direct dependency of dora-rs. It's a "user" dependency. So we should not delay our planned feature.

I will have to adapt code on dora-drives.

@haixuanTao haixuanTao merged commit 53198b8 into main Jul 21, 2023
30 checks passed
@haixuanTao haixuanTao deleted the shmem-read-only branch July 21, 2023 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants