
image.rotate Segfault with TensorFlow 2.2.0rc0 and tf-nightly #1298

Closed
MarkDaoust opened this issue Mar 13, 2020 · 21 comments
MarkDaoust commented Mar 13, 2020

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):

    • Colab
    • MacOs 10.15.3
  • TensorFlow version and how it was installed (source or binary):

    • !pip install tensorflow==2.2.0rc0
    • !pip install tf-nightly==* # All nightlies I can see are affected.
  • TensorFlow-Addons version and how it was installed (source or binary):

    • pip install tensorflow_addons==0.8.3
    • pip install tfa-nightly
  • Python version:

    • Colab: 3.6.9
    • macOS: 3.7.5
  • Is GPU used? (yes/no):

    • No.

Describe the bug

tfa.image.rotate segfaults.

It's the only tfa op I tried.

Code to reproduce the issue

Colab:

https://drive.google.com/open?id=1T5le-3aLhzsxa62uFGOsc6UpzzTEwN3a

Easy to reproduce locally on my macbook.

Other info / logs

See notebook.

I followed it in python up to this line: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/eager/execute.py#L59

@MarkDaoust

We're working on the 2.2 release right now, so it would be good to determine if this is a problem in tensorflow, or a problem in tfa.

seanpmorgan commented Mar 13, 2020

Thanks @MarkDaoust, we've also seen issues with users running tf-nightly. I'll have some time to look into this over the weekend.

@yifeif @gunan @perfinion @angerson Does anything come to mind that would break compatibility? TFA links against libtensorflow of TF2.1, but I believe we'll still be shipping manylinux2010 wheels built with dt7 for 2.2? Not sure why we would be getting segfaults.

gabrieldemarmiesse commented Mar 13, 2020

@seanpmorgan Since the ABI is not compatible from TF 2.1.0 to 2.2.0, isn't it completely normal that we get segfaults? We compile the nightly against 2.1, so it won't work with 2.2. We just need to recompile, right? Or am I missing something here?

When running the notebook with !pip install tfa-nightly I get the warning:

This version of TensorFlow Addons requires TensorFlow 2.1.0.
Detected an installation of version 2.2.0-dev20200201.

While some functions might work, TensorFlow Addons was not tested
with this TensorFlow version. Also custom ops were not compiled
against this version of TensorFlow. If you use custom ops,
you might get errors (segmentation faults for example).

It might help you to fallback to pure Python ops with
TF_ADDONS_PY_OPS . To do that, see
https://github.com/tensorflow/addons#gpucpu-custom-ops

If you encounter errors, do *not* file bugs in GitHub because
the version of TensorFlow you are using is not supported.

So I don't really get why this segfault is a problem.
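As a side note, the TF_ADDONS_PY_OPS fallback mentioned in that warning can be pictured as a small dispatcher between pure-Python ops and compiled kernels. This is only a sketch with stand-in implementations, not the actual TFA code:

```python
import os

def use_python_ops() -> bool:
    # Sketch: fall back to pure-Python ops when the TF_ADDONS_PY_OPS
    # environment variable is set, avoiding the compiled custom ops
    # that can segfault against a mismatched TF.
    return os.environ.get("TF_ADDONS_PY_OPS") is not None

def dispatch(py_impl, custom_op_impl, *args):
    """Pick the pure-Python implementation or the compiled kernel."""
    impl = py_impl if use_python_ops() else custom_op_impl
    return impl(*args)

# Stand-in implementations; in TFA these would be e.g. a composed-ops
# rotate versus the kernel loaded from a shared object.
result = dispatch(lambda x: ("py", x), lambda x: ("custom", x), 42)
```

The point of such a switch is that the pure-Python path never touches the compiled `.so`, so it keeps working even when the ABI does not match.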

seanpmorgan commented Mar 13, 2020

@seanpmorgan Since the ABI is not compatible from TF 2.1.0 to 2.2.0, isn't it completely normal that we get segfaults? We compile the nightly against 2.1, so it won't work with 2.2. We just need to recompile, right? Or am I missing something here?

So to my knowledge, TF2.2 and TF2.1 should be compiled in the same way (devtoolset7/manylinux2010 compliant). Linking to the same ops shouldn't raise a segfault. We would like our package to be compatible with multiple versions of TF so that, for example, our 0.8.3 build can be compatible with tensorflow>=2.1.

For the moment the version check is the best way to prevent a poor user experience, but it shouldn't be required.

We didn't have this issue when moving from 2.0 -> 2.1. The reason for the incompatibility at that point was that we started using the @tf.keras.utils.register_keras_serializable decorator, which wasn't available in TF2.0.
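The strict version check mentioned above might be sketched like this (a hypothetical helper mirroring the quoted warning, not the actual TFA code):

```python
import warnings

REQUIRED_TF = "2.1.0"  # version the shipped custom ops were compiled against

def check_tf_version(installed: str) -> bool:
    """Warn unless the installed TF release matches exactly.

    Sketch of a strict check: pre-release strings such as
    "2.2.0-dev20200201" count as mismatches too.
    """
    ok = installed == REQUIRED_TF
    if not ok:
        warnings.warn(
            f"This version of TensorFlow Addons requires TensorFlow "
            f"{REQUIRED_TF}. Detected an installation of version "
            f"{installed}. Custom ops may segfault."
        )
    return ok
```

A check this strict trades flexibility for safety: it also flags patch releases that would in fact be ABI-compatible, which is part of what the rest of the thread is trying to pin down.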

@gabrieldemarmiesse

I see, I thought that even if the same toolchain was used, there was no guarantee that the ABI would stay compatible if the C++ code in tensorflow changed. Thanks for the explanation.

seanpmorgan commented Mar 13, 2020

I see, I thought that even if the same toolchain was used, there was no guarantee that the ABI would stay compatible if the C++ code in tensorflow changed. Thanks for the explanation.

Yeah, there's no ABI compatibility between compiler toolchains. When someone uses TFA with conda-compiled TF, we get a symbol mismatch because the ops are mangled differently (among other possible issues).

If the C++ code in tensorflow is changed, it could break if not backwards compatible, but the ops should all be public and stable between minor versions.

AakashKumarNain commented Mar 13, 2020

When someone uses TFA with conda-compiled TF we get a symbol mismatch because the ops are mangled differently (among other possible issues).

Another reason why there should be a standard way (which IMO should be conda) going forward, so that we don't run into such issues.

seanpmorgan commented Mar 13, 2020

Another reason why there should be a standard way(which IMO should be conda) going forward so that we don't run into such issues

Agreed, tensorflow/community#133 is the fix for that, but I haven't heard any updates in a while. If that doesn't happen, though, then conda-first would be great; but there are still too many Python users not on conda to exclude them.

@gabrieldemarmiesse

So, in case it helps:

I tried with tensorflow-cpu to see if we could get the segfault.

pip install tensorflow-cpu==2.1.0
python configure.py --no-deps
bash tools/install_so_files.sh
pip install tensorflow-cpu==2.2.0rc0
pytest tensorflow_addons/image
# segfault here

But recompiling works:

pip install tensorflow-cpu==2.2.0rc0
python configure.py --no-deps
bash tools/install_so_files.sh
pytest tensorflow_addons/image
# tests are passing

I opened a pull request to allow us to transition gracefully to 2.2.0 without having to make an enormous pull request at once: #1303

I hope this helps; my knowledge of C++/compiling is limited.

seanpmorgan commented Mar 14, 2020

Thanks, very helpful! If I'm unable to find a resolution to this, we'll likely need to backport the strict version check to 0.8 prior to a 0.9 release pinned to 2.2.

seanpmorgan commented Mar 14, 2020

So I tested using gdb for an image op and an activation op, and they both crash on the same call. Unsure if this is just the first call or the only cause (both are ABSL string calls):

pip install tensorflow_addons
pip install tensorflow==2.2rc0

# debug_gelu.py
import tensorflow as tf
import tensorflow_addons as tfa

x = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=tf.dtypes.float32)
tfa.activations.gelu(x, True)

gdb python
run debug_gelu.py

yields:

[New Thread 0x7ffea3fff700 (LWP 31150)]
[New Thread 0x7ffe9bfff700 (LWP 31151)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff9806c6e8 in tensorflow::AttrSlice::Find(absl::string_view) const () from /home/sean-morgan/miniconda3/envs/addons-2.2/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so.2

To confirm whether linking to libtensorflow_framework.so.2 works regardless of which version it was compiled against, I installed TFA==0.6.0 (compiled on TF2.0) and ran the same script above with TF2.1 installed. It completed successfully:

pip install tensorflow_addons==0.6.0 --no-deps
pip install tensorflow==2.1
python debug_gelu.py
tf.Tensor([-0.04540229 -0.158808    0.          0.841192    1.9545977 ], shape=(5,), dtype=float32)

@mihaimaruseac I was wondering if you could weigh in on this?

  1. Is it a correct assumption that linking to libtensorflow_framework.so.2 should be stable if compiled against TF2.1 and linked at runtime against TF2.2?
  2. Have there been any changes to ABSL that might have caused this? I know there were some RFCs for TF strings, but I'm not sure what's been changed in 2.2.
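For what it's worth, the gelu output in the working run above is plain math and can be checked without TF at all: with the second argument set to True, gelu uses the tanh approximation. A pure-Python sketch (not the compiled TFA kernel):

```python
import math

def gelu_tanh(x: float) -> float:
    # tanh approximation of GELU, as used when approximate=True:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

values = [gelu_tanh(v) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

These values match the tensor printed by the TF2.1 run, which is a handy sanity check that the crash lives in the op loading/linking, not in the math itself.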

@seanpmorgan

FWIW, I checked the activation ops compiled against TF2.0: they work successfully with TF2.0 or TF2.1 but fail on TF2.2rc0. At this point I'm pretty confident there was a breaking change in the 2.2 libtensorflow.

@mihaimaruseac

AFAIK we don't have compatibility guarantees for libtensorflow, so we cannot link against a libtensorflow built at a different version.

@seanpmorgan

AFAIK we don't have compatibility guarantees for libtensorflow, so we cannot link against a libtensorflow built at a different version.

Thanks! That's rather unfortunate because it means users who want new features from Addons will be required to upgrade their TF core version which is a tough ask for many users. I'll make a post on developers group discussion to describe this issue and see what we can do going forward.

@gabrieldemarmiesse

@mihaimaruseac If I understand correctly, tensorflow won't give any backward-compatibility guarantees about the ABI until tensorflow/community#133 is implemented, is that right? Is that the case too when the minor version is increased? If we compile against TF 2.1.0, will it work with 2.1.1?

seanpmorgan reopened this Mar 14, 2020
@mihaimaruseac

@gabrieldemarmiesse A patch-number increase should be compatible.

There are multiple RFCs for modularization and to add ABI guarantees. For example, I'm working on tensorflow/community#101 and the general ideas are in tensorflow/community#77
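The rule above (patch releases compatible, minor releases not) amounts to comparing only the major.minor components of the two versions. A small sketch of that predicate:

```python
def abi_compatible(compiled_against: str, runtime: str) -> bool:
    # Per the discussion: a patch bump (2.1.0 -> 2.1.1) keeps the ABI
    # compatible, a minor bump (2.1 -> 2.2) does not. Pre-release
    # suffixes such as "-dev20200201" are stripped before comparing.
    def major_minor(version: str):
        return version.split("-")[0].split(".")[:2]
    return major_minor(compiled_against) == major_minor(runtime)
```

Under this rule, ops compiled against 2.1.0 would be accepted with 2.1.1 at runtime but rejected with 2.2.0 or any 2.2 nightly, which matches the segfault pattern seen in this thread.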

seanpmorgan commented Mar 14, 2020

Just to be clear: if there are breaking changes to libtensorflow as minor versions increase, then ABI-stable kernels (tensorflow/community#133) will still crash, right? We're seeing this issue when compiling with the same toolchain, so the ABI should be the same; it's just that the underlying op has changed.

AakashKumarNain commented Mar 14, 2020

That's rather unfortunate because it means users who want new features from Addons will be required to upgrade their TF core version which is a tough ask for many users.

Agreed, especially if the user has compiled TF from source

@failure-to-thrive

They could also change something in their public API. For example, TFA creates a helper object declared in the "outdated" TF 2.1 headers, but TF 2.2 doesn't expect it, of course. C++ name mangling does not catch this.

@gabrieldemarmiesse

I'm going to close this issue, as the bug was addressed and we can't do anything more about it. I'm going to open another one about how to provide a better UX for the users affected by the instability of the ABI.
