Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python CopyFrom segmentation fault on version 4.24.0 #13485

Closed
sfc-gh-mussakli opened this issue Aug 9, 2023 · 27 comments · Fixed by protocolbuffers/upb#1514
Closed

python CopyFrom segmentation fault on version 4.24.0 #13485

sfc-gh-mussakli opened this issue Aug 9, 2023 · 27 comments · Fixed by protocolbuffers/upb#1514
Assignees

Comments

@sfc-gh-mussakli
Copy link

What version of protobuf and what language are you using?
Version: v4.24.0
Language: Python upb

What operating system (Linux, Windows, ...) and version?
Repros on Linux and MacOS

What runtime / compiler are you using (e.g., python version or gcc version)
python 3.8
Installed Apple clang version 14.0.3 (clang-1403.0.22.14.1)

What did you do?
We have a proto message that fundamentally looks like this:

syntax = "proto3";

package message;

message TestMessage {
    optional int64 field_one = 1;
    optional int64 field_two = 2;
    map<int32, double> map_field = 3;
}

And we have a piece of python that tries to copy an empty version of one of these messages into the other:

message = message_pb2.TestMessage()
message2 = message_pb2.TestMessage()
message2.CopyFrom(message)

This is resulting on a segmentation fault on exactly the CopyFrom line.

This failure started with us picking up version 4.24.0 with yesterday's release. Reverting back to any version prior fixes the issue.

Unfortunately building the above example proto as an MVP did not repro the segmentation fault for me. If there are any diagnostics I can provide that would make root causing this easier, happy to help.

What did you expect to see
No seg fault, behavior as before.

What did you see instead?
Segmentation fault failure on CopyFrom

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

Anything else we should know about your project / environment

@sfc-gh-mussakli sfc-gh-mussakli added the untriaged auto added to all issues by default when created. label Aug 9, 2023
@arcra
Copy link

arcra commented Aug 9, 2023

I've seen this issue with python 3.9 and 3.11 as well. It's blocking our project's CI runs.

@fowles fowles added python 24.x and removed untriaged auto added to all issues by default when created. labels Aug 9, 2023
arcra added a commit to tensorflow/tensorboard that referenced this issue Aug 9, 2023
…uf#13485 (#6538)

We currently don't have an upper bound on the protobuf dependency, but
the new release broke our CI.
@anandolee
Copy link
Contributor

Not sure why but I can NOT reproduce in google3 internal CL with

message TestMessage {
    optional int64 field_one = 1;
    optional int64 field_two = 2;
    map<int32, double> map_field = 3;
}

message = message_pb2.TestMessage()
message2 = message_pb2.TestMessage()
message2.CopyFrom(message)

Nor opensource on my linux:

$ pip show protobuf
Name: protobuf
Version: 4.24.0
$ python
Python 3.11.4 (main, Jun  7 2023, 10:13:09) [GCC 12.2.0] on linux
>>> from google.protobuf.internal import api_implementation
>>> print(api_implementation.Type())
upb
>>> from google.protobuf import timestamp_pb2
>>> msg = timestamp_pb2.Timestamp()
>>> msg2 = timestamp_pb2.Timestamp()
>>> msg2.CopyFrom(msg)
>>> from google.protobuf import struct_pb2
>>> msg = struct_pb2.Struct()
>>> msg2 =  struct_pb2.Struct()
>>> msg2.CopyFrom(msg)

@arcra
Copy link

arcra commented Aug 9, 2023

Just a thought, could it be that the protos that we're intending to use are "compiled" with a previous protoc version, and the issue is that this new version throws in that case? In our project, it was failing when trying to use some protos from a TensorFlow dependency (which was released before this new protobuf release from yesterday).

@anandolee
Copy link
Contributor

Can you help to verify:
1, Does the segment fault still raise if not depend on TensorFlow.
2, Does the segment fault still raise for the message that do not have map:

message TestMessage {
    optional int64 field_one = 1;
    optional int64 field_two = 2;
}

message = message_pb2.TestMessage()
message2 = message_pb2.TestMessage()
message2.CopyFrom(message)

3, Does the segment fault still raise for struct_pb2.Struct()

from google.protobuf import struct_pb2
msg = struct_pb2.Struct()
msg2 =  struct_pb2.Struct()
msg2.CopyFrom(msg)

@sfc-gh-mussakli
Copy link
Author

Just a thought, could it be that the protos that we're intending to use are "compiled" with a previous protoc version, and the issue is that this new version throws in that case? In our project, it was failing when trying to use some protos from a TensorFlow dependency (which was released before this new protobuf release from yesterday).

This might actually be possibly how it's happening, as I could not produce a simple repro outside of our internal product.

Can you help to verify: 1, Does the segment fault still raise if not depend on TensorFlow. 2, Does the segment fault still raise for the message that do not have map:

message TestMessage {
    optional int64 field_one = 1;
    optional int64 field_two = 2;
}

message = message_pb2.TestMessage()
message2 = message_pb2.TestMessage()
message2.CopyFrom(message)

3, Does the segment fault still raise for struct_pb2.Struct()

from google.protobuf import struct_pb2
msg = struct_pb2.Struct()
msg2 =  struct_pb2.Struct()
msg2.CopyFrom(msg)

I am working to answer your other questions, but I can confirm (1) that we do not depend on TensorFlow.

@sfc-gh-mussakli
Copy link
Author

sfc-gh-mussakli commented Aug 10, 2023

Can you help to verify: 1, Does the segment fault still raise if not depend on TensorFlow. 2, Does the segment fault still raise for the message that do not have map:

message TestMessage {
    optional int64 field_one = 1;
    optional int64 field_two = 2;
}

message = message_pb2.TestMessage()
message2 = message_pb2.TestMessage()
message2.CopyFrom(message)

3, Does the segment fault still raise for struct_pb2.Struct()

from google.protobuf import struct_pb2
msg = struct_pb2.Struct()
msg2 =  struct_pb2.Struct()
msg2.CopyFrom(msg)

Confirming for (3) that this does not fail with struct_pb2.Struct() and removing for (2), the segmentation fault still occurs with no map.

@anandolee
Copy link
Contributor

anandolee commented Aug 10, 2023

Thank you for these info. As both of you mentioned that the code is depending on old code gen, can you try if upgrade these gen files with newest protoc to see if it can resolve the issue?

@sfc-gh-mussakli
Copy link
Author

Thank you for these info. As both of you mentioned that the code is depending on old code gen, can you try if upgrade these gen files with newest protoc to see if it can resolve the issue?

This gives us some more insights, that might inform debugging this issue.

We use rules_proto Bazel library to build our protobufs. The latest version uses protobuf 21.10: https://rules-proto-grpc.com/en/latest/changelog.html#id5.

Upgrading to the latest version of rules_proto and building the proto with that still causes the segmentation fault when CopyFrom is called. My suspicion is that this is the root cause. Perhaps you can repro by compiling the proto with an older version like 21.10, and then running python with protobuf library version 4.24.0.

@arcra
Copy link

arcra commented Aug 10, 2023

Thank you for these info. As both of you mentioned that the code is depending on old code gen, can you try if upgrade these gen files with newest protoc to see if it can resolve the issue?

Are you asking this just to validate that the issue is not present if we do that? Or as a proposed solution for users encountering this issue?

Note that a lot of times (like in our case with the TensorFlow dependency) we're depending on a package that is not owned by us, generally, I don't thin it's feasible to require that from every user that gets broken by this change. Rather, this would restrict projects from using newer versions of protobuf libraries unless all dependencies also migrate to the latest version.

For validation, given that the protos are from dependencies we don't own, I expect you would have more context on what a "packaged" version of a proto would look like, or how to produce and use it, so it should be easier for you to find a way to produce some minimal dependency with the older version, and use it with a newer version, using virtualenv or something like this?

@anandolee
Copy link
Contributor

Understand that you may not able to upgrade the protoc version. The "produce some minimal dependency with the older version" under design currently.

Mert mentioned "This failure started with us picking up version 4.24.0 with yesterday's release. Reverting back to any version prior fixes the issue." Can you help to confirm that if 4.23.4 do not produce the segment fault and also make sure 4.23.4 is using upb python.

from google.protobuf.internal import api_implementation
print(api_implementation.Type())

If 4.23.4 works, it will make us easier to identify which change may trigger the issue

@arcra
Copy link

arcra commented Aug 10, 2023

For us, version 4.23.4 works fine.

Our work around was to restrict the protobuf dependency to be < 4.24: tensorflow/tensorboard@734988d, and I can see in the logs of our CI runs that indeed, protobuf==4.23.4 was used.

@anandolee
Copy link
Contributor

anandolee commented Aug 10, 2023

If it is using protoc 4.21.10, I think it is not related to old gen code. protoc 4.21.10 and 4.24.0 do not have differences on upb python generated code.

Just compared with runtime 4.23.4 and 4.24.0:
In 4.23.4 the CopyFrom() is a wrapper that calls SerializeToString() And ParseFromString(). For performance, 4.24.0 changed CopyFrom() to copy from memories. I think this is a cause.

However we still do not have enough info to repo which makes us hard to go forward....

As Mert said "I could not produce a simple repro outside of our internal product." Wondering if Adrian can provide a repo?

@sfc-gh-mussakli
Copy link
Author

For us, version 4.23.4 works fine.

Our work around was to restrict the protobuf dependency to be < 4.24: tensorflow/tensorboard@734988d, and I can see in the logs of our CI runs that indeed, protobuf==4.23.4 was used.

+1'ing this that 4.23.4 worked fine on our end. This code is stressed daily, and the immediate hours after 4.24.0 release is when failures started happening.

@anandolee, sounds like you have a good clue about what might be causing the crash.

@freewilll
Copy link

I am having this problem too.

@sfc-gh-mussakli
Copy link
Author

@anandolee, please let us know how we can help with getting a fix for this issue. We are happy to test private releases on our end to verify a hypothetical fix.

copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Aug 15, 2023
…ffers/protobuf#13485

We should go forward with memory copy once able to repo

PiperOrigin-RevId: 557178075
copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Aug 15, 2023
…ffers/protobuf#13485

We will go forward with memory copy in 25.0

PiperOrigin-RevId: 557178075
@anandolee
Copy link
Contributor

anandolee commented Aug 15, 2023

We can rollback the change to serialize/parse wrap in 24.1 release to temporary unblock projects. But main will keep the same to use memory copy and 25.0 release will back to memory copy. This means the bug will return in 25.0 if we don't find an adequate repro in time.

Wondering if you can continue help us with repro for this issue?

Also, can you help to verify if protocolbuffers/upb#1486 work?

@sfc-gh-mussakli
Copy link
Author

@anandolee, I can try to verify if that fix works. Can you share instructions on how to install a private release containing that fix?

@sfc-gh-mussakli
Copy link
Author

The change has been merged and it will be release in 24.1 soon.

Main will keep to use memory copy and 25.0 release will back to memory copy

OK, we will test if 24.1 fixes this bug and that can be the verification that this is the root cause of the crash.

@sfc-gh-mussakli
Copy link
Author

@anandolee We have recently tested the crashing test on version 4.24.1 and I can confirm that the crash does not repro with this CopyFrom optimization disabled.

You said that 4.25.0 will re-introduce this memory copy. Does that mean that the bug is slated to be re-introduced as-is, or are you working on an implementation that won't seg fault?

@fowles
Copy link
Member

fowles commented Aug 21, 2023

We are hoping to figure out the root cause and fix it in the intervening time. If you can try to repro against protobuf mainline that would be helpful

@sfc-gh-mussakli
Copy link
Author

We are hoping to figure out the root cause and fix it in the intervening time. If you can try to repro against protobuf mainline that would be helpful

Happy to help, I'm assuming you're suggesting building the upb backend locally and testing against that. Are there dev instructions anywhere to guide with that?

@sfc-gh-mussakli
Copy link
Author

Good news, I have a repro 🎉 .

Turns out, the issue isn't to do with copying one empty initialized message into another. It's calling CopyFrom on an unset optional field.

Here is what I have:

syntax = "proto3";

package message;

message BigMessage {
    optional SurroundingMessage surrounding_message = 1;
}

message SurroundingMessage {
    optional TestMessage test_message = 1;
}

message TestMessage {
    optional int64 field_one = 1;
    optional int64 field_two = 2;
    map<int32, double> map_field = 3;
}

I have built this via protoc -I=. --python_out=. ./message.proto

Then, I execute the following test:

import message_pb2

big_message = message_pb2.BigMessage(
    surrounding_message=message_pb2.SurroundingMessage(),
)

test_message = message_pb2.TestMessage()
test_message.CopyFrom(big_message.surrounding_message.test_message)

You should get a segmentation fault on 4.24.0, and successful execution on any other version.

@anandolee
Copy link
Contributor

Thank you Mert! I can reproduce the error with your new example! Will work on a real fix!

@sfc-gh-psteijn
Copy link

Thank you Mert! I can reproduce the error with your new example! Will work on a real fix!

Thanks for the update Jie and Matt!

copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Aug 24, 2023
copybara-service bot pushed a commit that referenced this issue Aug 24, 2023
…eep copy

from memory

fix #13485

PiperOrigin-RevId: 559643466
copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Aug 24, 2023
copybara-service bot pushed a commit that referenced this issue Aug 24, 2023
…eep copy

from memory

fix #13485

PiperOrigin-RevId: 559643466
copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Aug 24, 2023
copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Aug 24, 2023
copybara-service bot pushed a commit that referenced this issue Aug 24, 2023
…nstead of deep copy

from memory

fix #13485

PiperOrigin-RevId: 559867160
copybara-service bot pushed a commit that referenced this issue Aug 24, 2023
…nstead of deep copy

from memory

fix #13485

PiperOrigin-RevId: 559867160
copybara-service bot pushed a commit that referenced this issue Aug 24, 2023
…nstead of deep copy

from memory

fix #13485

PiperOrigin-RevId: 559867160
copybara-service bot pushed a commit that referenced this issue Aug 24, 2023
…nstead of deep copy

from memory

fix #13485

PiperOrigin-RevId: 559888172
@anandolee
Copy link
Contributor

anandolee commented Aug 25, 2023

The issue should been fixed by #1514
25.0 will back to memory copy and no crash for copy the unsigned empty sub message

@sfc-gh-mussakli
Copy link
Author

@anandolee Thank you, this is great to hear!

arcra added a commit to tensorflow/tensorboard that referenced this issue Oct 23, 2023
protocolbuffers/protobuf#13485 was fixed with protocolbuffers/upb#1514,
so this shouldn't be an issue anymore.

If CI passes successfully with this change, then it should be fine to
merge.
clrpackages pushed a commit to clearlinux-pkgs/R-tensorflow that referenced this issue Feb 7, 2024
…sion 2.15.0

Adrian RC (10):
      Restricts protobuf dependency to < 4.24 due to protocolbuffers/protobuf#13485 (#6538)
      Drop support for Python 3.8
      Add 2.14.0 relnotes to RELEASE.md
      Bump version at HEAD to 2.15.0a0
      Adds explicit dependency on `six` package. (#6581)
      Updates script to sync TF compat protos into TB repo.
      Updates compat TF protos by running our update.sh script.
      Pin TF version used by CI to 2.15.0.dev20231011 (closest to their branch cut)
      Add 2.15.0 relnotes to RELEASE.md
      TensorBoard 2.15.0

Brian Dubois (14):
      Tweak some patch and upgrade documentation. (#6520)
      Change return type of list_hyperparameters() operation. (#6537)
      Hparams: Generate metric_infos from data provider session groups. (#6539)
      Hparams: Fix metric info generation for sessions without run names. (#6541)
      Hparams: Generate metric values for data provider-based session groups. (#6543)
      Hparams: Fix internal Builder failure for test. (#6545)
      Hparams: Allow sessions without name to pull all run,tag combinations as metrics. (#6546)
      Filter metric values in /list_session_groups. (#6550)
      Import Fix: Remove DebuggerState from AppState. (#6554)
      Hparams: Support excluding metric information in HTTP requests. (#6556)
      Migrate hparams to use gtag.js-based logging for Google Analytics. (#6586)
      Hparams: Generate map of runs to hparams by matching with session names. (#6600)
      Give fixed width to runs table's checkbox and color columns. (#6637)
      Chore: Remove markdown freewisdom reference. (#6640)

James (8):
      MDC Migration: Apply tb theme to new mat components (#6562)
      Data Table: enable data table by default (#6563)
      HParams: Default Runs table sorting to ascending (#6587)
      HParams: Make icons buttons in data table header (#6594)
      HParams: Do not scroll regex filter in the horizontal direction (#6596)
      Chore: Remove wheel from requirements.txt (#6570)
      Add data-id attribute to RunsDataTable rows for internal tests (#6610)
      MDC Migration: Merging feature branch with all new Angular Components Styled as desired. (#6631)

Joyce (1):
      Set Token Permissions for github workflows (#6583)

Leo Gallucci (1):
      Relax google-auth-oauthlib dependency to breaking changes (#6609)

Nick Groszewski (1):
      Update data server build step to run with ubuntu 20.04 and change platform tag (#6636)

Riley Jones (24):
      HParams: Create hparams data source to fetch data from the hparams plugin (#6535)
      HParams: Create hparams effects to trigger data to be fetched from the hparams plugin (#6540)
      HParams: Create filter dialog component (#6493)
      Bug Fix: Start using expect_angular_legacy_material_checkbox (#6547)
      Hparam: Add selectors to retrieve data written by the new hparams effects (#6544)
      HParams: Add new state property to store hparam and metric filters (#6553)
      Hparams: refactor arguments provided to `getDashboardRuns` (#6555)
      Chore: Change all copybara expects to legacy variants (#6549)
      Bug Fix: Update the port scanning logic to properly skip ports in use (#6560)
      HParams: Allow runs in the time series dashboard to be filtered by hparam (#6488)
      HParams: Stop fetching hparams data using run effects (#6561)
      Chore: Fix internal sync (#6566)
      Bug Fix change run effects to use combine latest instead of forkjoin (#6568)
      Revert HParams: Stop fetching hparams data using run effects (#6561) (#6571)
      Suggested Cards: Create data source for persisting card interactions to localStorage (#6439)
      HParams: Add effect to remove hparam filters when the related column is removed (#6572)
      HParams: Bug fix - removing a column closes the filter modal (#6589)
      HParams: Align sub context menus to the top of the button which opened them (#6590)
      HParams: Enable `hparamsInTimeSeries` by default (#6565)
      Chore: Fix CI breakage (#6593)
      Bug Fix: update the range input widget to only show the slider when min and max are different (#6591)
      Improve styling of context menus (#6592)
      HParams: Bug fix when adding table columns copy them rather than mutate them (#6597)
      Hparams: Fix repainting bug (#6642)

Stanley Bileschi (1):
      Remove markdown test for deprecated behavior. (#6634)

Yating (14):
      Hparams: Change `read_hyperparameters()` return type (#6542)
      Hparams: Add `include_in_result` field and implement support for `hparams_to_include` (#6559)
      Hparams: Support `limit` for `DataProvider.list_hyperparameters()`. (#6569)
      Hparams: Set interval domain fields for float summary hparams. (#6574)
      Add `last_value` to `ScalarTimeSeries` interface. (#6579)
      Hparams: Apply limit to hparams retrieved from protos with `_hparams_/experiment` tag. (#6577)
      Hparams: Populate `differs` field for TF summary hparams. (#6582)
      Hparams: fix the bug where the last hparam was discarded when there is no limit (#6584)
      Hparams: Sort and apply limit to hparams from runs. (#6588)
      Hparams: Sort by `differs` then by `name`. (#6601)
      Hparams: fix documentation of session and session group (#6613)
      Image Summary: Officially start using vectorized EncodePng Op (#6611)
      fix typos (#6622)
      Hparams: Update docstrings  (#6633)

jameshollyer (2):
      Add 2.14.1 relnotes to RELEASE.md
      fix release notes mistake
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment