[BUG] Serialization bugs can cause node drops #624

Closed
Bukhtawar opened this issue Apr 28, 2021 · 7 comments · Fixed by #1885
Labels: bug, distributed framework

Comments

@Bukhtawar
Collaborator

Describe the bug
Backward-incompatible changes to APIs (most commonly stats APIs) can cause serialization/deserialization (ser/de) over the transport wire to fail.

To Reproduce

  1. If there is a serialization bug, it manifests itself in InboundHandler#messageReceived, causing it to throw an IllegalStateException.

  2. The exception is then caught at TcpTransport#handleException, which ends up closing the specific TCP channel that caused the serialization issue (mostly the REG channel).

  3. Each channel has a close listener attached which, when the channel is closed, closes all channels, thereby removing the connection (see the sketch after this list).

  4. NodeChannels is actually a TransportConnection implementation, which again has a close listener attached that removes it from the list of connected nodes.

  5. At this point the connection target is gone and requests see a NodeNotConnectedException.

  6. NodeConnectionsService has a periodic ConnectionChecker, which then tries to re-establish all connections if a single connection target closes.
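A minimal sketch of the cascade in steps 2-4, with made-up listener wiring (hypothetical types, not the actual TcpTransport/NodeChannels code): one channel failing on a deserialization error closes every channel of the connection and drops the node from the connected-nodes list.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only; the types here are hypothetical stand-ins.
public class CascadingCloseSketch {

    interface Channel {
        void close();
        void addCloseListener(Runnable listener);
    }

    static class SimpleChannel implements Channel {
        private final List<Runnable> closeListeners = new CopyOnWriteArrayList<>();
        private final AtomicBoolean closed = new AtomicBoolean();

        @Override
        public void close() {
            if (closed.compareAndSet(false, true)) {
                closeListeners.forEach(Runnable::run); // notify listeners exactly once
            }
        }

        @Override
        public void addCloseListener(Runnable listener) {
            closeListeners.add(listener);
        }
    }

    /** Connection to a node: closing any member channel closes them all (step 3). */
    static class NodeConnection {
        NodeConnection(List<Channel> channels, Runnable removeFromConnectedNodes) {
            AtomicBoolean removed = new AtomicBoolean();
            for (Channel channel : channels) {
                // Step 3: a single channel close closes every other channel.
                channel.addCloseListener(() -> channels.forEach(Channel::close));
                // Step 4: the connection-level listener drops the node exactly once.
                channel.addCloseListener(() -> {
                    if (removed.compareAndSet(false, true)) {
                        removeFromConnectedNodes.run();
                    }
                });
            }
        }
    }

    public static void main(String[] args) {
        List<Channel> channels = List.of(new SimpleChannel(), new SimpleChannel(), new SimpleChannel());
        new NodeConnection(channels, () -> System.out.println("node removed from connected nodes"));
        channels.get(1).close(); // step 2: one bad channel drops the whole connection
    }
}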

Expected behavior
Node disconnections and reconnections are too intrusive; ser/de exceptions could be handled more gracefully.

Bukhtawar added the bug label on Apr 28, 2021
@anasalkouz
Member

Hi @Bukhtawar, could you elaborate more on this issue? We don't know whether it is actually happening or whether you think it could happen. Are you able to reproduce it, or is this something you anticipate will break based on your understanding of the code?

@Bukhtawar
Collaborator Author

Bukhtawar commented Nov 16, 2021

I have observed this happening while making a backward-incompatible change, and it can also be reproduced with a small code change on one node that serializes a message in a format that cannot be read or deserialized on the other node.

I can see if I can write an integration test that demonstrates this.
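A self-contained sketch of the kind of mismatch involved, using plain java.io streams rather than the actual OpenSearch StreamInput/StreamOutput API: the "new" writer serializes an extra field the "old" reader never consumes, leaving unconsumed bytes on the stream, which is exactly the condition behind "Message not fully read ...; resetting".

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Standalone illustration only; not the real OpenSearch wire format.
public class UnconsumedBytesDemo {

    static byte[] writeNewFormat() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(42L);          // field both versions know about
            out.writeUTF("extra-stat");  // new field the old reader never reads
        }
        return bytes.toByteArray();
    }

    static long readOldFormat(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            long value = in.readLong();  // old reader stops here
            System.out.println("unconsumed bytes: " + in.available()); // > 0 triggers the check
            return value;
        }
    }

    public static void main(String[] args) throws IOException {
        readOldFormat(writeNewFormat()); // prints "unconsumed bytes: 12"
    }
}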

@reta
Collaborator

reta commented Nov 18, 2021

@Bukhtawar I think this is related to elastic/elasticsearch#75334, right?

@Bukhtawar
Collaborator Author

Looks similar to me @reta

@reta
Collaborator

reta commented Dec 29, 2021

@Bukhtawar do you mind if I pick this one up? I have a few ideas in mind, thank you.

@reta
Collaborator

reta commented Jan 12, 2022

@Bukhtawar a bit more detail regarding the (de)serialization handling inside InboundHandler:

  1. If there is a serialization bug, it manifests itself in InboundHandler#messageReceived, causing it to throw an IllegalStateException.

Interestingly, it is only raised when the input stream has unconsumed bytes; the case when the stream has fewer bytes than needed to reconstruct the payload is not classified as a serialization issue (resulting in java.io.EOFException). More importantly, there is a difference depending on whether the message is a request or a response (a simplified sketch of this asymmetry follows the list):

  • if the message is a request, then all exceptions are caught inside handleRequest; they are not escalated to TcpTransport#handleException and an error response is returned
  • if the message is a response, then there are two outcomes:
    • when the input stream has unconsumed bytes, an IllegalStateException is thrown and escalated up to TcpTransport#handleException, which ends up closing the specific TCP channel that caused the serialization issue; an example is below
     [2022-01-12T09:09:57,119][WARN ][o.o.t.TcpTransport       ] [...] exception caught on transport layer [Netty4TcpChannel{localAddress=/127.0.0.1:58454, remoteAddress=127.0.0.1/127.0.0.1:9300}], closing connection
     java.lang.IllegalStateException: Message not fully read (response) for requestId [131], handler [org.opensearch.transport.TransportService$ContextRestoreResponseHandler/org.opensearch.transport.TransportService$6/[cluster:monitor/state]:org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1@35ecd191/org.opensearch.action.support.ChannelActionListener@798eebdf], error [false]; resetting
             at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:166) ~[opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
             at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:108) ~[opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
             at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:759) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
             at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:170) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
             at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:145) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
             at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
             at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:94) [transport-netty4-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
             at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.69.Final.jar:4.1.69.Final]
             at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.69.Final.jar:4.1.69.Final]
             at java.lang.Thread.run(Thread.java:833) [?:?]
     [2022-01-12T09:09:57,123][INFO ][o.o.c.c.Coordinator      ] [...] master node [{runTask-0}{jQ-T77-VRrua5dCDqczv3A}{XWRZk1ptRHKYMteMm2bWyQ}{127.0.0.1}{127.0.0.1:9300}{dimr}{testattr=test}] failed, restarting discovery
     org.opensearch.transport.NodeDisconnectedException: [runTask-0][127.0.0.1:9300][disconnected] disconnected
    
    
    • when the input stream throws java.io.EOFException (or any other issue manifests during deserialization), it is wrapped into a TransportSerializationException and the error response is sent, never reaching TcpTransport#handleException
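A heavily simplified, hypothetical model of that request/response asymmetry (local stand-in types, not the actual org.opensearch.transport.InboundHandler code):

import java.io.IOException;
import java.io.InputStream;

// Sketch only: shows why request-side failures stay local while response-side
// unconsumed bytes escalate to TcpTransport#handleException.
public class InboundHandlingSketch {

    interface Message {
        boolean isRequest();
        InputStream stream();
    }

    void messageReceived(Message message) throws IOException {
        if (message.isRequest()) {
            try {
                deserializeAndDispatchRequest(message);
            } catch (Exception e) {
                // request-side failures never reach TcpTransport#handleException:
                // an error response is sent back on the same channel instead
                sendErrorResponse(message, e);
            }
        } else {
            deserializeAndDispatchResponse(message);
            if (message.stream().available() > 0) {
                // response-side unconsumed bytes escalate to
                // TcpTransport#handleException, which closes the TCP channel
                throw new IllegalStateException("Message not fully read (response); resetting");
            }
        }
    }

    // Placeholders standing in for the real dispatch/response plumbing.
    void deserializeAndDispatchRequest(Message message) throws IOException {}
    void deserializeAndDispatchResponse(Message message) throws IOException {}
    void sendErrorResponse(Message message, Exception cause) {}
}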

The suggested changes are intended to report more details regarding the cause of the (de)serialization failure and provide uniformity in handling (de)serialization issues:

  1. Treat java.io.EOFException as a (de)serialization issue, wrapping it into an IllegalStateException ("Message fully read (request) but more data is expected for requestId ..."): this does not change the current behaviour; only a more informative message is returned to the client.
  2. For responses, when the input stream has unconsumed bytes, throw the IllegalStateException and wrap it into a TransportSerializationException. This changes the current behaviour: the exception is handled in the regular way, without escalating to TcpTransport#handleException and closing the channel (the desired outcome), the same way java.io.EOFException is handled. An example is below (compare with the previous stack trace):
[2022-01-12T08:41:56,603][WARN ][o.o.t.InboundHandler     ] [...] Failed to deserialize response from [127.0.0.1/127.0.0.1:9300]
org.opensearch.transport.TransportSerializationException: Failed to deserialize response from handler [org.opensearch.transport.TransportService$ContextRestoreResponseHandler/org.opensearch.transport.TransportService$6/[cluster:monitor/state]:org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1@1314b3b2/org.opensearch.action.support.ChannelActionListener@4770f3a8]
        at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:320) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:155) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:109) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:759) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:170) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:145) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110) [opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:94) [transport-netty4-client-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-transport-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.69.Final.jar:4.1.69.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.69.Final.jar:4.1.69.Final]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.IllegalStateException: Message not fully read (response) for requestId [18], handler [org.opensearch.transport.TransportService$ContextRestoreResponseHandler/org.opensearch.transport.TransportService$6/[cluster:monitor/state]:org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1@1314b3b2/org.opensearch.action.support.ChannelActionListener@4770f3a8], error [false]; resetting
        at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:308) ~[opensearch-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
        ... 26 more
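A hypothetical sketch of what suggestion 2 could look like (local stand-in types, not necessarily the change that landed in #1885): the unconsumed-bytes case is wrapped into a TransportSerializationException and delivered to the response handler, so only that one request fails and the channel stays open.

import java.io.IOException;
import java.io.InputStream;

// Sketch only; the real handler and exception classes live in org.opensearch.transport.
public class ResponseDeserializationSketch {

    interface ResponseHandler {
        void handleException(Exception e);
    }

    /** Local stand-in for org.opensearch.transport.TransportSerializationException. */
    static class TransportSerializationException extends RuntimeException {
        TransportSerializationException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    void handleResponse(InputStream stream, ResponseHandler handler) throws IOException {
        // ... deserialize the response from 'stream' ...
        if (stream.available() > 0) {
            IllegalStateException cause =
                new IllegalStateException("Message not fully read (response); resetting");
            // handled like any other response failure; the channel stays open
            handler.handleException(
                new TransportSerializationException("Failed to deserialize response", cause));
            return;
        }
        // ... otherwise dispatch the deserialized response to the handler ...
    }
}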

@reta
Collaborator

reta commented Jan 14, 2022

@nknize sorry, I cannot do that myself; could you please add 2.0.0 and 1.3.0 as version labels? Much appreciated 🙏
