
ARTEMIS-4305 Zero persistence does not work in kubernetes #4899

Closed
wants to merge 4 commits

Conversation

@iiliev2 commented Apr 22, 2024

In a cluster deployed in Kubernetes, when a node is destroyed, the process is terminated and the network is shut down before the process has a chance to close its connections. A new node might then be brought up, reusing the old node’s IP. If this happens before the connection TTL expires, then from Artemis’ point of view it looks as if the connection came back. Yet it is actually not the same peer: it has a new node ID, etc. This messes things up with the cluster, as the old message flow record is invalid.

This also solves another, similar issue: if a node goes down and a new one comes up with a new nodeUUID and the same IP before the cluster connections on the other nodes time out, those nodes get stuck and list both the old and the new node in their topologies.

The changes are grouped into tightly related, incremental commits to make it easier to see what changed:

  1. Ping packets include the nodeUUID.
  2. Acceptors and connectors carry their TransportConfiguration.
  3. RemotingConnectionImpl#doBufferReceived checks incoming Pings for a nodeUUID mismatch with the target and flags the connection as unhealthy; ClientSessionFactoryImpl destroys unhealthy connections (in addition to connections that have not received any data in time). A rough sketch of this flow follows below.
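
For orientation, here is a minimal, self-contained sketch of that flag-and-destroy flow. The names below are illustrative stand-ins rather than the actual Artemis classes; in the PR the logic lives in RemotingConnectionImpl#doBufferReceived and ClientSessionFactoryImpl.

import java.util.UUID;

// Illustrative stand-in only. A Ping carries the sender's nodeUUID (commit 1), the connection
// knows which node it expects from its TransportConfiguration (commit 2), and a mismatch
// marks the connection unhealthy so it gets destroyed (commit 3).
final class NodeAwareConnection {

   private final UUID expectedTargetNodeUUID; // from the connector's TransportConfiguration; may be unknown (null)
   private volatile boolean healthy = true;

   NodeAwareConnection(UUID expectedTargetNodeUUID) {
      this.expectedTargetNodeUUID = expectedTargetNodeUUID;
   }

   // Called when a Ping arrives. If either side lacks a nodeUUID the check passes;
   // otherwise a different nodeUUID flags the connection as unhealthy.
   void pingReceived(UUID senderNodeUUID) {
      if (healthy && expectedTargetNodeUUID != null && senderNodeUUID != null
            && !expectedTargetNodeUUID.equals(senderNodeUUID)) {
         healthy = false;
      }
   }

   // The session factory's periodic TTL check would destroy a connection that is flagged
   // unhealthy, in addition to one that has not received any data in time.
   boolean shouldBeDestroyed() {
      return !healthy;
   }
}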

@jbertram (Contributor) commented Apr 26, 2024

I think you could simplify this quite a bit. Here's what I suggest...

  • Don't modify any packet aside from Ping and only modify it with a new byte[].
  • The broker should send its node ID in every Ping.
  • The first time the client receives a Ping it should save the node ID.
  • If a client ever receives a Ping with a node ID that is different from the one it has saved, then it should disconnect (a rough sketch of this rule follows below).
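
A minimal sketch of that save-and-compare rule, assuming the node ID arrives as a raw byte[] in each Ping as suggested above; the class and method names here are hypothetical, not actual Artemis API.

import java.util.Arrays;

// Hypothetical client-side tracker: remember the node ID from the first Ping that carries one,
// and request a disconnect if a later Ping reports a different node ID.
final class PingNodeIdTracker {

   private byte[] savedNodeId;

   // Returns true if the connection should be dropped because the sending node's ID changed.
   synchronized boolean shouldDisconnect(byte[] nodeIdFromPing) {
      if (nodeIdFromPing == null) {
         return false; // peer did not include a node ID; nothing to compare
      }
      if (savedNodeId == null) {
         savedNodeId = nodeIdFromPing.clone(); // first Ping seen: just remember it
         return false;
      }
      return !Arrays.equals(savedNodeId, nodeIdFromPing);
   }
}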

You also need tests to verify the fix and mitigate regressions in the future.

@iiliev2 (Author) commented Apr 26, 2024

Initially I attempted what you suggest, lazily initializing the node ID like that, precisely because I wanted to keep the code changes to a minimum. However, that ended up being much more complicated (rather than simpler) because of the way ClientSessionFactoryImpl creates a new connection object on reconnects. It is very hard to reason about, both when reading the code and when debugging it at runtime. So instead I filled in the missing gaps to use data that is already there anyway; it just wasn't being propagated deep enough.

IMO, from a functional standpoint, adding the TransportConfiguration to the connector (and connection) is the right thing to do here anyway. I assume that, for historical reasons, those classes were working with a subset of the data, and no one had a good reason to fix this until now. For example, NettyConnection#getConnectorConfig was creating a bogus transport configuration, even though a real configuration exists when the connection is created and simply was not being passed to it.
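
To make that concrete, here is a minimal sketch, with hypothetical names, of the difference between rebuilding a partial configuration on demand and carrying the configuration the connection was created from (the real types are NettyConnection and TransportConfiguration; their actual shape differs).

import java.util.Map;

// Hypothetical sketch only. The connection keeps the connector parameters it was created from,
// rather than reconstructing a partial set later the way NettyConnection#getConnectorConfig
// used to, which lost everything not rebuilt by hand (such as, after this PR, the target nodeUUID).
final class ConfiguredConnection {

   private final Map<String, Object> connectorParams; // e.g. host, port, and now the target nodeUUID

   ConfiguredConnection(Map<String, Object> connectorParams) {
      this.connectorParams = Map.copyOf(connectorParams);
   }

   // Returns the real configuration the connection was built with, not a synthetic one.
   Map<String, Object> getConnectorConfig() {
      return connectorParams;
   }
}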

Ping is already the only Packet being modified. Why do you want to use a raw byte[] rather than a UUID? IMO that would be more confusing; it suggests that other kinds of data could be contained there.

@iiliev2 force-pushed the topic/iiliev2/ARTEMIS-4305 branch from 7f8833c to 5eabef6 on June 7, 2024 14:33
@@ -408,10 +416,15 @@ public void endOfBatch(Object connectionID) {
   }

   private void doBufferReceived(final Packet packet) {
      if (isHealthy && !isCorrectPing(packet)) {
         isHealthy = false;

@iiliev2 (Author) commented:

Commenting this line out will effectively disable the fix. This will cause the new test ZeroPersistenceSymmetricalClusterTest to fail.

@clebertsuconic (Contributor) commented:

If this is an issue in Core, it will be an issue in AMQP as well. Shouldn't we make sure AMQP also takes care of this?

WDYT @jbertram @gemmellr @gtully @tabish121 ?

@jbertram (Contributor) commented Jun 7, 2024

I believe the use-case here only involves cluster nodes and the core connections between them. Therefore, I don't think AMQP is in view.

@iiliev2 force-pushed the topic/iiliev2/ARTEMIS-4305 branch from 5eabef6 to 4c9dabe on June 13, 2024 08:52
@iiliev2 (Author) commented Jun 13, 2024

Fixed the checkstyle issues.

* include the original node ID in `TransportConfiguration` decoding
* match Ping packets' nodeUUID against the connection's transport configuration target nodeUUID; if either side is missing this data, the match succeeds
* destroy a remoting connection if it ever becomes unhealthy (its Ping nodeUUID differs from the target's)
@iiliev2 force-pushed the topic/iiliev2/ARTEMIS-4305 branch from 4c9dabe to cd49b15 on July 30, 2024 14:13
@jbertram (Contributor) commented:

I pulled down your PR branch and executed org.apache.activemq.artemis.tests.integration.cluster.distribution.ZeroPersistenceSymmetricalClusterTest#test. I got this, which doesn't seem right:

java.lang.NullPointerException: Cannot invoke "java.util.concurrent.ScheduledExecutorService.scheduleAtFixedRate(java.lang.Runnable, long, long, java.util.concurrent.TimeUnit)" because "this.s" is null

	at org.apache.activemq.artemis.tests.integration.cluster.distribution.ZeroPersistenceSymmetricalClusterTest.expectTopology(ZeroPersistenceSymmetricalClusterTest.java:158)
	at org.apache.activemq.artemis.tests.integration.cluster.distribution.ZeroPersistenceSymmetricalClusterTest.test(ZeroPersistenceSymmetricalClusterTest.java:90)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	Suppressed: org.opentest4j.AssertionFailedError: test has left a server socket open on port 61616
		at org.apache.activemq.artemis.tests.extensions.PortCheckExtension.afterEach(PortCheckExtension.java:53)
		... 2 more

Also, I put together a proof-of-concept fix on https://github.com/jbertram/activemq-artemis/tree/ARTEMIS-4305-alt which seems like it would address the problem you're seeing but is much simpler than the fix in this PR. Can you test this fix?

@iiliev2 (Author) commented Aug 20, 2024

java.lang.NullPointerException: Cannot invoke "java.util.concurrent.ScheduledExecutorService.scheduleAtFixedRate(java.lang.Runnable, long, long, java.util.concurrent.TimeUnit)" because "this.s" is null

I had to update the test to work with JUnit 5 (I didn't notice on my last rebase that Artemis had upgraded from an earlier JUnit version).

Also, I put together a proof-of-concept fix

I am having trouble running the multicast discovery tests on my Mac. I set up my routes as described in https://issues.apache.org/jira/browse/ARTEMIS-2341 (like last time, when that was sufficient to make multicast discovery work).
I also added -Djava.net.preferIPv4Stack=true to the broker startup commands, but I am still getting a "Can't assign requested address" error (even in other tests like DiscoveryTest).

So as it stands, I am unable to verify whether your fix works. I was only able to look at the code, and as far as I can tell my concerns remain the same as I mentioned earlier. However, if your fix works reliably, I have no objections.

Could you please pull the latest version of test ZeroPersistenceSymmetricalClusterTest and run it against your fix?

@jbertram (Contributor) commented Aug 20, 2024

I added your test to the branch with my fix, and I can see my fix detecting a problem and closing the connection, but the test still fails, and I still see messages like this:

WARN  [org.apache.activemq.artemis.core.server] AMQ222139: MessageFlowRecordImpl [nodeID=13207315-5f2f-11ef-b63b-5c80b6f32172, connector=TransportConfiguration(name=netty-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory)?port=61616&host=localhost, queueName=$.artemis.internal.sf.my-cluster.13207315-5f2f-11ef-b63b-5c80b6f32172, queue=QueueImpl[name=$.artemis.internal.sf.my-cluster.13207315-5f2f-11ef-b63b-5c80b6f32172, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::name=localhost], temp=false]@3ca984c7, isClosed=false, reset=true]::Remote queue binding exampleQueue2dc45dca-5f2f-11ef-b5ea-5c80b6f32172 has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses

Is this the kind of message you see in your K8s cluster when this problem occurs, and is that what you were referring to in the Jira when you said this?

This messes things up with the cluster, the old message flow record is invalid.

I reproduced this with a very simple manual test with 2 clustered nodes with persistence disabled. When I kill -9 one node and restart it I see the AMQ222139 message on the other node. However, I resolved this by simply changing the configuration on the cluster-connection to use:

<reconnect-attempts>0</reconnect-attempts>

I then cherry-picked your ZeroPersistenceSymmetricalClusterTest test onto the main branch. The test fails by default, but when I change the various broker.xml files used by that test to use 0 reconnect-attempts, the test passes. Also, given the fact that persistence is disabled in your environment, this is the configuration I would recommend for you. I'm not sure why I didn't think of this before. Have you considered this configuration change in your environment? It seems this would resolve your problem with no code changes necessary.
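
Since the brokers in this scenario are embedded (per the summary later in this thread), here is a minimal sketch of the equivalent programmatic setup for a zero-persistence node with reconnect-attempts set to 0. The cluster name, connector names, and URIs below are placeholders, and the exact fluent API may differ slightly between Artemis versions.

import java.util.List;

import org.apache.activemq.artemis.core.config.ClusterConnectionConfiguration;
import org.apache.activemq.artemis.core.config.Configuration;
import org.apache.activemq.artemis.core.config.impl.ConfigurationImpl;
import org.apache.activemq.artemis.core.server.embedded.EmbeddedActiveMQ;

// Sketch of the recommended setting for a zero-persistence embedded broker:
// reconnect-attempts=0 on the cluster-connection, so a vanished peer is forgotten
// rather than reconnected to, and the replacement node joins as a brand-new member.
public class ZeroPersistenceNode {

   public static void main(String[] args) throws Exception {
      Configuration config = new ConfigurationImpl()
            .setPersistenceEnabled(false)
            .setSecurityEnabled(false)
            .addAcceptorConfiguration("netty-acceptor", "tcp://0.0.0.0:61616")
            .addConnectorConfiguration("netty-connector", "tcp://localhost:61616")
            .addClusterConfiguration(new ClusterConnectionConfiguration()
                  .setName("my-cluster")
                  .setConnectorName("netty-connector")
                  .setReconnectAttempts(0) // do not reconnect; let the replacement node rejoin as new
                  .setStaticConnectors(List.of("netty-connector")));

      EmbeddedActiveMQ broker = new EmbeddedActiveMQ();
      broker.setConfiguration(config);
      broker.start();
   }
}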

@iiliev2 (Author) commented Aug 21, 2024

Is this the kind of message ... you were referring to in the Jira

No, MessageFlowRecordImpl becoming bad means that the peer broker holds on to an instance of it which should have been discarded. We have a workaround hack in our code which detects when a peer broker is down or its identity has changed, and forces the local broker to evict its message flow record for that peer. Something like:

// Look up Topology#removeMember reflectively so we can evict the stale member ourselves.
this.removeMemberMethod = Topology.class
      .getDeclaredMethod("removeMember", long.class, String.class, boolean.class);
this.removeMemberMethod.setAccessible(true);
...
// Invoke the removal with an event ID later than the member's last known one.
removeMemberMethod.invoke(topology, topologyMember.getUniqueEventID() + 1000, topologyMemberId);

I reproduced this with a very simple manual test

I assume you mean you reproduced the WARN message. This is not what we are trying to fix.

I would recommend for you ... to use 0 reconnect-attempts ... given the fact that persistence is disabled

We were already running that earlier. That resulted in other bugs with the topology. I can't remember the exact details anymore (it was a couple of years ago), but sometimes when a peer came back after the reconnect attempts had stopped, it would no longer be re-admitted into the cluster.
We don't want to go back to a tried configuration which we know didn't work.

I added your test to the branch with my fix ... but the test still fails

Do you mean it is a false negative (the test failed but shouldn't have)? How do you verify that the topology recovers correctly?

@jbertram (Contributor) commented:

No, MessageFlowRecordImpl becoming bad means that the peer broker holds on to an instance of it which should have been discarded.

As far as I can tell the AMQ222139 message is a direct symptom of the problem you describe. I don't think that message would occur if the broker wasn't holding an instance of MessageFlowRecordImpl which should have been discarded. Therefore, I think we're talking about the same essential problem.

We were already running that earlier. That resulted in other bugs with the topology.

If that's the case then I think that is the bug we should be attempting to fix. For your particular use-case I think you should definitely be using 0 for reconnect-attempts.

We don't want to go back to a tried configuration which we know didn't work.

Your current configuration with reconnect-attempts > 0 also doesn't work, so I'm not clear what the difference is here. If neither configuration works, then you should use the configuration which is actually appropriate for your use-case and fix whatever bugs (if any) exist for that configuration, rather than fixing a bug for a configuration that isn't recommended.

As it stands, the recommended configuration fixes the problem with the test-case in this PR so this fix is not valid and won't be merged.

@iiliev2 (Author) commented Aug 21, 2024

I am not sure I understand why reconnect-attempts=0 should be considered better in our case.

All these are supposed to be valid configurations, right? In all cases we are talking about a bug being fixed. We are perfectly happy to use the current configuration (we prefer it, in fact, as it has been fairly stable for some time, apart from this one bug). Given that we have already spent so many resources on fixing this bug, I do not see a reason to use another configuration which we would have to fix all over again.

Even if our current configuration is not the "best" one by some criteria, why should this bug remain in the code?

@jbertram (Contributor) commented:

I am not sure I understand why reconnect-attempts=0 should be considered better in our case.

It's better because it actually fits your use-case. Since persistence is disabled, every time a broker restarts it will have a new identity. This is true even if you weren't on K8s. Since the broker's identity will change, the cluster-connection will also need to be torn down and recreated, so there's no reason to attempt to reconnect. Reconnecting a cluster connection only makes sense if the exact same server is coming back.

With reconnect-attempts > 0 and your fix from this PR, when a node leaves the cluster and gets recreated with a new identity using the same IP address, the cluster-connection will reconnect only to then be destroyed when it's discovered that the node's identity has changed. It would make more sense to simply not reconnect (i.e. use 0 for reconnect-attempts) and allow the node to join the cluster as any other new node would. This approach follows the current design and intended function of the cluster-connection.

All these are supposed to be valid configurations, right?

In your case I would say the configuration is not valid. To be clear, not all configurations are valid for all use-cases. For example, if your use-case required messages to survive a broker restart then setting persistence-enabled to false would be technically possible, but it would not be valid.

Even if our current configuration is not the "best" one by some criteria, why should this bug remain in the code?

It's not that your current configuration is not the "best." It's that your current configuration is directly leading to the problem with the stale MessageFlowRecordImpl and that could be fixed simply by changing your configuration.

When evaluating which changes should be merged into the code-base one must weigh the risk of the change (e.g. additional complexity, possibility for unintended consequences, performance impact, etc.) against the benefit. In this case I do not believe the benefits outweigh the risks given the fact that you're not using a valid configuration for your use-case. Ultimately I'm not sure I'd actually consider the current behavior a bug given that a valid configuration solves the issue.

That said, there may, in fact, be another bug that you hit with the valid configuration. I expect this bug would impact other users and would be worth investigating and resolving.

I empathize with you about the lost work/time regarding this PR. I know the feeling because I've lost lots of work myself when my own PRs were rejected. Ideally this is all part of the process to make the software better for everyone in the end.

Of course, you are free to maintain a fork of ActiveMQ Artemis and apply your own patches (e.g. this one) before deploying in your environment. That's one of the many great benefits of open source.

@iiliev2 (Author) commented Aug 21, 2024

For example, if your use-case required messages to survive a broker restart then setting persistence-enabled to false would be technically possible, but it would not be valid.

Yes, however that may be a recoverable situation: we could just fix our configuration. In this case, however, there is no recovering (unless we resort to hacks), no matter which configuration we use.

Running inside K8s is not the only use case we want to support. We also deploy in other modes, where nodes do keep their identities between restarts. There can be all kinds of variations in the deployment. We need a single configuration that works in all cases, not multiple different ones, each of which is prone to various bugs (as history has proven). We would never get anything resolved that way.
We communicated before we even started to work on this fix. The approach via the Ping packets was validated with you.
There are plenty of tests to guarantee the robustness of these changes.
Why would effectively 2 additional longs in one kind of (management) message be that big of a performance hit (if that is your concern here)? How can we profile this?

@jbertram (Contributor) commented:

I spoke at length with Ivan on Slack yesterday about this issue, and I wanted to summarize my thoughts here for posterity's sake.

The real issue that this PR is attempting to address is related to a quirk (for lack of a better word) in how TCP can sometimes work in a containerized environment. In short, it's possible for a TCP connection to be closed on one side without the other side receiving the appropriate RST. This is described in more detail here.

The use-case here involves a cluster of embedded brokers each running on separate K8s pods without persistence. If one of those pods is restarted by K8s using the same IP address then the "same" broker rejoins the cluster with a new identity (since the node ID is not persisted and reused) and when this TCP "quirk" happens the other clustered brokers don't clean up their cluster connection. Their TCP connection to the old broker apparently just becomes a connection to the new broker. This, of course, breaks things in weird ways with the internal state (e.g. MessageFlowRecord) of the cluster connection.

That said, the test on this PR is not testing this use-case. Instead it is testing a similar but technically different use-case where one node in a cluster is restarted with a new identity (i.e. node ID) on the same IP address. If the other nodes have reconnect-attempts > 0 on their cluster-connection they will reconnect to this node and things will break internally due to mismatched state. The solution to this problem is simply to use 0 for reconnect-attempts. No code changes are required. This is, of course, what I've recommended in previous comments. It took a conversation in Slack to clarify the actual use-case.

Ultimately, I do not believe the test on this PR is valid for the real problem that needs to be solved (i.e. the missing TCP RST). Furthermore, I'm not sure attempting to fix this problem at the level of the broker is appropriate. It seems to me that solving it at the network or infrastructure level would be more effective and wouldn't burden every other user with a protocol change for this edge case.

At this point I'm closing this PR as I don't see a future for it here.

@jbertram closed this on Aug 23, 2024