Replication breaks down after vertices and edges added while one of the servers is down #1520
After the second server is restarted, here is the cluster status:

{
  "@type": "d", "@version": 0,
  "members": [{
    "@type": "d", "@version": 0,
    "id": "10.0.24.166:2434",
    "listeners": [
      { "protocol": "ONetworkProtocolBinary", "listen": "10.0.24.166:2424" },
      { "protocol": "ONetworkProtocolHttpDb", "listen": "10.0.24.166:2480" }
    ],
    "alias": "10.0.24.166:2434",
    "status": "online"
  }, {
    "@type": "d", "@version": 0,
    "id": "10.0.24.166:2435",
    "listeners": [
      { "protocol": "ONetworkProtocolBinary", "listen": "10.0.24.166:2425" },
      { "protocol": "ONetworkProtocolHttpDb", "listen": "10.0.24.166:2481" }
    ],
    "alias": "10.0.24.166:2435",
    "status": "aligning"
  }],
  "name": "_hzInstance_1_orientdb",
  "local": "10.0.24.166:2434"
}

In other words, the server remains stuck "aligning" forever. |
Here's the kind of error you get on the first server (the one that was online all along) just after the "failed" server restarts. Note that this log was taken in a different test run, running 1.4.1-SNAPSHOT (ccd9b3b), but the pattern is the same every time on each server:
The thing that looks very suspicious to me is that when the operations are replayed, the version number of the record being updated jumps back and forth between v4 and v3 (it gets created as v4, then updated as v3 many times, and so on). I would have expected either that the operations are replayed in sequence (first created as v1 (or is it v0?), then updated to v2, v3 and finally v4), or that all the changes are bundled into a single update (create the record as v4, with all properties modified in previous steps included).

It would be really nice to have some sort of evaluation here, so that we at least know what the expected behavior is, and if and when it could be fixed. Or, if we're on our own, just let us know; at least we'll know where we stand and can make appropriate decisions based on complete and accurate information. Thanks! |
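The expected-in-sequence behavior described above is the usual shape of optimistic (MVCC) version checking. As a point of reference, here is a minimal sketch of that check; this is hypothetical illustration code, not OrientDB internals, and all names in it are made up:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of MVCC-style optimistic versioning (hypothetical, not OrientDB code).
class VersionedRecord {
    private final AtomicInteger version = new AtomicInteger(0);
    private String value;

    int version() { return version.get(); }

    // An update must carry the version it was based on; a stale version is rejected,
    // which is the same general shape as an OConcurrentModificationException.
    synchronized void update(int expectedVersion, String newValue) {
        if (expectedVersion != version.get())
            throw new IllegalStateException(
                "stale update: expected v" + expectedVersion + " but record is v" + version.get());
        this.value = newValue;
        version.incrementAndGet();
    }
}
```

Replaying operations in their original order (v0, then v1, then v2, ...) satisfies such a check at every step; replaying an update stamped v3 against a record that is already at v4 cannot, which would explain the permanent exceptions.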
After that, the re-alignment attempts keep occurring periodically (every two minutes), but starting from the second one we can see that it skips the first
The subsequent operations, however, never succeed (even the ones that are theoretically independent from the one that fails). |
We fixed four replication issues over the last few days. Could you retry against 1.4.1-SNAPSHOT, please? |
Thanks for the update Luca. It does look much better now, although on my first test run I encountered some bizarre results. I had the correct total edge count on both servers, but some of the edges were created as documents instead of properties. Semantically this is equivalent, so by itself it would be acceptable, but the problem is that all the edges created this way were incorrectly pointing to vertex #18:0 instead of new vertices like #18:4, #18:5 and so on. I'm not sure if this was caused by the particular timing or sequence of operations, a bug in the test code, or something else. One particularity of this degenerate test is that the different vertices all have the exact same properties and values (aside from their identity), so if some OrientDB code is hoping to differentiate them by value (I doubt it's the case, but thought I would mention it anyway), it won't work.

After re-running some scenarios where I double-check the state of the client and various servers at every step of the way, however, I was able to verify that in most cases at least the end result is correct.

One thing I'd like to mention, however, is that when the "primary" server (the one the application initially connects to) is down and we automatically fall back to the "secondary" server, everything slows down considerably (going from a few milliseconds to something like 30 seconds to re-read a table containing about 5 vertices with 3 properties each, for example). Now granted, we do force reload of vertices explicitly in our code to bypass the cache (for now), which might not be optimal. I have yet to investigate this further, but wanted to mention it so you have a heads-up; maybe you'll know what causes this. It does feel like every operation tries to talk to the primary server first, even when we could (or should) know it's down already, and has to time out before falling back to the second server.
The additional delays appear to be proportional to the number of vertices that the table in question contains. So, thanks again and I'll keep you posted regarding the progress of my investigation. |
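The per-operation timeout hypothesis above is consistent with a client that always tries the configured primary first. A minimal sketch of the opposite pattern, remembering the last responsive server so later calls skip the dead one; this is a hypothetical helper, not OrientDB's remote client:

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical failover helper: remembers which server answered last,
// so later calls skip the dead primary and its connect timeout.
class Failover<T> {
    private final List<Supplier<T>> servers; // each supplier throws on failure/timeout
    private int preferred = 0;               // index of last known-good server

    Failover(List<Supplier<T>> servers) { this.servers = servers; }

    T call() {
        for (int i = 0; i < servers.size(); i++) {
            int idx = (preferred + i) % servers.size(); // start from the last good server
            try {
                T result = servers.get(idx).get();
                preferred = idx; // stick with the responsive server
                return result;
            } catch (RuntimeException dead) {
                // fall through to the next server in the list
            }
        }
        throw new IllegalStateException("no server reachable");
    }
}
```

Without the `preferred` index, every single operation pays the full connect timeout against the dead primary, which would match the 30-second reads described above.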
One more thing I forgot to mention... You may recall that we're running OrientDB embedded in our server application. We noticed that when we shut down OrientDB without stopping the entire application, it keeps receiving (and apparently processing) alignment requests from other running instances, even after shutdown is reportedly complete:
I'm not sure whether this is expected, but it does sound like a potentially risky proposition. After all, once the server is shut down we would expect all files to be closed, so that we could technically launch a separate instance accessing those same files. It sounds like some sort of leak? |
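The behavior described above is the classic shape of a request-handling thread that the shutdown path never stops. A generic sketch of the pattern (hypothetical names, not OrientDB code): if shutdown only flips a flag without tearing down the executor, the "stopped" server keeps accepting work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the suspected leak: a request-handling executor
// that survives shutdown unless it is explicitly stopped and awaited.
class LeakyServer {
    private final ExecutorService handlers = Executors.newSingleThreadExecutor();
    final AtomicInteger handled = new AtomicInteger();
    private volatile boolean running = true;

    void onRequest() {
        // A correctly stopped server rejects this; a leaky one keeps processing.
        handlers.submit(() -> { if (running) handled.incrementAndGet(); });
    }

    // Buggy variant: only flips the flag; the executor thread stays alive.
    void shutdownLeaky() { running = false; }

    // Correct variant: stop accepting work AND tear down the thread pool.
    void shutdownClean() throws InterruptedException {
        running = false;
        handlers.shutdown();
        handlers.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

With `shutdownClean()`, any request arriving afterwards fails fast instead of being silently processed by a component that is supposedly gone.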
The delay could be due to the fact that reconnection tries the same server and waits for the timeout, rather than using the second server. I'm investigating it. The activity after shutdown is a bug: I'm figuring out why that component is still alive. As for other synchronization problems, please report any reproducible weird behavior so we can fix it quickly. |
I've just fixed the post-shutdown behavior in the hotfix-1.4.1 branch. Please let me know if anything similar happens again. |
Just fixed the delay on reconnection. Please let me know if it works as expected. It's all in the "hotfix-1.4.1" branch. |
Luca, things are looking better by the minute! The replication aspect is holding up so far; I'll let you know if we encounter more funny issues. The delays when failing over to the second server have mostly disappeared (talking to the first server: 4 ms; talking to the second server: 40 ms now, 30 s before). I'm not vouching for the 4 ms vs. 40 ms times at this point; they may be due to something else entirely. All I know is that it's now good enough for me to start looking at other issues in the system. Finally, the post-shutdown behavior doesn't look like it got fixed completely. Here's what happens when I shut down the server with
Here you can see that our
More synchronous updates from another server:
More alignment requests:
Still trying to communicate with remote clients:
Sometimes succeeding?
So I think the server may no longer send alignment requests once it's shut down, but it's still doing quite a few things that it probably shouldn't. On the other hand, these are much smaller concerns to me than the replication problems we originally had, and I can't thank you enough for your help so far! |
Question: how do you shut down OrientDB? Do you shut down the embedded server? Do you see the message "OrientDB Server is shutting down..." when the engine goes off? As for the help, thank you for the issues: our goal is to provide rock-solid replication, and we need the help of users like you ;-) |
Extract from a previous message (lots of content, I know): Here's what happens when I shut down the server with
|
Hi Luca, just want to be clear: is
Also, do you want to close this issue and perhaps open other ones for the side issues like the shutdown clean-up? I don't want the issue to grow in different directions in such a way that it can never be closed, or so that the derived issues get lost in the noise. |
If you start the server in embedded mode, you have to invoke the shutdown() method; otherwise the server stays active. |
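The advice above fits a try/finally lifecycle around the embedded server. Here is a minimal sketch of that pattern; the `EmbeddedService` interface is a stand-in for OrientDB's embedded server, and its method names are illustrative only:

```java
// Hypothetical embedded-server lifecycle sketch (the interface is a stand-in
// for an embedded OrientDB server; names are illustrative, not the real API).
interface EmbeddedService {
    void startup() throws Exception;
    void shutdown();
}

class Application {
    static void runWith(EmbeddedService server, Runnable work) throws Exception {
        server.startup();
        try {
            work.run();
        } finally {
            // Without this call, the embedded server's threads keep running inside
            // the JVM even though the application believes the database is stopped.
            server.shutdown();
        }
    }
}
```

The finally block guarantees shutdown runs even if the application's own work throws, which avoids exactly the "still processing alignment requests after shutdown" situation discussed earlier in this thread.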
Hi, could you test it again against 1.5.0-SNAPSHOT? We made a lot of improvements. |
Any chance you could test it against 1.5? |
I'm going to close the ticket; if the problem persists, please reopen or comment on this issue. |
OrientDB release
OrientDB Graph Edition 1.4.0 final
What steps will reproduce the problem?
Install to c:\orient1\ and c:\orient2\ (two copies of the entire zip contents). cd to the bin directories of both software copies and run dserver. The intent is to have two clustered servers running. Ensure that the log shows that the two see each other through Hazelcast.
Oops, I guess when using SQL syntax I need to declare the edge type even though I'll be using only lightweight edges; let me fix that quickly
(Hummm... I guess it's trying to tell me that the "record" class already exists... Even though it didn't just 2 seconds ago. Checking in Studio, it does indeed exist now. Moving on...)
So at this point we have a "central" vertex #9:809 that points to other vertices (we have two for now) via "record" edges. Everything is perfectly replicated between the two servers. So far so good. Now we want to stop the second server (simulating a partial system failure), continue working with the first (we're redundant, right?) until the "failed" server is restarted, at which point it is supposed to re-align.
Stop the second server. As per the documentation, in Windows the best way to achieve this is to close the command shell. Once stopped, continue in the console:
You can verify that everything is created properly in the one functional server, then restart the server that you previously stopped.
What is the expected output?
What we would expect, and what partially works, is that the restarted server realigns on start-up, and as a result gets the new vertex and edge that were added while it was down. There should be no exceptions logged on the server (and if there are, the user should be able to learn about the issue without the client freezing). After the initial resync, standard replication between the two servers should resume. Any connected client should see no major impact, other than maybe a short delay while the servers realign, and should be allowed to read and write to the database soon after the server is restarted.
What do you see instead?
While replaying the log of events that occurred since the server was stopped, the OHazelcastPlugin creates OConcurrentModificationExceptions that the client application never sees and can't do anything about. They keep occurring periodically and forever (they can't get resolved on their own, but we have no way to resolve them either). Typically what you will see on the second server is that the data has been partially replicated (up to the point where it encountered the exception, I guess). If you're lucky everything will have been replicated, but there is still another issue: as soon as you write anything to the database, your client will lock up forever (or until the request times out, if it was invoked with a timeout).
Here is the beginning of the log of the server as it restarts:
In this case (but not always), when I looked at Studio on the second server everything was replicated as expected, for vertices 809, 810, 811 and 812. However, the server is actually in a pretty bad state at that point. As long as you're just reading from the database, you're fine. As soon as you try writing anything, it will lock up pretty solid. Back to the console to demonstrate:
This update command never returns!
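Since the hang only resolves when the request carries a timeout, one client-side mitigation is to bound every potentially-hanging call with a deadline. A generic sketch using a `Future` (hypothetical wrapper, not the OrientDB client API):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Generic sketch: run a potentially-hanging database call under a deadline,
// so the client surfaces a timeout instead of blocking forever.
class BoundedCall {
    // Daemon threads so a hung call cannot keep the JVM alive on exit.
    private static final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    static <T> T callWithTimeout(Callable<T> op, long millis) throws Exception {
        Future<T> f = pool.submit(op);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // interrupt the hung operation rather than leaking it
            throw e;
        }
    }
}
```

This does not fix the server-side lock-up, but it turns a silent, indefinite client hang into an error the application can at least react to.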
Custom settings
None, this is all reproducible with a brand new installation, default configuration and the dserver launch script. The steps above are Windows-oriented, but we get the exact same results under Linux and when the two processes are on two physical servers.