Replication breaks down after vertices and edges added while one of the servers is down #1520

Closed
w3aponx opened this issue Jun 12, 2013 · 17 comments


w3aponx commented Jun 12, 2013

OrientDB release

OrientDB Graph Edition 1.4.0 final

What steps will reproduce the problem?

  1. Download OrientDB Graph Edition 1.4.0 from orientdb.org
  2. Unzip in c:\orient1\ and c:\orient2\ (two copies of the entire zip contents)
  3. Launch a few command prompts, cd to the bin directories of both software copies and run dserver. The intent is to have two clustered servers running. Ensure that the log shows that the two see each other through Hazelcast
  4. In another command window, launch a console. We'll use it to talk to the first instance, but at every step of the way you can verify the results on both servers using Studio (ports 2480 and 2481, respectively):

orientdb> connect remote:localhost/tinkerpop admin admin
Connecting to database [remote:localhost/tinkerpop] with user 'admin'...OK

orientdb> create vertex
Created vertex 'V#9:809 v0' in 0.097000 sec(s).

orientdb> create vertex
Created vertex 'V#9:810 v0' in 0.045000 sec(s).

orientdb> create edge from #9:809 to #9:810
Error: com.orientechnologies.orient.core.exception.OCommandExecutionException: > Error on execution of command: sql.create edge from #9:809 to #9:810
Error: java.lang.IllegalArgumentException: Edge label can not be null
Note: according to https://github.com/nuvolabase/orientdb/wiki/SQL-Create-Edge#create-a-new-edge-between-two-vertices-of-the-base-class-e-namely-ographedge this should work; I guess the documentation should be updated?

orientdb> create edge record from #9:809 to #9:810
Error: com.orientechnologies.orient.core.sql.OCommandSQLParsingException: Error on parsing command at position #0: Error on parsing command at position #0: Class RECORD was not found

Oops, I guess when using SQL syntax I need to declare the edge type even though I'll be using only lightweight edges; let me fix that quickly

orientdb> create class record extends E
Error: com.orientechnologies.orient.server.distributed.ODistributedException: Error on executing remote operation in SYNCHRONOUS mode against node: Member [192.168.2.27]:2435
Error: com.orientechnologies.common.exception.OException: java.util.concurrent.ExecutionException: com.orientechnologies.orient.server.distributed.ODistributedException: Error on executing remote operation in SYNCHRONOUS mode against node: Member [192.168.2.27]:2434
Error: java.util.concurrent.ExecutionException: com.orientechnologies.orient.server.distributed.ODistributedException: Error on executing remote operation in SYNCHRONOUS mode against node: Member [192.168.2.27]:2434
Error: com.orientechnologies.orient.server.distributed.ODistributedException: Error on executing remote operation in SYNCHRONOUS mode against node: Member [192.168.2.27]:2434
Error: com.orientechnologies.common.exception.OException: java.util.concurrent.ExecutionException: com.orientechnologies.orient.core.exception.OCommandExecutionException: Class record already exists
Error: java.util.concurrent.ExecutionException: com.orientechnologies.orient.core.exception.OCommandExecutionException: Class record already exists
Error: com.orientechnologies.orient.core.exception.OCommandExecutionException: Class record already exists
Error: com.orientechnologies.orient.core.exception.OCommandExecutionException: Class record already exists

(Hummm... I guess it's trying to tell me that the "record" class already exists... Even though it didn't just 2 seconds ago. Checking in Studio, it does indeed exist now. Moving on...)

orientdb> create edge record from #9:809 to #9:810
Created edge '[{in:#9:810,out:#9:809}]' in 0.083000 sec(s).

orientdb> create vertex
Created vertex 'V#9:811 v0' in 0.044000 sec(s).

orientdb> create edge record from #9:809 to #9:811
Created edge '[{in:#9:811,out:#9:809}]' in 0.095000 sec(s).

So at this point we have a "central" vertex #9:809 that points to other vertices (we have two for now) via "record" edges. Everything is perfectly replicated between the two servers. So far so good. Now we want to stop the second server (simulating a partial system failure), continue working with the first (we're redundant, right?) until the "failed" server is restarted, at which point it is supposed to re-align.

Stop the second server. As per the documentation, in Windows the best way to achieve this is to close the command shell. Once stopped, continue in the console:

orientdb> create vertex
Created vertex 'V#9:812 v0' in 0.022000 sec(s).

orientdb> create edge record from #9:809 to #9:812
Created edge '[{in:#9:812,out:#9:809}]' in 0.041000 sec(s).

You can verify that everything is created properly in the one functional server, then restart the server that you previously stopped.

What is the expected output?

What we would expect, and what partially works, is that the restarted server realigns on start-up and, as a result, gets the new vertex and edge that were added while it was down. There should be no exceptions logged on the server (and if there are, the user should be able to learn about the issue without the client freezing). After the initial resync, standard replication between the two servers should resume. Any connected client should see no major impact, other than maybe a short delay while the servers realign, and should be able to read and write to the database soon after the server is restarted.

What do you see instead?

While replaying the log of events that occurred since the server was stopped, the OHazelcastPlugin raises OConcurrentModificationExceptions that the client application never sees and can't do anything about. They keep recurring periodically, forever: they never resolve on their own, and we have no way to resolve them either. Typically the second server ends up with partially replicated data (up to the point where it hit the exception, I guess). If you're lucky everything will have been replicated, but there is still another issue.

As soon as you write anything to the database, your client will lock up forever (or until the request times out if it was invoked with a timeout).
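
As an aside, the only way we found to keep our client from hanging indefinitely on such a write is to bound it ourselves. Here is a minimal sketch, assuming the 1.4 document API and the same connection details as the console session above (as far as we know, OrientDB gives us no per-call timeout here, so we fall back to plain java.util.concurrent):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.sql.OCommandSQL;

public class BoundedWrite {
  public static void main(String[] args) throws Exception {
    ExecutorService exec = Executors.newSingleThreadExecutor();
    Future<Object> write = exec.submit(new Callable<Object>() {
      public Object call() {
        // Open the database on the worker thread, since OrientDB binds
        // connections to the current thread.
        ODatabaseDocumentTx db =
            new ODatabaseDocumentTx("remote:localhost/tinkerpop").open("admin", "admin");
        try {
          // The same kind of write that hangs in the console session.
          return db.command(new OCommandSQL("update #9:809 set toto = 'yes'")).execute();
        } finally {
          db.close();
        }
      }
    });
    try {
      write.get(10, TimeUnit.SECONDS); // give up instead of blocking forever
    } catch (TimeoutException e) {
      System.err.println("write did not finish in 10s; node is probably stuck aligning");
      write.cancel(true);
    } finally {
      exec.shutdown();
    }
  }
}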

Here is the beginning of the log of the server as it restarts:

C:\orient2\orientdb-graphed-1.4.0\bin>dserver

[logo omitted]

2013-06-11 21:10:13:726 INFO Profiler is recording metrics with configuration: 0,0,0 [OJVMProfiler]
2013-06-11 21:10:13:747 INFO Loading configuration from: C:/dev/orientdb-graphed-1.4.0b/config/orientdb-dserver-config.xml... [OServerConfigurationLoaderXml]
2013-06-11 21:10:14:137 INFO OrientDB Server v1.4.0 (build 268) is starting up... [OServer]
2013-06-11 21:10:15:439 INFO -> Loaded memory database 'temp' [OServer]
2013-06-11 21:10:15:488 INFO Port 0.0.0.0:2424 busy, trying the next available... [OServerNetworkListener]
2013-06-11 21:10:15:492 INFO Listening binary connections on 0.0.0.0:2425 [OServerNetworkListener]
2013-06-11 21:10:15:496 INFO Port 0.0.0.0:2480 busy, trying the next available... [OServerNetworkListener]
2013-06-11 21:10:15:500 INFO Listening http connections on 0.0.0.0:2481 [OServerNetworkListener]
2013-06-11 21:10:15:512 INFO Installing GREMLIN language v.2.4.0-SNAPSHOT [OGraphServerHandler]
2013-06-11 21:10:15:562 INFO Configuring Hazelcast from 'C:\dev\orientdb-graphed-1.4.0b\config\hazelcast.xml'. [FileSystemXmlConfig]
2013-06-11 21:10:16:281 INFO [192.168.2.27]:2435 [orientdb] Hazelcast Community Edition 2.1.2 (20120530) starting at Address[192.168.2.27]:2435 [system]
2013-06-11 21:10:16:283 INFO [192.168.2.27]:2435 [orientdb] Copyright (C) 2008-2012 Hazelcast.com [system]
2013-06-11 21:10:16:541 INFO [192.168.2.27]:2435 [orientdb] Address[192.168.2.27]:2435 is STARTING [LifecycleServiceImpl]
2013-06-11 21:10:16:677 INFO [192.168.2.27]:2435 [orientdb] 63406 accepted socket connection from /192.168.2.27:2434 [ConnectionManager]
2013-06-11 21:10:22:202 INFO [192.168.2.27]:2435 [orientdb]

Members [2] {
Member [192.168.2.27]:2434
Member [192.168.2.27]:2435 this
}
[ClusterManager]
2013-06-11 21:10:23:745 INFO [192.168.2.27]:2435 [orientdb] Address[192.168.2.27]:2435 is STARTED [LifecycleServiceImpl]
2013-06-11 21:10:23:756 WARN DISTRIBUTED <> opening database tinkerpop... [OHazelcastPlugin]
2013-06-11 21:10:23:763 WARN DISTRIBUTED Initializing storage 'tinkerpop' [OStorageSynchronizer]
2013-06-11 21:10:24:103 WARN DISTRIBUTED <> opening database temp... [OHazelcastPlugin]
2013-06-11 21:10:24:159 WARN DISTRIBUTED <> updated node status to 'aligning' for 192.168.2.27:2435 [OHazelcastPlugin]
2013-06-11 21:10:24:168 WARN DISTRIBUTED -> detected running nodes [192.168.2.27:2434] [OHazelcastPlugin]
2013-06-11 21:10:24:175 WARN DISTRIBUTED <> updated node status to 'aligning' for 192.168.2.27:2435 [OHazelcastPlugin]
2013-06-11 21:10:24:181 WARN DISTRIBUTED --> send align request in broadcast for database tinkerpop from 1370998214714:51 [OHazelcastPlugin]
2013-06-11 21:10:24:185 INFO DISTRIBUTED ->[192.168.2.27:2434/tinkerpop] setting node in alignment state [OHazelcastPlugin]
2013-06-11 21:10:24:199 WARN DISTRIBUTED --> send align request in broadcast for database temp from -1:-1 [OHazelcastPlugin]
2013-06-11 21:10:24:203 INFO DISTRIBUTED ->[192.168.2.27:2434/temp] setting node in alignment state [OHazelcastPlugin]
2013-06-11 21:10:24:236 INFO Mail plugin installed and active. Loaded 1 profile(s): [default] [OMailPlugin]
2013-06-11 21:10:24:243 INFO OrientDB Server v1.4.0 is active. [OServer]
2013-06-11 21:10:24:249 WARN DISTRIBUTED <-[192.168.2.27:2434/temp] alignment ended: 0 operation(s) [OAlignResponseDistributedTask]
2013-06-11 21:10:24:255 WARN DISTRIBUTED <-[192.168.2.27:2434/tinkerpop] executing group of 3 command(s) [OMultipleDistributedTasks]
2013-06-11 21:10:24:260 WARN Journaled operation RECORD_CREATE #9:812 as #1370998214714.67 [ODatabaseJournal]
2013-06-11 21:10:24:265 WARN DISTRIBUTED <-[192.168.2.27:2434/tinkerpop] CREATE RECORD #9:812 v.1 [OCreateRecordDistributedTask]
2013-06-11 21:10:24:293 WARN Updating status operation #1370998214714.67 rid #9:812 [ODatabaseJournal]
2013-06-11 21:10:24:302 WARN Journaled operation RECORD_UPDATE #9:809 as #1370998214714.68 [ODatabaseJournal]
2013-06-11 21:10:24:305 WARN DISTRIBUTED <-[192.168.2.27:2434/tinkerpop] UPDATE RECORD #9:809 v.2 [OUpdateRecordDistributedTask]
2013-06-11 21:10:24:318 WARN Updating status operation #1370998214714.68 rid null [ODatabaseJournal]
2013-06-11 21:10:24:322 WARN Journaled operation RECORD_UPDATE #9:812 as #1370998214714.69 [ODatabaseJournal]
2013-06-11 21:10:24:325 WARN DISTRIBUTED <-[192.168.2.27:2434/tinkerpop] UPDATE RECORD #9:812 v.0 [OUpdateRecordDistributedTask]DISTRIBUTED -> error on execution of operation in ASYNCH mode against node: Member [192.168.2.27]:2434
com.orientechnologies.orient.server.distributed.ODistributedException: Error on executing remote operation in SYNCHRONOUS mode against node: Member [192.168.2.27]:2435
-> com.hazelcast.core.DistributedTask.get(DistributedTask.java:126)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin$1.done(OHazelcastPlugin.java:199)
-> com.hazelcast.impl.ExecutorManager$3.run(ExecutorManager.java:355)
-> com.hazelcast.impl.executor.ParallelExecutorService$ParallelExecutorImpl$ExecutionSegment.run(ParallelExecutorService.java:212)
-> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
-> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
-> java.lang.Thread.run(Unknown Source)
-> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)
Error on executing remote operation in SYNCHRONOUS mode against node: Member [192.168.2.27]:2435
-> com.hazelcast.core.DistributedTask.get(DistributedTask.java:126)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin$1.done(OHazelcastPlugin.java:199)
-> com.hazelcast.impl.ExecutorManager$3.run(ExecutorManager.java:355)
-> com.hazelcast.impl.executor.ParallelExecutorService$ParallelExecutorImpl$ExecutionSegment.run(ParallelExecutorService.java:212)
-> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
-> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
-> java.lang.Thread.run(Unknown Source)
-> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)
com.orientechnologies.orient.core.exception.OConcurrentModificationException: Cannot UPDATE the record #9:812 because the version is not the latest. Probably you are updating an old record or it has been modified by another user (db=v1 your=v0)
-> com.hazelcast.core.DistributedTask.get(DistributedTask.java:126)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin$1.done(OHazelcastPlugin.java:199)
-> com.hazelcast.impl.ExecutorManager$3.run(ExecutorManager.java:355)
-> com.hazelcast.impl.executor.ParallelExecutorService$ParallelExecutorImpl$ExecutionSegment.run(ParallelExecutorService.java:212)
-> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
-> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
-> java.lang.Thread.run(Unknown Source)
-> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)
Cannot UPDATE the record #9:812 because the version is not the latest. Probably you are updating an old record or it has been modified by another user (db=v1 your=v0)
-> com.hazelcast.core.DistributedTask.get(DistributedTask.java:126)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin$1.done(OHazelcastPlugin.java:199)
-> com.hazelcast.impl.ExecutorManager$3.run(ExecutorManager.java:355)
-> com.hazelcast.impl.executor.ParallelExecutorService$ParallelExecutorImpl$ExecutionSegment.run(ParallelExecutorService.java:212)
-> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
-> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
-> java.lang.Thread.run(Unknown Source)
-> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)
Cannot UPDATE the record #9:812 because the version is not the latest. Probably you are updating an old record or it has been modified by another user (db=v1 your=v0)
-> com.hazelcast.core.DistributedTask.get(DistributedTask.java:126)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin$1.done(OHazelcastPlugin.java:199)
-> com.hazelcast.impl.ExecutorManager$3.run(ExecutorManager.java:355)
-> com.hazelcast.impl.executor.ParallelExecutorService$ParallelExecutorImpl$ExecutionSegment.run(ParallelExecutorService.java:212)
-> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
-> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
-> java.lang.Thread.run(Unknown Source)
-> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)
2013-06-11 21:12:14:751 WARN DISTRIBUTED <-[192.168.2.27:2434/tinkerpop] align request starting from 1370998214714.69 [OAlignRequestDistributedTask]

In this case (but not always), when I looked at Studio on the second server, everything was replicated as expected for vertices 809, 810, 811 and 812. However, the server is actually in a pretty bad state at that point. As long as you're just reading from the database, you're fine. As soon as you try writing anything, it locks up solid. Back to the console to demonstrate:

orientdb> update #9:809 set toto="yes"

This update command never returns!

Custom settings

None, this is all reproducible with a brand new installation, default configuration and the dserver launch script. The steps above are Windows-oriented, but we get the exact same results under Linux and when the two processes are on two physical servers.


w3aponx commented Jun 12, 2013

After the second server is restarted, here is what the cluster status is:

{
    "@type":"d","@version":0,
    "members":[{
        "@type":"d","@version":0,
        "id":"10.0.24.166:2434",
        "listeners":[{"protocol":"ONetworkProtocolBinary","listen":"10.0.24.166:2424"},
                     {"protocol":"ONetworkProtocolHttpDb","listen":"10.0.24.166:2480"}],
        "alias":"10.0.24.166:2434","status":"online"
    },{
        "@type":"d","@version":0,
        "id":"10.0.24.166:2435",
        "listeners":[{"protocol":"ONetworkProtocolBinary","listen":"10.0.24.166:2425"},
                     {"protocol":"ONetworkProtocolHttpDb","listen":"10.0.24.166:2481"}],
        "alias":"10.0.24.166:2435",
        "status":"aligning"
    }],
    "name":"_hzInstance_1_orientdb",
    "local":"10.0.24.166:2434"}

In other words, the server remains stuck "aligning" forever.


w3aponx commented Jun 14, 2013

Here's the kind of error you get on the first server (the one that was online all along) just after the "failed" server restarts. Note that this log was taken from a different test run, running 1.4.1-SNAPSHOT (ccd9b3b), but the pattern is the same every time on each server:

2013-06-14 13:41:26:009 INFO [10.0.24.166]:2434 [orientdb] 2434 is accepting socket connection from /10.0.24.166:56876 [SocketAcceptor]
2013-06-14 13:41:26:009 INFO [10.0.24.166]:2434 [orientdb] 2434 accepted socket connection from /10.0.24.166:56876 [ConnectionManager]
2013-06-14 13:41:31:521 INFO [10.0.24.166]:2434 [orientdb]

Members [2] {
Member [10.0.24.166]:2434 this
Member [10.0.24.166]:2435
}
[ClusterManager]
2013-06-14 13:41:31:540 INFO [10.0.24.166]:2434 [orientdb] Re-partitioning cluster data... Immediate-Tasks: 271, Scheduled-Tasks: 0 [PartitionManager]
2013-06-14 13:41:33:460 INFO Sent updated cluster configuration to the remote client 127.0.0.1:56708 [OClientConnectionManager]
2013-06-14 13:41:33:470 INFO Sent updated cluster configuration to the remote client 127.0.0.1:56708 [OClientConnectionManager]
2013-06-14 13:41:33:476 INFO DISTRIBUTED <-[10.0.24.166:2435/temp] align request starting from -1.-1 [OAlignRequestDistributedTask]
2013-06-14 13:41:33:479 INFO DISTRIBUTED ->[10.0.24.166:2435/temp] aligned 0 operations [OAlignRequestDistributedTask]
2013-06-14 13:41:33:480 INFO DISTRIBUTED <-[10.0.24.166:2435/MirandaDS] align request starting from 1371227209096.271 [OAlignRequestDistributedTask]
2013-06-14 13:41:33:481 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_create(#18:2 v.4) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:482 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:2 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:482 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:2 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:482 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:2 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:482 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#13:0 v.7) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:483 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:2 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:483 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_create(#18:3 v.4) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:483 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:3 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:483 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:3 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:483 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:3 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:484 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#13:0 v.7) [OAlignRequestDistributedTask]
2013-06-14 13:41:33:484 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:3 v.3) [OAlignRequestDistributedTask]DISTRIBUTED -> error on execution of operation in SYNCHRONOUS mode
com.orientechnologies.orient.core.exception.OConcurrentModificationException: Cannot UPDATE the record #18:2 because the version is not the latest. Probably you are updating an old record or it has been modified by another user (db=v4 your=v3)
-> com.hazelcast.core.DistributedTask.get(DistributedTask.java:126)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.executeOperation(OHazelcastPlugin.java:707)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.sendOperation2Node(OHazelcastPlugin.java:211)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.flushBufferedTasks(OAlignRequestDistributedTask.java:128)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.call(OAlignRequestDistributedTask.java:106)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.call(OAlignRequestDistributedTask.java:1)
-> com.hazelcast.impl.ExecutorManager$RequestExecutor.run(ExecutorManager.java:224)
-> com.hazelcast.impl.executor.ParallelExecutorService$ParallelExecutorImpl$ExecutionSegment.run(ParallelExecutorService.java:212)
-> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
-> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
-> java.lang.Thread.run(Thread.java:722)
-> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)
Cannot UPDATE the record #18:2 because the version is not the latest. Probably you are updating an old record or it has been modified by another user (db=v4 your=v3)
-> com.hazelcast.core.DistributedTask.get(DistributedTask.java:126)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.executeOperation(OHazelcastPlugin.java:707)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.sendOperation2Node(OHazelcastPlugin.java:211)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.flushBufferedTasks(OAlignRequestDistributedTask.java:128)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.call(OAlignRequestDistributedTask.java:106)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.call(OAlignRequestDistributedTask.java:1)
-> com.hazelcast.impl.ExecutorManager$RequestExecutor.run(ExecutorManager.java:224)
-> com.hazelcast.impl.executor.ParallelExecutorService$ParallelExecutorImpl$ExecutionSegment.run(ParallelExecutorService.java:212)
-> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
-> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
-> java.lang.Thread.run(Thread.java:722)
-> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)
Cannot UPDATE the record #18:2 because the version is not the latest. Probably you are updating an old record or it has been modified by another user (db=v4 your=v3)
-> com.hazelcast.core.DistributedTask.get(DistributedTask.java:126)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.executeOperation(OHazelcastPlugin.java:707)
-> com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.sendOperation2Node(OHazelcastPlugin.java:211)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.flushBufferedTasks(OAlignRequestDistributedTask.java:128)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.call(OAlignRequestDistributedTask.java:106)
-> com.orientechnologies.orient.server.task.OAlignRequestDistributedTask.call(OAlignRequestDistributedTask.java:1)
-> com.hazelcast.impl.ExecutorManager$RequestExecutor.run(ExecutorManager.java:224)
-> com.hazelcast.impl.executor.ParallelExecutorService$ParallelExecutorImpl$ExecutionSegment.run(ParallelExecutorService.java:212)
-> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
-> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
-> java.lang.Thread.run(Thread.java:722)
-> com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:38)

The thing that looks very suspicious to me is that when the operations are replayed, the version number of the record being updated jumps back and forth between v4 and v3 (it gets created as v4, then updated as v3 many times, etc.). I would have expected either that the operations are replayed in sequence (first created as v1 (or is it v0?), then updated to v2, v3 and finally v4), or that all the changes are bundled into a single update (create the record as v4, with all properties modified in previous steps included).
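
To illustrate what that version check normally protects against, here is a minimal sketch of the MVCC behavior from a client's perspective (assuming the 1.4 document API; the record id matches the session above): writing back a copy that carries a stale version raises exactly the OConcurrentModificationException the alignment task keeps hitting.

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.exception.OConcurrentModificationException;
import com.orientechnologies.orient.core.id.ORecordId;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class StaleVersionDemo {
  public static void main(String[] args) {
    ODatabaseDocumentTx db =
        new ODatabaseDocumentTx("remote:localhost/tinkerpop").open("admin", "admin");
    try {
      // Load two independent copies of the same record (ignore the level-1
      // cache so we really get two in-memory instances).
      ODocument first = db.load(new ORecordId("#9:812"), null, true);
      ODocument second = db.load(new ORecordId("#9:812"), null, true);

      first.field("a", 1);
      first.save(); // bumps the stored version, e.g. v0 -> v1

      second.field("b", 2);
      try {
        second.save(); // still carries v0, so the version check rejects it
      } catch (OConcurrentModificationException e) {
        System.err.println(e.getMessage()); // "... (db=v1 your=v0)"
      }
    } finally {
      db.close();
    }
  }
}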

It would be really nice to have some sort of evaluation here, so that we at least know what the expected behavior is, and if and when this could be fixed. Or if we're on our own, just let us know; at least we'll know where we stand and can make appropriate decisions based on complete and accurate information. Thanks!


w3aponx commented Jun 14, 2013

After that, the re-alignment attempts keep occurring periodically (every two minutes), but starting with the second attempt we can see that it skips the first record_create operation, since that one succeeded on the first attempt:

2013-06-14 13:45:33:496 INFO DISTRIBUTED <-[10.0.24.166:2435/MirandaDS] align request starting from 1371227209096.275 [OAlignRequestDistributedTask]
2013-06-14 13:45:33:499 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:2 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:45:33:500 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:2 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:45:33:500 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:2 v.3) [OAlignRequestDistributedTask]
2013-06-14 13:45:33:501 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#13:0 v.7) [OAlignRequestDistributedTask]
2013-06-14 13:45:33:501 INFO DISTRIBUTED ->[10.0.24.166:2435/MirandaDS] operation record_update(#18:2 v.3) [OAlignRequestDistributedTask]
[...]

The subsequent operations, however, never succeed (even the ones that are theoretically independent of the one that fails).


lvca commented Jun 17, 2013

We fixed 4 replication issues in the last few days. Could you retry against 1.4.1-SNAPSHOT, please?


w3aponx commented Jun 17, 2013

Thanks for the update, Luca.

It does look much better now, although on my first test run I encountered some bizarre results. I had the correct total edge count on both servers, but some of the edges were created as edge documents instead of lightweight edges (vertex properties). Semantically this is equivalent, so by itself it would be acceptable, but the problem is that all the edges created this way incorrectly pointed to vertex #18:0 instead of the new vertices like #18:4, #18:5 and so on. I'm not sure whether this was caused by the particular timing or sequence of operations, a bug in the test code, or something else. One particularity of this degenerate test is that the different vertices all have exactly the same properties and values (aside from their identity), so if any OrientDB code hopes to differentiate them by value (I doubt that's the case, but thought I'd mention it anyway), it won't work.

After re-running some scenarios where I double-checked the state of the client and the various servers at every step, however, I was able to verify that in most cases at least the end result is correct.

One thing I'd like to mention, however, is that when the "primary" server (the one the application initially connects to) is down and we automatically fall back to the "secondary" server, everything slows down considerably (going from a few milliseconds to something like 30 seconds to re-read a table containing about 5 vertices with 3 properties each, for example). Now granted, we do force a reload of vertices explicitly in our code to bypass the cache (for now), which might not be optimal. I have yet to investigate this further, but wanted to give you a heads-up; maybe you'll know what causes this. It does feel like every operation tries to talk to the primary server first, even when we could (or should) already know it's down, and has to time out before falling back to the second server. The additional delays appear to be proportional to the number of vertices the table in question contains.
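
For reference, the forced reload I mentioned looks roughly like this in our code; a minimal sketch, assuming the 1.4 document API (the class and helper names are illustrative):

import java.util.List;

import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

public class ForcedReload {
  // Re-read every record of a class, deliberately bypassing the local cache.
  // Against the primary this takes milliseconds; against the secondary it
  // took ~30s for a table of about 5 vertices before the fix.
  static List<ODocument> reloadAll(ODatabaseDocumentTx db, String className) {
    List<ODocument> docs =
        db.query(new OSQLSynchQuery<ODocument>("select from " + className));
    for (ODocument doc : docs)
      doc.reload(null, true); // no fetch plan, ignoreCache = true
    return docs;
  }
}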

So, thanks again and I'll keep you posted regarding the progress of my investigation.


w3aponx commented Jun 17, 2013

One more thing I forgot to mention... You may recall that we're running OrientDB embedded in our server application. We noticed that when we shut down OrientDB without stopping the entire application, it keeps receiving (and apparently processing) alignment requests from other running instances, even after shutdown is reportedly complete:

2013-06-17 12:00:36:762 INFO Orient Engine shutdown complete
[Orient]
2013-06-17 12:01:27:080 WARN DISTRIBUTED <> updated node status to 'aligning' for 10.0.24.166:2435 [OHazelcastPlugin]
2013-06-17 12:01:27:081 WARN DISTRIBUTED --> send align request in broadcast for database temp from -1:-1 [OHazelcastPlugin]
2013-06-17 12:01:27:081 INFO DISTRIBUTED ->[10.0.24.166:2434/temp] setting node in alignment state [OHazelcastPlugin]
2013-06-17 12:01:27:081 INFO Sent updated cluster configuration to the remote client 10.0.24.166:51761 [OClientConnectionManager]
2013-06-17 12:01:27:084 WARN DISTRIBUTED --> send align request in broadcast for database MirandaDS from 1371482718016:82 [OHazelcastPlugin]
2013-06-17 12:01:27:084 INFO DISTRIBUTED ->[10.0.24.166:2434/MirandaDS] setting node in alignment state [OHazelcastPlugin]
2013-06-17 12:01:27:085 INFO DISTRIBUTED <-[10.0.24.166:2434/temp] alignment ended: 0 operation(s) [OAlignResponseDistributedTask]
2013-06-17 12:01:27:088 INFO DISTRIBUTED <-[10.0.24.166:2434/MirandaDS] alignment ended: 0 operation(s) [OAlignResponseDistributedTask]
2013-06-17 12:01:27:089 WARN DISTRIBUTED <> updated node status to 'online' for 10.0.24.166:2435 [OHazelcastPlugin]
2013-06-17 12:01:27:090 INFO Sent updated cluster configuration to the remote client 10.0.24.166:51761 [OClientConnectionManager]

I'm not sure if this is expected or not, but it does sound like a potentially risky proposition. After all, once the server is shut down we would expect all files to be closed, so that we could technically launch a separate instance accessing those same files. Sounds like a sort of resource leak?


lvca commented Jun 17, 2013

The delay could be due to the reconnection logic trying the same server and waiting for the timeout rather than using the second server. I'm investigating it. The post-shutdown activity is a bug: I'm figuring out why that component is still alive. As for other synchronization problems, please report any reproducible weird behavior so we can fix it quickly.


lvca commented Jun 17, 2013

I've just fixed the post-shutdown behavior in the hotfix-1.4.1 branch. Please let me know if anything similar happens again.


lvca commented Jun 17, 2013

Just fixed the delay on reconnection. Please let me know if it works as expected. It's all in the "hotfix-1.4.1" branch.


w3aponx commented Jun 17, 2013

Luca, things are looking better by the minute!

The replication aspect is holding up so far, I'll let you know if we encounter more funny issues.

The delays when failing over to the second server have mostly disappeared (talking to the first server: 4 ms; talking to the second server: 40 ms now, versus 30 s before). I'm not vouching for the 4 ms vs. 40 ms figures at this point; the difference may be due to something else entirely. All I know is that it's now good enough for me to start looking at other issues in the system.

Finally, the post-shutdown behavior doesn't look like it got fixed completely. Here's what happens when I shut down the server with com.orientechnologies.orient.core.Orient.instance().shutdown():

2013-06-17 14:39:49,729 [HTTP-DB] DEBUG com.miranda.icontrol.ds.persistence.orientdb.OrientDataManager - ODatabaseListener.onClose - OrientDB[local:databases/MirandaDS](users: 2)

2013-06-17 14:39:49:729 INFO Shutting down storage: temp... [Orient]
2013-06-17 14:39:49:730 INFO Shutting down storage: MirandaDS... [Orient]
2013-06-17 14:39:49:781 INFO Orient Engine shutdown complete
[Orient]com.orientechnologies.common.concur.OTimeoutException: Timeout on acquiring exclusive lock against resource: /10.0.24.166:60683
at com.orientechnologies.common.concur.resource.OSharedResourceTimeout.acquireExclusiveLock(OSharedResourceTimeout.java:67)
at com.orientechnologies.common.concur.resource.OSharedResourceExternalTimeout.acquireExclusiveLock(OSharedResourceExternalTimeout.java:13)
at com.orientechnologies.orient.server.network.protocol.binary.OBinaryNetworkProtocolAbstract.sendError(OBinaryNetworkProtocolAbstract.java:164)
at com.orientechnologies.orient.server.network.protocol.binary.OBinaryNetworkProtocolAbstract.execute(OBinaryNetworkProtocolAbstract.java:136)
at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:43)

Here you can see that our ORecordHook triggers are still invoked after the DB is shut down:

2013-06-17 14:40:08,639 [OrientDB <- BinaryClient (/10.0.24.166:60998)] DEBUG c.m.i.ds.p.orientdb.ClientUpdateHook - onTrigger - BEFORE_READ - #0:1{schemaVersion:4,classes:[20]} v68

More synchronous updates from another server:

2013-06-17 14:40:54:425 WARN DISTRIBUTED -> routing operation RECORD_CREATE in SYNCHRONOUS mode against 10.0.24.166:2434/MirandaDS... [OHazelcastPlugin]
2013-06-17 14:40:54:436 INFO Journaled operation RECORD_CREATE #18:-1 as #1371493941148.98 [ODatabaseJournal]
2013-06-17 14:40:54:437 INFO DISTRIBUTED <-[10.0.24.166:2435/MirandaDS] CREATE RECORD #18:-1 v.0 [OCreateRecordDistributedTask]
2013-06-17 14:40:54,437 [OrientDB <- BinaryClient (/10.0.24.166:61089)] DEBUG c.m.i.ds.p.orientdb.ClientUpdateHook - onTrigger - BEFORE_CREATE - DSRecord
2013-06-17 14:40:54,437 [OrientDB <- BinaryClient (/10.0.24.166:61089)] DEBUG c.m.i.ds.p.orientdb.ClientUpdateHook - onTrigger - AFTER_CREATE - DSRecord#18:6 v0
2013-06-17 14:40:54:437 INFO Updating status operation #1371493941148.98 rid #18:6 [ODatabaseJournal]2013-06-17 14:40:54,438 [OrientDB <- BinaryClient (/10.0.24.166:61089)] DEBUG c.m.i.ds.p.orientdb.ClientUpdateHook - onTrigger - AFTER_CREATE - DSRecord#18:6 v0

More alignment requests:

2013-06-17 14:41:46:845 INFO DISTRIBUTED <-[10.0.24.166:2434/temp] align request starting from -1.-1 [OAlignRequestDistributedTask]
2013-06-17 14:41:46:846 INFO DISTRIBUTED ->[10.0.24.166:2434/temp] aligned 0 operations [OAlignRequestDistributedTask]

Still trying to communicate with remote clients:

2013-06-17 14:41:46:846 WARN Cannot push cluster configuration to the client 10.0.24.166:60683 [OClientConnectionManager]
2013-06-17 14:41:46:846 WARN Cannot push cluster configuration to the client 10.0.24.166:60683 [OClientConnectionManager]
2013-06-17 14:41:46:847 WARN Cannot push cluster configuration to the client 10.0.24.166:60998 [OClientConnectionManager]

Sometimes succeeding?

2013-06-17 14:41:46:873 INFO Sent updated cluster configuration to the remote client 10.0.24.166:61122 [OClientConnectionManager]

So I think the server may no longer send alignment requests once it's shut down, but it's still doing quite a few things that it probably shouldn't.

On the other hand, these are much lower concerns to me than the replication problems we originally had, and I can't thank you enough for your help so far!


lvca commented Jun 17, 2013

A question: how do you shut down OrientDB? Do you shut down the embedded server? Do you see the message "OrientDB Server is shutting down..." when the engine goes off? As for the help, thank you for the issues: our goal is to provide rock-solid replication, and we need the help of users like you ;-)


w3aponx commented Jun 17, 2013

Extract from a previous message (lots of content, I know):

Here's what happens when I shutdown the server with com.orientechnologies.orient.core.Orient.instance().shutdown():

2013-06-17 14:39:49,729 [HTTP-DB] DEBUG com.miranda.icontrol.ds.persistence.orientdb.OrientDataManager - ODatabaseListener.onClose - OrientDB[local:databases/MirandaDS](users: 2)

2013-06-17 14:39:49:729 INFO Shutting down storage: temp... [Orient]
2013-06-17 14:39:49:730 INFO Shutting down storage: MirandaDS... [Orient]
2013-06-17 14:39:49:781 INFO Orient Engine shutdown complete


w3aponx commented Jun 19, 2013

Hi Luca, just want to be clear: is Orient.instance().shutdown() sufficient or should we do something more/different?

Also, do you want to close this issue and perhaps open separate ones for the side issues, like the shutdown clean-up? I don't want this issue to grow in so many directions that it can never be closed, or for the derived issues to get lost in the noise.

ghost assigned lvca Jun 19, 2013

lvca commented Jun 19, 2013

If you start the server in embedded mode you have to invoke its shutdown() method as well; otherwise the server is still active.
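
A minimal sketch of the full embedded lifecycle, assuming the 1.x server API (the configuration path is illustrative):

import java.io.File;

import com.orientechnologies.orient.core.Orient;
import com.orientechnologies.orient.server.OServer;
import com.orientechnologies.orient.server.OServerMain;

public class EmbeddedLifecycle {
  public static void main(String[] args) throws Exception {
    // Start the embedded server from a dserver-style XML configuration.
    OServer server = OServerMain.create();
    server.startup(new File("config/orientdb-dserver-config.xml"));
    server.activate();

    // ... application work ...

    // Stop the server itself (network listeners, Hazelcast plugin, ...),
    // not just the engine: Orient.instance().shutdown() alone leaves the
    // server components running, which matches the behavior reported above.
    server.shutdown();
    Orient.instance().shutdown();
  }
}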


lvca commented Jul 11, 2013

Hi, could you test it again against 1.5.0-SNAPSHOT? We made a lot of improvements.


lvca commented Jul 25, 2013

Any chance you could test it against 1.5?


lvca commented Jul 25, 2013

I'm going to close the ticket; if the problem persists, please reopen or comment on this issue.

lvca closed this as completed Jul 25, 2013