Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orientdb 3.0 - Problem with restarts when db is closed abruptly #7489

Closed
careerscale opened this issue Jun 13, 2017 · 21 comments
Closed

Orientdb 3.0 - Problem with restarts when db is closed abruptly #7489

careerscale opened this issue Jun 13, 2017 · 21 comments
Assignees
Labels
Milestone

Comments

@careerscale
Copy link

OrientDB Version: 3.0

Java Version: 1.8

OS: Win 7

Expected behavior

Database should stabilize and recover from any failures that might have caused because of abrupt close of db

Actual behavior

I think I have no idea why some times if db is closed (command window is closed) without control C, it doesn't work.

This never comes out of it. If it i doing some processing like fixing indexes, it must show status and let user be aware of it. going non responsive doesn't make sense at all.

2017-06-12 09:02:41:257 WARNI Record OFuzzyCheckpointStartRecord{lsn=OLogSequenceNumber{segment=0, position=16167388}} com.orientechnologies.orient.core.stora
ge.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=OLogSequenceNumber{segment=0, position=16167341}} will be skipped during 
data restore [OLocalPaginatedStorage]                                                                                                                         
2017-06-12 09:02:41:257 WARNI Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=OLogSequenceNumber{segme
nt=0, position=16167428}} will be skipped during data restore [OLocalPaginatedStorage]                                                                        
2017-06-12 09:02:41:258 WARNI Record OFuzzyCheckpointStartRecord{lsn=OLogSequenceNumber{segment=0, position=16167435}} com.orientechnologies.orient.core.stora
ge.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=OLogSequenceNumber{segment=0, position=16167388}} will be skipped during 
data restore [OLocalPaginatedStorage]                                                                                                                         
2017-06-12 09:02:41:259 WARNI Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=OLogSequenceNumber{segme
nt=0, position=16167475}} will be skipped during data restore [OLocalPaginatedStorage]                                                                        
2017-06-12 09:02:41:260 WARNI Record OFuzzyCheckpointStartRecord{lsn=OLogSequenceNumber{segment=0, position=16167482}} com.orientechnologies.orient.core.stora
ge.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=OLogSequenceNumber{segment=0, position=16167435}} will be skipped during 
data restore [OLocalPaginatedStorage]                                                                                                                         
2017-06-12 09:02:41:261 WARNI Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=OLogSequenceNumber{segme
nt=0, position=16167522}} will be skipped during data restore [OLocalPaginatedStorage]                                                                        
2017-06-12 09:02:41:261 WARNI Record OFuzzyCheckpointStartRecord{lsn=OLogSequenceNumber{segment=0, position=16167529}} com.orientechnologies.orient.core.stora
ge.impl.local.paginated.wal.OFuzzyCheckpointStartRecord{lsn=null, previousCheckpoint=OLogSequenceNumber{segment=0, position=16167482}} will be skipped during 
data restore [OLocalPaginatedStorage]                                                                                                                         
2017-06-12 09:02:41:262 WARNI Record com.orientechnologies.orient.core.storage.impl.local.paginated.wal.OFuzzyCheckpointEndRecord{lsn=OLogSequenceNumber{segme
nt=0, position=16167569}} will be skipped during data restore [OLocalPaginatedStorage]                                                                        
2017-06-12 09:02:41:263 INFO  Storage data recover was completed [OLocalPaginatedStorage]                                                                     
2017-06-12 09:02:44:302 INFO  {db=mlm} Wait till indexes restore after crash was finished. [OIndexManagerShared]       


Steps to reproduce

Try having active databases created and used and then try killing db few times while active transactions are going on and try restart. you may be able to reproduce.

@careerscale
Copy link
Author

@luigidellaquila FYI

@andrii0lomakin
Copy link
Member

It should be fixed once we merge micro transactions which should happen during this week

@andrii0lomakin andrii0lomakin added this to the 3.0 milestone Jun 13, 2017
@careerscale
Copy link
Author

@Laa
Thanks for the update. eagerly waiting for the merge into 3.0

@vsspt
Copy link

vsspt commented Jun 13, 2017

I verified the same problem, was actually going to open an issue for this.
This is the error i get :

com.orientechnologies.orient.core.index.OIndexException: Error on rebuilding the index for clusters: [test_file_store_3, test_file_store_2, test_file_store, test_file_store_1] DB name="testBD" --> com.orientechnologies.common.io.OIOException: You cannot access outside the file size (65536 bytes). You have requested portion 1024-66560 bytes. File: File: test_file_store_3.cpm os-size=1024, stored=65536

@kindofwhat
Copy link

if it helps: run the following (kotlin) script and kill the process anytime
`

import org.apache.tinkerpop.gremlin.orientdb.OrientGraphFactory

fun main(args: Array) {

val o = OrientGraphFactory("plocal:C:/windows/temp/testdb").tx

for (i in 0..1000) {
    o.addVertex("label")
    println("now would be a good time to kill the process")
    Thread.sleep(5000)
}

}
`

@careerscale
Copy link
Author

restarting the terminal and running server.bat again helped us in windows.

@andrii0lomakin
Copy link
Member

it should be already fixed in 3.0 version after we merged micro-txs and fixed restore procedure.
Could you check the latest snapshot?

@careerscale
Copy link
Author

@Laa

"restarting the terminal and running server.bat again helped us in windows." this still holds good for me.

but when I hit the issue crash with indexes, I still see the same problem

@andrii0lomakin
Copy link
Member

@careerscale what version do you use? In the latest snapshot indexes are not rebuilt on startup. Do you use the version built from sources?

@careerscale
Copy link
Author

it is the latest snapshot.

@vsspt
Copy link

vsspt commented Jul 3, 2017

I've used the 3.0.0m1 version.

@andrii0lomakin
Copy link
Member

@careerscale could you provide an output of the log which you get after the crash and restore, do you use Lucene indexes? Could you also send us thread dump at the moment when index rebuild is frozen? Could you provide thread dump as separate gist?

@careerscale
Copy link
Author

careerscale commented Jul 4, 2017

I am not using anything specific with Lucene.

Here is the log. latest with July 4th build
https://gist.github.com/laa/4c5bf854052f5b3e0596edb619d48cc1

@careerscale
Copy link
Author

@Laa , threaddump here, I think it should be straight forward to reproduce, I just stop my application and start again.
https://gist.github.com/careerscale/be46b284519ad1a0da0d51900ff08b48

@careerscale
Copy link
Author

@Laa, do you need anything else here?

@andrii0lomakin
Copy link
Member

andrii0lomakin commented Jul 5, 2017

Hi @careerscale , no. Right now the problem is clear. It is not related to storage restore process itself but is related to new database lifecycle

in this thread

	at com.orientechnologies.orient.core.index.OIndexManagerShared.waitTillIndexRestore(OIndexManagerShared.java:324)
	at com.orientechnologies.orient.core.index.OIndexManagerShared.recreateIndexes(OIndexManagerShared.java:303)
	at com.orientechnologies.orient.core.index.OIndexManagerProxy.recreateIndexes(OIndexManagerProxy.java:200)
	at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.initAtFirstOpen(ODatabaseDocumentEmbedded.java:352)
	at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.internalOpen(ODatabaseDocumentEmbedded.java:125)
	at com.orientechnologies.orient.core.db.document.ODatabaseDocumentEmbedded.internalOpen(ODatabaseDocumentEmbedded.java:110)
	at com.orientechnologies.orient.core.db.OrientDBEmbedded.open(OrientDBEmbedded.java:129)

lock on OrientDBEmbedded instance is taken

and in this thread

"OrientDB rebuild indexes" #57 daemon prio=5 os_prio=0 tid=0x0000000028900800 nid=0x6214 waiting for monitor entry [0x000000002d6de000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.orientechnologies.orient.core.db.OrientDBEmbedded.openNoAuthenticate(OrientDBEmbedded.java:104)
	- waiting to lock <0x00000006c594b548> (a com.orientechnologies.orient.core.db.OrientDBEmbedded)
	at com.orientechnologies.orient.server.OServer.openDatabase(OServer.java:841)
	at com.orientechnologies.orient.server.OSystemDatabase.openSystemDatabase(OSystemDatabase.java:90)

we wait for the same lock. But first thread waits for second thread. So we have got dealock.
It is not related to storage restore procedure but to the new lifecycle so assigned to @tglman .

@careerscale
Copy link
Author

Thanks @Laa for the analysis. @tglman when you resolve the issue, please do let me know so that I can validate it.

@tglman
Copy link
Member

tglman commented Jul 6, 2017

hi @careerscale,

Just pushed a fix that should solve the problem.

Regards

@careerscale
Copy link
Author

careerscale commented Jul 6, 2017

@tglman, threaddump that I took sometime before. https://gist.github.com/careerscale/7f98b58d63ee5815594817db3c906e87 Looks like it didnt get the fix you made (my computer didnt pull the snapshots, may be it is because it has slightly older ones

@andrii0lomakin
Copy link
Member

Guys, please do not put huge thread dumps inside of issue itself, because after that it is quite inconvenient to track history of such issues. We always can embed big amount of texts inside gist service.

@careerscale
Copy link
Author

I think this was older snapshot. It seems good with snapshot no. 708.

@robfrank robfrank modified the milestones: 3.0, 3.0.0-M2 Jul 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

7 participants