Deadlocks between writing threads #329
Comments
Hi, I have not tried the above query yet, but does it consistently reproduce the deadlock?
Yes, the countdown latches ensure we systematically replicate the deadlock. After 10 seconds, thread 1 can be sure thread 2 has changed the topology and tries to acquire the lock.
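A minimal sketch of how countdown latches can force the lock-order inversion deterministically (all names are hypothetical, not Sqlg's): `lockJ` stands in for the Java topology lock and `lockP` for the row lock Postgres takes implicitly on uncommitted writes. Each thread takes one lock, waits at the latch until the other thread holds its lock, and only then goes for the second lock. A `tryLock` timeout is used here so the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockReplication {

    static final ReentrantLock lockJ = new ReentrantLock(); // stand-in: Java topology lock
    static final ReentrantLock lockP = new ReentrantLock(); // stand-in: implicit Postgres lock

    private static Thread worker(ReentrantLock first, ReentrantLock second,
                                 CountDownLatch bothHoldFirst, AtomicBoolean deadlocked) {
        return new Thread(() -> {
            first.lock();
            try {
                bothHoldFirst.countDown();
                try {
                    bothHoldFirst.await();                  // both threads now hold one lock each
                    if (!second.tryLock(1, TimeUnit.SECONDS)) {
                        deadlocked.set(true);               // would block forever without the timeout
                    } else {
                        second.unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            } finally {
                first.unlock();
            }
        });
    }

    public static boolean replicateDeadlock() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        AtomicBoolean deadlocked = new AtomicBoolean(false);
        Thread t1 = worker(lockJ, lockP, bothHoldFirst, deadlocked); // topology changer
        Thread t2 = worker(lockP, lockJ, bothHoldFirst, deadlocked); // plain data writer
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return deadlocked.get();
    }

    public static void main(String[] args) {
        System.out.println("deadlock replicated: " + replicateDeadlock());
    }
}
```

The latch guarantees the inversion happens on every run, which is what makes the replication systematic rather than timing-dependent.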
Ok, well it's great that we can reproduce the deadlock now. I am not sure all sorts of tricks will help, as they might just push the deadlock down to Postgres itself. This is why I moved the lock to Java in the first place: to force topology change statements to only ever happen in one thread/transaction at a time. I need to look a bit deeper at it. We might have to lock even more aggressively if the current strategy still leads to deadlocks. Regarding rereading the topology: I'd be hesitant to do that, as it is quite slow. Our production dbs are broad, 20 000 - 40 000 tables. It takes a while to populate the topology from the db.
I've had a chance to look at the deadlock a bit. Makes sense? Opinions, ideas?
At first glance this seems to make perfect sense. Can we detect with 100% accuracy if we're writing something? I would suppose so. We need to handle the case where one thread commits and then immediately continues writing, that is, it checks if there are any pending locks. I suppose writing transactions need to hold a read lock on the topology lock.
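The read-lock suggestion above can be sketched with a fair `ReentrantReadWriteLock` (class and method names here are hypothetical, not Sqlg's actual implementation): every writing transaction holds the read side, so plain data writers run concurrently with each other, while a topology change takes the write side and therefore waits for all in-flight writers to finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TopologyGate {

    private final ReentrantReadWriteLock topologyLock = new ReentrantReadWriteLock(true);

    public void beginDataWrite()      { topologyLock.readLock().lock(); }
    public void endDataWrite()        { topologyLock.readLock().unlock(); }
    public void beginTopologyChange() { topologyLock.writeLock().lock(); }
    public void endTopologyChange()   { topologyLock.writeLock().unlock(); }

    // Demonstrates that two data writers overlap: returns the maximum number
    // of writers observed inside the gate at the same time.
    public static int maxConcurrentWriters() {
        TopologyGate gate = new TopologyGate();
        AtomicInteger inside = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        CountDownLatch bothInside = new CountDownLatch(2);
        Runnable writer = () -> {
            gate.beginDataWrite();
            try {
                max.accumulateAndGet(inside.incrementAndGet(), Math::max);
                bothInside.countDown();
                bothInside.await();          // hold the read lock until both writers arrive
                inside.decrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                gate.endDataWrite();
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return max.get();
    }

    public static void main(String[] args) {
        System.out.println("max concurrent writers: " + maxConcurrentWriters());
    }
}
```

Note this alone does not remove the deadlock risk discussed later in the thread: a thread that already holds the read side and then wants the write side still interacts badly with locks held inside Postgres.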
Ok good, I'll give it a bash on the weekend. Fortunately we know when a write call is made via the API.
I added a fix for this on the All the tests are passing, including a new test in Please have a look to see if you agree with the implementation. Concurrency is a tad complex, so I am not entirely convinced the current solution has no deadlock scenarios. I'll be doing some more tests during the week on my work code to see if all is good.
Hi, my test now passes, but I'm not too sure about the code change. The await call has no termination, so the thread will wait forever, no? Shouldn't there be a timeout? Also, if thread 1 starts writing some data, then thread 2 starts too but is locked out by Postgres and has to wait for thread 1 to commit, and thread 1 decides to do a topology change, we'll have a deadlock again, no? Because thread 1 will wait for thread 2 to commit before taking the lock...
Do we really need all this explicit locking? I mean, Postgres does some locking for us; couldn't we use the fact that topology changes impact Postgres data inside the sqlg schema to ensure nobody steps on anybody's toes? I suppose this doesn't guarantee no deadlocks will happen in Postgres, which you were trying to avoid in the first place.
I'll add the timeouts. Any opinion on a sensible default? For the topology lock I am currently going with a hardcoded 2 minutes. I'll make it configurable. Regarding your example, Regarding taking all these locks: I am not exactly happy with the recent refactor and the adding of two extra locks. However, before having the first topology lock, Postgres would deadlock all the time. Multi-threaded topology change is not really a feature of Postgres. I suspect part of the reason it's more difficult on Postgres is that the topology changes are actually transactional. For HSQLDB, H2, MariaDB and MySQL, schema changes are not transactional. They secretly commit the transaction immediately after executing the statement, basically breaking the transaction contract. Regarding "I mean Postgres does some locking for us, couldn't we use the fact that topology changes impact Postgres data inside the sqlg schema to ensure nobody steps on anybody's toes?"
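The timeout idea can be sketched as follows (names hypothetical, and the 100 ms in the demo stands in for the configurable default discussed above): instead of a bare `lock()` that can wait forever, the topology lock is taken with `tryLock(timeout)` and surfaces a `TimeoutException` the application can handle.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TimedTopologyLock {

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private final long timeoutMillis;

    public TimedTopologyLock(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    public void lockForTopologyChange() throws TimeoutException {
        try {
            if (!lock.writeLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new TimeoutException(
                        "topology lock not acquired within " + timeoutMillis + "ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new TimeoutException("interrupted while waiting for topology lock");
        }
    }

    public void unlockTopologyChange() { lock.writeLock().unlock(); }
    public void beginDataWrite()       { lock.readLock().lock(); }
    public void endDataWrite()         { lock.readLock().unlock(); }

    // Demo: a writer holds the read side, so the topology change times out
    // instead of hanging. Returns true if the timeout fired.
    public static boolean timesOutUnderContention() {
        TimedTopologyLock topo = new TimedTopologyLock(100);
        CountDownLatch writerHoldsLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            topo.beginDataWrite();
            writerHoldsLock.countDown();
            try { release.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            topo.endDataWrite();
        });
        writer.start();
        boolean timedOut = false;
        try {
            writerHoldsLock.await();
            topo.lockForTopologyChange();
            topo.unlockTopologyChange();
        } catch (TimeoutException e) {
            timedOut = true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        release.countDown();
        try { writer.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return timedOut;
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + timesOutUnderContention());
    }
}
```

A timeout turns the silent hang into a diagnosable failure, though as the thread notes later, 2 minutes is still a long time to discover a deadlock.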
Btw, I also spent some time reading https://github.com/npgall/concurrent-locks. However, it won't work for us, as the update lock only allows one thread at a time.
The cause of the deadlock is the acquisition of a lock in Java (let's call it J) and some lock in Postgres (call it P, acquired implicitly by Postgres write requests) in varying order:
=> This pattern is known for causing deadlocks. T1 acquires J, T2 acquires P, T1 starts to wait for P, T2 starts to wait for J - deadlock. The assumption here is that a schema modification always creates a new SchemaData object which replaces the former one once the update is completed. Because of this, the member Topology.schemaData always refers to a consistent schema data structure, and there is no need to hold a lock in Java for longer. In reality it is a bit more complicated, because you also have to detect in updateSchema() that concurrent schema updates have occurred and merge them appropriately into the modifiedSchema before it is assigned to the member Topology.schemaData. Given that notifications of schema updates by other JVMs can also be merged, this should be doable (within the synchronized block in updateSchema).
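The replace-the-snapshot proposal can be sketched with an `AtomicReference` (the class and field names mirror the comment's `Topology.schemaData` and `updateSchema`, but the implementation and the reduction of SchemaData to a plain map are illustrative assumptions): an update builds a modified copy and publishes it with compare-and-set, re-reading and re-merging if a concurrent update won the race. Readers never need a lock, because every snapshot they see is internally consistent.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class ImmutableTopology {

    static final class SchemaData {
        final Map<String, String> tables; // placeholder for the real schema structures
        SchemaData(Map<String, String> tables) { this.tables = Map.copyOf(tables); }
    }

    private final AtomicReference<SchemaData> schemaData =
            new AtomicReference<>(new SchemaData(Map.of()));

    public SchemaData current() { return schemaData.get(); }

    public void updateSchema(String table, String definition) {
        while (true) {
            SchemaData before = schemaData.get();
            Map<String, String> merged = new HashMap<>(before.tables);
            merged.put(table, definition);          // merge this change into a copy
            if (schemaData.compareAndSet(before, new SchemaData(merged))) {
                return;                             // published atomically
            }
            // a concurrent update replaced the snapshot first: re-read and re-merge
        }
    }

    // Demo: concurrent updates are all retained. Returns the final table count.
    public static int concurrentUpdates(int threads) {
        ImmutableTopology topo = new ImmutableTopology();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int n = i;
            ts[i] = new Thread(() -> topo.updateSchema("table" + n, "definition" + n));
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return topo.current().tables.size();
    }

    public static void main(String[] args) {
        System.out.println("tables after concurrent updates: " + concurrentUpdates(8));
    }
}
```

As the comment says, the hard part in reality is the merge step: the map union here stands in for correctly reconciling overlapping schema changes, including those notified by other JVMs.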
The problem that I see with the above is that the If I understand your solution correctly, Postgres schema changes will be able to occur simultaneously, thus causing deadlocks?
Yes, Postgres schema changes may then be done concurrently by multiple threads, and Postgres may detect deadlocks between the corresponding transactions. The question is whether this causes additional problems in the applications. I see the following points as to why this shouldn't be the case:
The need to program the application so that an sqlg transaction is repeated if the commit fails is certainly inconvenient, but we found it necessary even when the schema no longer needed updates. Deadlocks because of schema updates may cause some additional commit exceptions, but this shouldn't really matter, given that schema updates are typically rare events. My bottom line is: this bug is about an unexpected deadlock, but the occurrence of deadlocks (or at least conflicts) between concurrent threads can't be completely avoided; you have to accept the need to handle them in the application code. This raises the question:
Hmm, I suppose the initial topology lock was precisely to prevent the application from having to code retry logic. With topology changes it happens so often that it is a glorious pain to deal with in the application. Regarding multiple JVMs: Sqlg, when in distributed mode, takes a My strategy is/was to only serialize topology changes, with no further attempts at trying to control Postgres locks. In a way it's an attempt to get TinkerPop's NoSql part to behave almost NoSql-like. Unfortunately it's throwing us a curve ball now. Regarding Postgres detecting the deadlock: does it sacrifice the transaction immediately? I have not had this problem for a while, as the current Sqlg solution works for my particular use-case. What I recall, however, is Postgres simply hanging with a bunch of pids waiting on other pids.
I'll investigate removing the topology lock (probably making it optional), as too much of my own code, and probably others', depends on it (not having needed any retry logic). But yes, I agree: first prize is no attempt at locking from Sqlg's side.
Did some testing on the current way, for context. The first test does what Sqlg does in the second test, but using pure SQL with no locking.
This test almost always throws a deadlock exception from Postgres.
This one passes, as the topology lock prevents the deadlock. It seems to me changing Sqlg to have no topology lock will have a serious impact on existing clients, as chances are they are relying on the lock, having neither created schemas upfront nor coded retry logic. I am first going to try to reproduce deadlocking in the current implementation and see if there is a way to detect it and throw a deadlock exception faster than the current 2 minutes. This should not happen too often, as it's only when there is a topology change and still another write thread active. Then I'll investigate making the topology lock optional, or perhaps removing it altogether. Thoughts are most welcome.
I added deadlock detection code on the topology. So if the topology lock cannot be taken because a write thread is waiting on the current thread, a deadlock exception is thrown. This happens in millis, so it is better than waiting 2 minutes for a timeout. I also fixed the handling of timeouts; it was not throwing exceptions.
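A sketch of the fast-failure idea (all names hypothetical; this is not Sqlg's actual detection code): the graph tracks which threads have uncommitted data writes. If the current thread asks for the topology lock while it has uncommitted writes of its own and another write thread is still in flight, that other thread may already be blocked inside Postgres on our uncommitted rows, so we throw immediately instead of waiting out the full timeout.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockDetectingTopology {

    public static class DeadlockDetectedException extends RuntimeException {
        public DeadlockDetectedException(String message) { super(message); }
    }

    private final ReentrantLock topologyLock = new ReentrantLock(true);
    private final Set<Thread> uncommittedWriters = ConcurrentHashMap.newKeySet();

    public void registerDataWrite() { uncommittedWriters.add(Thread.currentThread()); }
    public void commit()            { uncommittedWriters.remove(Thread.currentThread()); }

    public void lockForTopologyChange() {
        boolean weHaveUncommittedWrites = uncommittedWriters.contains(Thread.currentThread());
        boolean anotherWriterInFlight = uncommittedWriters.stream()
                .anyMatch(t -> t != Thread.currentThread());
        if (weHaveUncommittedWrites && anotherWriterInFlight) {
            // fail in millis rather than after the 2 minute timeout
            throw new DeadlockDetectedException(
                    "another write thread may be waiting on this thread's uncommitted rows");
        }
        topologyLock.lock();
    }

    public void unlockTopologyChange() { topologyLock.unlock(); }

    // Demo: two threads with uncommitted writes; the topology change fails fast.
    public static boolean detectsDeadlock() {
        DeadlockDetectingTopology topo = new DeadlockDetectingTopology();
        CountDownLatch otherRegistered = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        Thread other = new Thread(() -> {
            topo.registerDataWrite();        // simulates an in-flight writer
            otherRegistered.countDown();
            try { done.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            topo.commit();
        });
        other.start();
        boolean detected = false;
        try {
            otherRegistered.await();
            topo.registerDataWrite();        // this thread has written too
            topo.lockForTopologyChange();    // should throw, not block
            topo.unlockTopologyChange();
        } catch (DeadlockDetectedException e) {
            detected = true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            topo.commit();
            done.countDown();
        }
        try { other.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return detected;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected fast: " + detectsDeadlock());
    }
}
```

The check is conservative: it may throw in cases where no actual Postgres wait exists, trading occasional false positives for never hanging.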
Say, is this fix working for you? Any more deadlocks? I'll create a separate ticket about refactoring the locking strategy to no longer take a schema change lock in Java.
Yes, from our end we're fine with this fix. We understand what it fixes and what it doesn't, and how to work around the cases it does not, so it can be merged.
Ok great, tx. I'll create another ticket for removing the lock, but alas I won't be working on it anytime soon. I pretended to start and quickly stopped, as it affects the whole strategy.
Unfortunately we ran into deadlock issues again. The setup is as follows: we have multiple threads writing into the DB using the same SqlgGraph. Sometimes these write operations have to modify the underlying schema, and we get into a deadlock where one thread has done some uncommitted writes, thus holds some Postgres locks, and then wants to do schema changes, so it requires the topology lock. At the same time, another thread has made some topology changes and got the lock, but it cannot write the data to Postgres because the table is locked by the first thread. So one thread holds the DB locks and the other the Java locks, and each waits for the other.
Tests to demonstrate the issues
(vertex attributes)
and
(edges vertices)
One of my colleagues suggested replacing the topology lock with either immutable structures or transaction-specific topology items, so that no topology lock would be required during the write transaction, and only when the transaction commits would the topology changes become visible to other threads (here we could take a lock to update the generic topology and ensure all changes that happened are properly merged with other changes that happened concurrently). I'm not sure how workable that is. But I suppose between distributed graphs we have no topology locks anyway.
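The transaction-specific idea can be sketched like this (names hypothetical, the topology reduced to a label map for illustration): each transaction accumulates schema changes in a thread-local buffer, so no topology lock is held while writing; only commit() merges the buffer into the shared immutable snapshot, at which point the changes become visible to other threads.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class TxLocalTopology {

    private final AtomicReference<Map<String, String>> shared =
            new AtomicReference<>(Map.of());
    private final ThreadLocal<Map<String, String>> pending =
            ThreadLocal.withInitial(HashMap::new);

    // visible only to the calling transaction until commit
    public void ensureLabel(String label, String definition) {
        pending.get().put(label, definition);
    }

    // this transaction's view: shared snapshot plus its own pending changes
    public Map<String, String> view() {
        Map<String, String> v = new HashMap<>(shared.get());
        v.putAll(pending.get());
        return v;
    }

    public void commit() {
        final Map<String, String> mine = new HashMap<>(pending.get());
        shared.updateAndGet(current -> {
            Map<String, String> merged = new HashMap<>(current);
            merged.putAll(mine);              // merge with concurrently committed changes
            return Map.copyOf(merged);
        });
        pending.remove();
    }

    // Demo: a pending label is invisible to other threads until commit.
    // Returns {visibleBeforeCommit, visibleAfterCommit}.
    public static boolean[] isolationDemo() {
        TxLocalTopology topo = new TxLocalTopology();
        topo.ensureLabel("Person", "name:string");
        boolean[] result = new boolean[2];
        Thread before = new Thread(() -> result[0] = topo.view().containsKey("Person"));
        runAndJoin(before);
        topo.commit();
        Thread after = new Thread(() -> result[1] = topo.view().containsKey("Person"));
        runAndJoin(after);
        return result;
    }

    private static void runAndJoin(Thread t) {
        t.start();
        try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        boolean[] r = isolationDemo();
        System.out.println("visible before commit: " + r[0] + ", after commit: " + r[1]);
    }
}
```

The open problem this sketch glosses over is the database side: the DDL the transaction executed against Postgres still holds Postgres-level locks until commit, so the Java-side isolation alone does not eliminate deadlocks there.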
One workaround is to have several graphs (one graph per thread, no sharing). Is that the way forward, or should we try to fix the locking?
Another workaround is to try to make all schema changes upfront, before any write operation, but it's not certain that in every use case we'll always know the schema we need, and we may create a lot of attributes or edges that are never used (and end up with edge tables that are much wider than needed).
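The upfront workaround amounts to a two-phase batch (a sketch under assumptions: `createLabel` is a stand-in for whatever schema-creation call the graph exposes, and rows are reduced to maps): scan the input once to collect every label the batch will need, create them all in a single thread, then fan out the data writes. Since the schema is fixed before any writer starts, no topology change happens mid-write.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class UpfrontSchema {

    // phase 1: derive the full label set from the batch
    public static Set<String> requiredLabels(List<Map<String, String>> rows) {
        Set<String> labels = new TreeSet<>();
        for (Map<String, String> row : rows) {
            labels.add(row.get("label"));
        }
        return labels;
    }

    public static Set<String> createSchemaUpfront(List<Map<String, String>> rows) {
        Set<String> labels = requiredLabels(rows);
        for (String label : labels) {
            createLabel(label); // single-threaded: no concurrent topology change possible
        }
        return labels;
        // phase 2 (not shown): parallel data writes against the now-fixed schema
    }

    // stand-in for a real schema-creation call against the database
    private static void createLabel(String label) {
        System.out.println("creating label " + label);
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
                Map.of("label", "Person", "name", "alice"),
                Map.of("label", "Address", "city", "paris"),
                Map.of("label", "Person", "name", "bob"));
        System.out.println("labels: " + createSchemaUpfront(rows));
    }
}
```

This carries exactly the trade-off noted above: labels are created even for data that may never materialize, which can leave tables wider than strictly needed.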