-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: remove race condition bug in refresh logic #1390
Conversation
} | ||
|
||
/** A constructor for use in tests that allows the caller to set the RateLimiter. */ | ||
CloudSqlInstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just update the constructor to pass in the rate limiter. We don't need two constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer 2 constructors. I don't want to expose a public method to construct the default rate limiter just so that CoreSocketFactory can construct a CloudSqlInstance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a rule, I think we shouldn't be making test only APIs. And parameterizing the rate limiter is actually useful. Imagine if we wanted to expose rate limiting knobs to the user, or configure instances with different rate limits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. I'm not convinced, but it's not worth arguing about.
return; | ||
} | ||
|
||
forceRefreshRunning = true; | ||
nextInstanceData.cancel(false); | ||
if (nextInstanceData.isCancelled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove this if block and always schedule a next refresh since we also always call cancel? This after having observed that cancel doesn't always work as we might think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. The if (isCanceled())
is now irrelevant since now we use the forceRefreshRunning
field instead.
FYI: cancel works as expected: nextInstanceData.cancel(false)
will cancel the task if it is queued, but not started. It will do nothing and return false if the task is running or done.
@@ -120,9 +120,11 @@ public static synchronized CoreSocketFactory getInstance() { | |||
@VisibleForTesting | |||
// Returns a listenable, scheduled executor that exits upon shutdown. | |||
static ListeningScheduledExecutorService getDefaultExecutor() { | |||
// TODO(kvg): Figure out correct way to determine number of threads | |||
// Set the number of threads to 8. This should be enough threads that most users |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's split this out into a separate PR since it's technically the fix of the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. See #1391
core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceConcurrencyTest.java
Outdated
Show resolved
Hide resolved
newRateLimiter()); | ||
|
||
assertThat(supplier.counter.get()).isEqualTo(0); | ||
Thread.sleep(500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we avoid a sleep to ensure our test remains solid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is like an integration test of the ScheduledExecutorService. So this test is measuring how this JVM library class behaves when time passes. In this case, there is no way to mock the passing of time.
core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceConcurrencyTest.java
Outdated
Show resolved
Hide resolved
core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceConcurrencyTest.java
Outdated
Show resolved
Hide resolved
core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceConcurrencyTest.java
Outdated
Show resolved
Hide resolved
core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceConcurrencyTest.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
/** A constructor for use in tests that allows the caller to set the RateLimiter. */ | ||
CloudSqlInstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a rule, I think we shouldn't be making test only APIs. And parameterizing the rate limiter is actually useful. Imagine if we wanted to expose rate limiting knobs to the user, or configure instances with different rate limits.
forceRefreshRunning = true; | ||
nextInstanceData.cancel(false); | ||
logger.fine( | ||
"Force Refresh: the next refresh operation was cancelled." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add the instance to the log message in leading brackets as we've done elsewhere. You might even make a convenience method for it that all methods (or even classes) can use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
logger.info( | ||
String.format( | ||
"[%s %d] INSTANCE DATA DONE", | ||
instanceName, Thread.currentThread().getId())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave the thread ID off -- unless it's worth keeping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto with below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -152,14 +151,22 @@ public InstanceData getInstanceData( | |||
.orElse(x509Certificate.getNotAfter()); | |||
} | |||
|
|||
logger.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should all be logger.fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
5f1a399
to
2c3d6cc
Compare
a5958e4
to
875cde6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can test this directly. Let's talk tomorrow about how to do that.
@GuardedBy("instanceDataGuard") | ||
private boolean forceRefreshRunning; | ||
|
||
static final RateLimiter defaultRateLimiter() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just pass this in and not expand the public API here. As an example, we pass in a RateLimiter in the AlloyDB Java Connector here: https://github.com/GoogleCloudPlatform/alloydb-java-connector/blob/main/alloydb-jdbc-connector/src/main/java/com/google/cloud/alloydb/Connector.java#L156.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
private final boolean flakey; | ||
|
||
private AtomicInteger counter = new AtomicInteger(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make this final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
private final boolean flakey; | ||
|
||
private AtomicInteger counter = new AtomicInteger(); | ||
private InstanceData response = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
null), | ||
new SslData(null, null, null), | ||
Date.from(Instant.now().plus(1, ChronoUnit.HOURS))); | ||
private final ListeningScheduledExecutorService executor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a local variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
private static class TestDataSupplier implements InstanceDataSupplier { | ||
|
||
private final boolean flakey; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flaky
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
core/src/test/java/com/google/cloud/sql/core/CloudSqlInstanceConcurrencyTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the new tests here add ~1 minute to our build. I think there's a way we can test this without the time increase. Let's talk about how when you're ready.
MockAdminApi mockAdminApi = new MockAdminApi(); | ||
ListenableFuture<KeyPair> keyPairFuture = | ||
Futures.immediateFuture(mockAdminApi.getClientKeyPair()); | ||
ListeningScheduledExecutorService executor = newTestExecutor(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since newTestExecutor is a call through to the actual executor, shall we just inline this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Method inlined.
int c = counter.incrementAndGet(); | ||
Thread.sleep(100); | ||
|
||
if (flaky && c % 2 == 0 && c > 10) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had added the greater than 10 in to allow Hikari to establish two pools of 5 connections each successfully before simulating flaky behavior. Does this need to be here for test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. I removed it.
// Attempt to retrieve data, ensure we wait for success | ||
ListenableFuture<List<Object>> allData = | ||
Futures.allAsList( | ||
executor.submit(instance::getSslData), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the significant of three here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three would make multiple, simultaneous calls to getSslData().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why three? Why not four, or two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just needs to be more than one. And three is a magic number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use two in that case, and add a comment. I'm asking for my personal interest, but also for future readers who will wonder the same.
executor.submit(instance::forceRefresh)); | ||
try { | ||
all.get(); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we silently squashing the exception here? Should we add it to the test signature instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this method significantly and added comments. Please take another look.
allData2.get(); | ||
} catch (Exception e) { | ||
} | ||
assertThat(supplier.counter.get()).isEqualTo(2 + i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the supplier equal to 2 plus i
? Shouldn't the supplier be called once for every completed force refresh?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this should be more clear now.
} | ||
|
||
@Test | ||
public void testCloudSqlInstanceCorrectlyRefreshesInstanceData() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different from a happy path test we have elsewhere? Do we need it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the happy path without any concurrency. Before I changed the mainline code, this test passed and the other tests with more concurrency failed. It was an important test to demonstrate that the bad behavior was caused by concurrency bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this wasn't used to drive out the bug, I think we should check that we have similar coverage elsewhere (I believe we do) and omit this one here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not have exactly this same coverage elsewhere. This was used to drive out the bug. I would like to leave this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test is just the same as this test. In my view, the assertions don't communicate what different behavior we have.
newRateLimiter()); | ||
|
||
Thread.sleep(500); | ||
Assert.assertThrows(RuntimeException.class, () -> instance.getSslData()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here? How does this test differ from other tests we already have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key difference is that this one throws an exception from within a scheduled executor task. This used to exacerbate the deadlock problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this behavior covered by testCloudSqlInstanceRefreshesConsistentlyWithoutRaceConditions
? It looks like a duplicate test to my eyes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test has a single CloudSqlInstance and ensures that it successfully refreshes the correct number of times. Thus proving that the balking behavior of forceRefresh() works right.
testCloudSqlInstanceRefreshesConsistentlyWithoutRaceConditions
tries to cause a race condition and deadlock with multiple CloudSqlInstances and many more threads. It is fundimentally a different test.
While developing, if there was a bug in my code, one of these tests would fail and the other would pass. So I think they are both useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now this test proves that the CloudSqlInstance calls the underlying InstanceDataSupplier, which is a variation of this test. If you'd like to test that force refresh functions correctly, then I'd expect a call that fails, a force refresh, and then a call that succeeeds.
} | ||
|
||
// Get SSL Data for each instance, forcing the first refresh to complete. | ||
instances.forEach(i -> i.getSslData()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified to a lambda reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the CloudSqlInstanceConcurrencyTest based on comments.
int c = counter.incrementAndGet(); | ||
Thread.sleep(100); | ||
|
||
if (flaky && c % 2 == 0 && c > 10) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. I removed it.
MockAdminApi mockAdminApi = new MockAdminApi(); | ||
ListenableFuture<KeyPair> keyPairFuture = | ||
Futures.immediateFuture(mockAdminApi.getClientKeyPair()); | ||
ListeningScheduledExecutorService executor = newTestExecutor(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Method inlined.
// Attempt to retrieve data, ensure we wait for success | ||
ListenableFuture<List<Object>> allData = | ||
Futures.allAsList( | ||
executor.submit(instance::getSslData), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three would make multiple, simultaneous calls to getSslData().
executor.submit(instance::forceRefresh)); | ||
try { | ||
all.get(); | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this method significantly and added comments. Please take another look.
allData2.get(); | ||
} catch (Exception e) { | ||
} | ||
assertThat(supplier.counter.get()).isEqualTo(2 + i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this should be more clear now.
} | ||
|
||
@Test | ||
public void testCloudSqlInstanceCorrectlyRefreshesInstanceData() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the happy path without any concurrency. Before I changed the mainline code, this test passed and the other tests with more concurrency failed. It was an important test to demonstrate that the bad behavior was caused by concurrency bugs.
newRateLimiter()); | ||
|
||
Thread.sleep(500); | ||
Assert.assertThrows(RuntimeException.class, () -> instance.getSslData()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key difference is that this one throws an exception from within a scheduled executor task. This used to exacerbate the deadlock problem.
} | ||
|
||
// Get SSL Data for each instance, forcing the first refresh to complete. | ||
instances.forEach(i -> i.getSslData()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// Attempt to retrieve data, ensure we wait for success | ||
ListenableFuture<List<Object>> allData = | ||
Futures.allAsList( | ||
executor.submit(instance::getSslData), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why three? Why not four, or two?
} | ||
} | ||
|
||
@Test(timeout = 45000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. If there is a deadlock, then the test will only fail with a timeout. So we should leave this in just incase we cause a deadlock again in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to wait 45 seconds before failing? What's the default value?
} | ||
|
||
@Test | ||
public void testCloudSqlInstanceCorrectlyRefreshesInstanceData() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this wasn't used to drive out the bug, I think we should check that we have similar coverage elsewhere (I believe we do) and omit this one here.
newRateLimiter()); | ||
|
||
Thread.sleep(500); | ||
Assert.assertThrows(RuntimeException.class, () -> instance.getSslData()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this behavior covered by testCloudSqlInstanceRefreshesConsistentlyWithoutRaceConditions
? It looks like a duplicate test to my eyes.
// exception after the first refresh attempt fails. | ||
} | ||
|
||
@Test(timeout = 45000) // 45 seconds timeout in case of deadlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here. Do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. If this test deadlocks, then the test will fail with a timeout. Otherwise it will hang indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here -- can we bump this number down? Can we rely on the default value? I assume the test framework does a default timeout (I think it's 30s).
} | ||
} | ||
|
||
@Test(timeout = 45000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to wait 45 seconds before failing? What's the default value?
// Attempt to retrieve data, ensure we wait for success | ||
ListenableFuture<List<Object>> allData = | ||
Futures.allAsList( | ||
executor.submit(instance::getSslData), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use two in that case, and add a comment. I'm asking for my personal interest, but also for future readers who will wonder the same.
// We expect some of these to throw an exception indicating that the refresh cycle | ||
// got a failed attempt. When they throw an exception, | ||
// sleep and try again. This shows that the refresh cycle is working. | ||
Thread.sleep(100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add a log statement here too?
} | ||
|
||
@Test | ||
public void testCloudSqlInstanceCorrectlyRefreshesInstanceData() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test is just the same as this test. In my view, the assertions don't communicate what different behavior we have.
keyPairFuture, | ||
newRateLimiter()); | ||
|
||
Thread.sleep(500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the sleep here?
newRateLimiter()); | ||
|
||
Thread.sleep(500); | ||
Assert.assertThrows(RuntimeException.class, () -> instance.getSslData()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now this test proves that the CloudSqlInstance calls the underlying InstanceDataSupplier, which is a variation of this test. If you'd like to test that force refresh functions correctly, then I'd expect a call that fails, a force refresh, and then a call that succeeeds.
// exception after the first refresh attempt fails. | ||
} | ||
|
||
@Test(timeout = 45000) // 45 seconds timeout in case of deadlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here -- can we bump this number down? Can we rely on the default value? I assume the test framework does a default timeout (I think it's 30s).
Update the logic in forceRefresh() to reduce the churn on the thread pool when the certificate refresh API calls are failing.
New forceRefresh() logic ensures that:
Add new test cases to validate race conditions, deadlocks, and concurrency.
Add additional logging to help diagnose production problems with certificate refresh.
Related to #1314
Fixes #1209
Fixes #1159