bad_certificate errors intermittently preventing connection to cloud sql #1314
Comments
I'm investigating this error. I agree that it may be related to #1307, but it may also be a separate issue. A number of cases could cause this, especially if the app runs for a few hours and the ephemeral certificate has to be refreshed several times.
I have finished the work to shore up certificate refresh bugs; see #1307. Please try out the next version of the JDBC Connector and see if the problem goes away.
FYI: We'll be cutting a release on Tuesday.
Hello @hessjcg. I was just about to open up a new issue when I saw this one, we are seeing this exact issue (same stacktrace, with error starting after app has run for a few hours) on both version We are suspecting that this commit in some way changed the behaviour of the certificate refresh logic to cause this error described by @msammarco, at least we can't find another relevant commit between versions Is this something you are able to look into? I currently have not reproduced it locally, although I could try. |
As mentioned by @Ragge-dev, we also experience the exact same problem. Version |
Let's re-open this issue in that case. We'll investigate. |
@Ragge-dev and @tomirio619 I understand it takes a few hours for this to manifest. How frequently do you see it and does it affect application connectivity? |
After testing, I found that the version in which the problem was introduced is |
@enocom Sorry for the delay, what @tomirio619 describes is exactly our experience as well. The error always occurs a few hours after the app has started. |
Where are you running the Java Connector? Is this all in GKE? |
FYI I've added some debug logging here: #1348. This should at least make it clear what's going wrong with our refresh operations. |
We'll cut a release tomorrow which might fix this, as we've made some further improvements to the factoring of our background refresh operations and the logging will be a part of it. Meanwhile, we'll work on reproducing the issue or confirming it's been fixed. |
Yes |
Great, I will test it when it comes out |
I ran a simple Java application in GKE that connects and runs a query every minute. I was unable to reproduce this exact error, but found an occasional problem during refresh where the root cause was: I will continue to run tests to see if this can be reproduced. A few more things I will try:
Thank you for continuing to look into this! I tried an app yesterday with the latest version (1.13.0) and the problem (bad certificate) appeared after ~4 hours, so it's still there. I'll try again today to see if I can see your debug logging, but I'll be leaving for vacation after this week, so this will be my last attempt for a while. One detail that may be relevant, and that I realize I have not shared: we only experience this issue on apps which are connected to multiple databases (postgres in this case). Any app connected to only one database works fine with the latest versions.
Thank you for sharing that important detail. We will try to reproduce with two databases today. Are these apps connecting to multiple databases on the same Cloud SQL Postgres instance, or are they connecting to multiple Cloud SQL Postgres instances? |
I imagine we can reproduce this with more connection pools to the same instance. For now, it's best to stick with v1.11.1 and we'll work to fix this and cut a patch release as soon as we're convinced we've got the problem solved. |
I've been running a little Spring Boot app that connects to two databases in one instance in my GKE cluster. My app uses two separate connection pools and has Hikari set up to refresh all connections every minute (with the thought that connection creation will trigger this bug). After half a day, I don't see the bad certificate yet. I'm going to downgrade this to a p2 to signal that this bug isn't as pervasive as we originally thought. But we're still working on identifying the root cause.
Interesting, in our case the affected app is also connecting to multiple different CloudSQL instances |
To refresh a Cloud SQL Instance's certificates, the current algorithm uses 2 threads from the thread pool for each instance. Because the thread pool has a fixed size, users who configured 2 or more instances were at risk of thread starvation and a deadlock. This change increases the thread count to a safe level so that most users will never experience a deadlock. Fixes #1314
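The starvation described above can be illustrated with a small standalone model. This is only a sketch under assumptions, not the connector's code: the class name PoolStarvationSketch, the method refreshesComplete, and the "cert-N" values are hypothetical. Each simulated refresh holds one pool thread while it waits for a second task submitted to the same fixed-size pool, which is the "2 threads per instance" shape. A latch forces all outer tasks to start first so the outcome is repeatable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of "2 pool threads per instance"; not the connector's real code.
class PoolStarvationSketch {

  /** Returns true if every simulated refresh finishes before the timeout expires. */
  static boolean refreshesComplete(int poolSize, int instances, long timeoutMs)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    // Makes every outer task start before any inner task is submitted.
    CountDownLatch allStarted = new CountDownLatch(instances);
    try {
      List<Future<String>> outers = new ArrayList<>();
      for (int i = 0; i < instances; i++) {
        final int id = i;
        // Outer task: occupies one pool thread while it blocks on an inner task
        // that needs a second thread from the same pool.
        outers.add(pool.submit(() -> {
          allStarted.countDown();
          allStarted.await();
          Future<String> inner = pool.submit(() -> "cert-" + id);
          return inner.get();
        }));
      }
      long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      for (Future<String> outer : outers) {
        long remaining = Math.max(deadline - System.nanoTime(), 1);
        try {
          outer.get(remaining, TimeUnit.NANOSECONDS);
        } catch (TimeoutException e) {
          return false; // no free thread was left to run the inner tasks
        }
      }
      return true;
    } finally {
      pool.shutdownNow(); // interrupts any tasks that are still blocked
    }
  }
}
```

In this model, a pool of 2 threads with 2 instances stalls because both threads are held by outer tasks, while a pool of 4 threads with the same 2 instances completes. That mirrors why raising the thread count makes the deadlock unlikely, although it does not remove the underlying pattern.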
Hey folks, we found an edge case where the connector when configured with multiple instances would flood the thread pool in such a way that the threads would be unable to progress. We've identified a few related improvements in #1391 and #1390. We'll cut a patch release this week for people affected by this. |
Update the logic in forceRefresh() to reduce the churn on the thread pool when the certificate refresh API calls are failing. New forceRefresh() logic ensures that: Only 1 refresh cycle may run at a time. If a refresh cycle is in progress, then it will not be canceled until it succeeds. Add new test cases to validate race conditions, deadlocks, and concurrency. Add additional logging to help diagnose production problems with certificate refresh. Related to #1314 Fixes #1209 Fixes #1159
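The two guarantees listed for forceRefresh() (only one refresh cycle at a time, and a running cycle is not canceled until it succeeds) can be sketched as a single-flight pattern. This is a minimal illustration under assumptions, not the connector's implementation: the class SingleFlightRefresher, the fetchCertificate supplier, and the cyclesStarted counter are hypothetical names, and the retry loop omits the backoff and logging that production code would need.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical single-flight refresher; names and structure are assumptions.
class SingleFlightRefresher {
  private final Object lock = new Object();
  private final Executor executor;
  private final Supplier<String> fetchCertificate; // stand-in for the refresh API call
  private final AtomicInteger cyclesStarted = new AtomicInteger();
  private CompletableFuture<String> inFlight; // guarded by lock

  SingleFlightRefresher(Executor executor, Supplier<String> fetchCertificate) {
    this.executor = executor;
    this.fetchCertificate = fetchCertificate;
  }

  /** Starts a refresh cycle, or joins the one already running. */
  CompletableFuture<String> forceRefresh() {
    synchronized (lock) {
      if (inFlight != null && !inFlight.isDone()) {
        return inFlight; // a cycle is in progress: do not cancel or restart it
      }
      cyclesStarted.incrementAndGet();
      inFlight = CompletableFuture.supplyAsync(this::refreshUntilSuccess, executor);
      return inFlight;
    }
  }

  /** Number of refresh cycles that were actually started. */
  int cyclesStarted() {
    return cyclesStarted.get();
  }

  // Retries inside one cycle until the fetch succeeds, so the cycle uses a
  // single pool task no matter how many times the API call fails.
  private String refreshUntilSuccess() {
    while (true) {
      try {
        return fetchCertificate.get();
      } catch (RuntimeException e) {
        // A real implementation would log the failure and back off here.
      }
    }
  }
}
```

Because repeated callers receive the same future while a cycle is running, failing API calls no longer translate into a growing number of queued tasks, which is the thread pool churn the change aims to reduce.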
This has been fixed in v1.13.1. Thank you for all your help debugging this, folks.
Bug Description
When using hikari database connection pooling with cloud sql, we're intermittently seeing an issue, usually after a few hours, where connections become unavailable with a bad_certificate error. We see this with mysql and postgres cloud sql connections authenticating with username/password, and with differing hikari configurations (most set with the default hikari maxLifetime of 30 mins).
Example code (or command)
No response
Stacktrace
Steps to reproduce?
Environment
Additional Details
My initial reaction was it looks similar to the closed issue here #472
And is possibly related to this Draft PR.
My colleague has opened an incident with google support already for this issue, but I think it might be worthwhile to open an issue here too. Case 45213975