Unable to connect to CloudSQL databases after certificates expire, application can't recover #2059
Comments
Roy, this is definitely not the expected behavior. Thank you for the thorough description! Based on your description, it seems that the task scheduler that runs the refresh operations gets into a state where it doesn't work correctly. As a workaround, you could try configuring your application to use the "Lazy Refresh" strategy instead of the default "Refresh Ahead" strategy. When the "Lazy Refresh" strategy is enabled, the connector checks and refreshes the certificate while opening a database connection, instead of using a scheduled task. See Refresh Strategy for Serverless Compute. To enable the lazy refresh strategy, add the JDBC connection property described in that section. Meanwhile, we will investigate what changed between release 1.14.1 and 1.18.0 for clues as to why the connector is now less reliable for you. -Jonathan |
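For readers following along, here is a minimal sketch of what "check and refresh while opening a connection" could look like, as opposed to a background scheduled task. It is only an illustration: `LazyCertificateHolder`, `Certificate`, and the `fetcher` function are hypothetical names invented for this example, not the connector's actual classes, and the real implementation may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Function;

/** Hypothetical illustration of a lazy refresh; not the connector's real code. */
final class LazyCertificateHolder {

    /** Hypothetical stand-in for an ephemeral certificate; only its expiry matters here. */
    static final class Certificate {
        final Instant notAfter;

        Certificate(Instant notAfter) {
            this.notAfter = notAfter;
        }
    }

    private final Function<Instant, Certificate> fetcher; // assumed refresh call
    private final Duration refreshBuffer;                 // how early to refresh before expiry
    private Certificate current;

    LazyCertificateHolder(Function<Instant, Certificate> fetcher, Duration refreshBuffer) {
        this.fetcher = fetcher;
        this.refreshBuffer = refreshBuffer;
    }

    /** Called on every connection attempt; no background scheduler is involved. */
    synchronized Certificate get(Instant now) {
        boolean missing = current == null;
        // Refresh once "now" reaches (expiry - buffer), or if nothing was fetched yet.
        if (missing || !now.isBefore(current.notAfter.minus(refreshBuffer))) {
            current = fetcher.apply(now);
        }
        return current;
    }
}
```

With a fetcher that issues one-hour certificates and a 4-minute buffer, a call at minute 30 reuses the existing certificate, while a call at minute 57 triggers a refresh before the connection is opened. Because nothing runs between connection attempts, this approach does not depend on a scheduler staying healthy while the application idles.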
Hey Jonathan, Thanks for the workaround, I'll try it out. Is there any log message that I should see to know for sure that it is now using the lazy strategy? (Just to make sure my connection property is in the right place) Sincerely, Roy. |
Is the CPU throttled on your workloads where you see this issue? We have encountered similar problems in other serverless runtimes where the available CPU is extremely limited when the pod is not serving a request. I wonder if your configuration does something similar. |
We've been running the 'lazy' workaround now for a week on two services: one with 2 pods and a low CPU request (100m) and one with 4 pods and a high CPU request (8000m). Both services again ran into the same problem, so I now assume that the amount of CPU requested is not the issue. I do think the behavior is slightly different now. Since August 26th everything was running fine. On Friday the 29th the database connection had intermittent problems for two hours, but it did recover. Today (again 3 days later) I again see intermittent database connection problems, but the app also recovers. I did some more digging: At first I regularly see a connection being made successfully.
After a while I might see one or two connections fail.
I then see that the certificate is refreshed.
Soon after that all connections fail:
After a few minutes the application recovers. Only after 30 minutes do I see new connection errors. This pattern repeats a couple of times, after which the application stays stable for a couple of days. I wonder if something special is going on with connection pooling and the time at which certificates need to be refreshed? Are they maybe valid for a couple of days, and do I then see multiple threads in the pool refresh, fail, and recover? |
Hi @roy-t, I think we have figured out the issue. When you use Cloud Run, the IAM auth token is valid for 30 minutes, while the Cloud SQL client certificate is valid for 1 hour. When the connector attempts to refresh the token, it somehow puts an expired IAM auth token into the client certificate. I was not able to reproduce exactly this behavior, but I was able to reproduce a related situation, where the connector refreshes the certificate 4 minutes before the auth token expires but uses the same almost-expired auth token. This causes the refresh operation to repeat rapidly until the IAM auth token has fully expired. I have updated the connector to force the refresh of the IAM auth token before it attempts a certificate refresh. See #2063. This behavior is more sane, and should avoid the race conditions that cause the refresh operation to thrash and possibly lock up entirely. |
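To make the thrashing easier to picture, here is a simplified timing model of the situation described above. It is a sketch under stated assumptions, not the connector's code: `RefreshTiming` and its methods are hypothetical names, and the model assumes the next refresh is scheduled 4 minutes before whichever of the certificate or the embedded auth token expires first.

```java
import java.time.Duration;
import java.time.Instant;

/** Simplified, hypothetical model of the refresh timing; not the connector's real code. */
final class RefreshTiming {
    static final Duration REFRESH_BUFFER = Duration.ofMinutes(4);

    /** Assumption: a certificate is only usable until the earlier of the two expiries. */
    static Instant effectiveExpiry(Instant certNotAfter, Instant tokenExpiry) {
        return certNotAfter.isBefore(tokenExpiry) ? certNotAfter : tokenExpiry;
    }

    /** Delay until the next refresh; zero means "refresh again immediately". */
    static Duration nextRefreshDelay(Instant now, Instant certNotAfter, Instant tokenExpiry) {
        Instant refreshAt = effectiveExpiry(certNotAfter, tokenExpiry).minus(REFRESH_BUFFER);
        Duration delay = Duration.between(now, refreshAt);
        return delay.isNegative() ? Duration.ZERO : delay;
    }
}
```

Suppose the token was issued at minute 0 (expiring at minute 30) and a refresh runs at minute 26. If the new one-hour certificate reuses the old token, the effective expiry is still minute 30, so the computed delay is zero and the refresh fires again straight away, repeating until the token has expired. If the token is refreshed first (now expiring at minute 56), the delay becomes 26 minutes and the loop disappears.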
I have an additional hypothesis about what is happening behind the scenes that I would like to investigate. This could affect both IAM AuthN and password authentication. Perhaps the Google Cloud Authentication library has a bug that affects Cloud Run, GKE, and maybe some other runtimes. The methods |
Note that in @roy-t's case, he's running on GKE (Kubernetes). |
…2063) Refresh tokens and certificates 4 minutes before they expire to avoid a race condition that would allow the connector to create an ephemeral certificate with an expired auth token. IAM auth tokens are now refreshed 4 minutes before they expire. Also, the Lazy Refresh Strategy will refresh the client certificate 4 minutes before the expiration of the certificate and the IAM auth token. This should mitigate some of the strange certificate expiration errors commonly found in Cloud Run, see: #2059
@wleese, thanks for pointing that out. The Auth client library bug may affect more than just Cloud Run. |
The release is done. Please upgrade to v1.20.1 or higher. |
Great to hear you might have found it. I've overridden the version in my pom to use v1.20.1 and I'll report back:
|
@julie0005 Can you please open a separate issue on this repo? Your issue seems unrelated to the original one this thread was created for. This will make it easier for us to help individual issues and not cause confusion to future users who may reference these in the future to get help 😄 |
In the week from September 2 to September 9 we had 250 JDBC related error messages in our service. When investigating the 3 JDBC messages that still happened last week, I couldn't place them near any certificate expiration/refresh; it might just be a weird network thing, or something else. I'll continue monitoring! So to me it looks like v1.20.1 does solve the issue! |
Going to close this out as it seems based on the recent comment this has likely been resolved. If more related errors arise feel free to re-open the issue 😄 Thanks again for all the great insight @roy-t 👏 |
Bug Description
Some of our applications are unable to connect to their CloudSQL database after running for a few days. They never recover until we restart them. We've enabled debug logging and suspect a problem with refreshing certificates.
We see that all applications continuously verify the certificate expiry:
At some point in time we see a refresh operation scheduled:
Most applications then refresh the certificate at the scheduled time, but a few applications fail to do this and break as seen above.
In the logs below it looks like the refresh never happened at 10:00:14Z, or at least that an old certificate is still lingering, because at 12:05:17 the application tries to use the certificate that expired 2 hours ago.
It looks like refreshing the certificate fails, or at least the application is still using the wrong one. After that, connections to the database fail.
We haven't been able to identify what is different between applications that work and applications that fail. In some cases even copies of applications (like one for testing and one for production) behave differently, with one working flawlessly for months while the other one fails every few days. One thing I have noticed is that these applications all have a relatively low load and can idle for hours before a new database interaction happens. But I'm not sure if that is a clue or just an unrelated factor.
We are currently using the following library versions:
We did not observe the issue when using:
We started seeing the issue when using:
I haven't been able to narrow it down further yet.
We found the following issue: #1314 which looks similar, but was fixed in jdbc-socket-factory-core:jar:1.13.1. We are currently using 1.18.0.
Example code (or command)
No response
Stacktrace
Steps to reproduce?
Unfortunately I don't have a good reproduction path. The issue seems to happen consistently with only a few of our applications, and only after they have been working well for a couple of days.
Environment
Additional Details
No response