Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to connect to CloudSQL databases after certificates expire, application can't recover #2059

Closed
roy-t opened this issue Aug 21, 2024 · 15 comments
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@roy-t
Copy link

roy-t commented Aug 21, 2024

Bug Description

Some of our applications are unable to connect to their CloudsQL database after running for a few days. They never recover until we restart them. We've enabled debug logging and suspect a problem with refreshing certificates.

We see that all applications continuously verify the certificate expiry:

[bolcom-pro-sre-health-538:europe-west4:sre-health] Now = 2024-07-30T09:04:39.959408416Z, Current client certificate expiration = 2024-07-30T10:04:14.941Z

At some point in time we see a refresh operation scheduled:

Jul 30, 2024 @ 11:04:39.553 DEBUG com.google.cloud.sql.core.RefreshAheadStrategy [bolcom-pro-sre-health-538:europe-west4:sre-health] Refresh Operation: Next operation scheduled at 2024-07-30T10:00:14Z.

Most applications then refresh the certificate at the scheduled time, but a few applications fail to do this and break as seen above.

In the logs below it looks like the refresh never happened at 10:00:14Z. Or at least that an old certificate is still lingering. Because at 12:05:17 the applications tries to use the certificate that expired 2 hours ago.

Jul 30, 2024 @ 12:05:17.156 DEBUG com.google.cloud.sql.core.RefreshAheadStrategy [bolcom-pro-sre-health-538:europe-west4:sre-health] Client certificate has expired. Starting next refresh operation immediately.
Jul 30, 2024 @ 12:05:17.156 DEBUG com.google.cloud.sql.core.RefreshAheadStrategy [bolcom-pro-sre-health-538:europe-west4:sre-health] Now = 2024-07-30T10:05:17.156356729Z, Current client certificate expiration = 2024-07-30T10:04:14.941Z

It looks like refreshing the certificate fails, or at least the application is still using the wrong one. As after connections to the database fail.

Jul 30, 2024 @ 12:05:17.157 DEBUG com.google.cloud.sql.core.Connector [10.25.8.40] Connecting to instance.
Jul 30, 2024 @ 12:05:17.172 DEBUG com.google.cloud.sql.core.Connector [10.25.8.40] Connected to instance successfully.
Jul 30, 2024 @ 12:05:17.174 DEBUG com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Cannot acquire connection from data source org.postgresql.util.PSQLException: The connection attempt failed. at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:358) at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54) at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:273) at org.postgresql.Driver.makeConnection(Driver.java:446) at org.postgresql.Driver.connect(Driver.java:298) at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:137) at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:360) at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:202) at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:461) at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:724) at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:703) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) at java.base/java.lang.Thread.run(Thread.java:1583) caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130) at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117) at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:365) at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:287) at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:204) at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172) at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1509) at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1480) at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1066) at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:162) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:129) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:114) at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:74) at org.postgresql.core.PGStream.receiveChar(PGStream.java:467) at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:683) at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:207) at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:262) ... 14 common frames elided
Jul 30, 2024 @ 12:05:17.174 DEBUG com.zaxxer.hikari.pool.PoolBase HikariPool-1 - Failed to create/setup connection: The connection attempt failed.

We haven't been able to identify what is different between applications that work and fail. In some cases even copies of applications (like one for testing and one for production) behave differently. With one working flawlessly for months, while the other one fails every few days. One thing I have noticed is that these applications all have a relatively low load and can idle for hours before a new database interaction happens. But I'm not sure if that is a clue or just an unrelated factor.

We are currently using the following library versions:

  • spring-cloud-gcp 5.4.1
  • spring-boot 3.3.1
  • HikariCP 5.1.0
  • com.google.cloud.sql:jdbc-socket-factory-core:jar:1.18.0

We did not observe the issue when using:

  • spring-cloud-gcp 4.9.4
  • spring-boot 3.1.9
  • HikariCP 5.0.1
  • com.google.cloud.sql:jdbc-socket-factory-core:jar:1.14.1

We started seeing the issue when using:

  • spring-cloud-gcp 5.4.0
  • spring-boot 3.2.6
  • HikariCP 5.0.1
  • com.google.cloud.sql:jdbc-socket-factory-core:jar:1.18.0

I haven't been able to narrow it down further yet.

We found the following issue: #1314 which looks similar, but was fixed in jdbc-socket-factory-core:jar:1.13.1. We are currently using 1.18.0.

Example code (or command)

No response

Stacktrace

om.zaxxer.hikari.pool.HikariPool HikariPool-1 - Cannot acquire connection from data source org.postgresql.util.PSQLException: The connection attempt failed. at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:358) at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:54) at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:273) at org.postgresql.Driver.makeConnection(Driver.java:446) at org.postgresql.Driver.connect(Driver.java:298) at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:137) at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:360) at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:202) at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:461) at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:724) at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:703) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) at java.base/java.lang.Thread.run(Thread.java:1583) caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130) at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117) at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:365) at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:287) at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:204) at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172) at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1509) at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1480) at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1066) at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:162) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:129) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:114) at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:74) at org.postgresql.core.PGStream.receiveChar(PGStream.java:467) at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:683) at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:207) at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:262) ... 14 common frames elided
Jul 30, 2024 @ 12:05:17.174 DEBUG com.zaxxer.hikari.pool.PoolBase HikariPool-1 - Failed to create/setup connection: The connection attempt failed.

Steps to reproduce?

Unfortunately I don't have a good reproduction path. The issue seems to be consistently happening with only a few of our applications. Only happens after working well for a couple of days.

Environment

  1. OS: Debian 12, running on Google Kubernetes Engine
  2. JRE: Azul Zulu 21
  3. Cloud SQL Java Socket Factory version: 1.18.0

Additional Details

No response

@roy-t roy-t added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Aug 21, 2024
@hessjcg hessjcg added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Aug 22, 2024
@hessjcg
Copy link
Collaborator

hessjcg commented Aug 22, 2024

Roy,

This is definitely not the expected behavior. Thank you for the thorough description! Based on your description, it seems that the task scheduler that runs the refresh operations gets into a state where it doesn't work correctly.

As a work-around, you could try configuring your application to use the "Lazy Refresh" strategy instead of the default "Refresh Ahead" strategy.

When "lazy refresh" strategy is enabled, the connector to check and refresh the certificate while opening a database connection, instead of using a scheduled task. See Refresh Strategy for Serverless Compute.

To enable the lazy refresh strategy, add the JDBC connection property cloudSqlRefreshStrategy=lazy.

Meanwhile, we will investigate what changed between release 1.14.1 and 1.18.0 for clues as to why the connector is now less reliable for you.

-Jonathan

@hessjcg hessjcg added priority: p0 Highest priority. Critical issue. P0 implies highest priority. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Aug 25, 2024
@roy-t
Copy link
Author

roy-t commented Aug 26, 2024

Hey Jonathan,

Thanks for the workaround, I'll try it out. Is there any log message that I should see to know for sure that it is now using the lazy strategy? (Just to make sure my connection property is in the right place)

Sincerely,

Roy.

@roy-t
Copy link
Author

roy-t commented Aug 26, 2024

I think I've managed to enable the workaround:

image

@hessjcg
Copy link
Collaborator

hessjcg commented Aug 26, 2024

@roy-t,

Is the CPU throttled on your workloads where you see this issue? We have encountered similar problems in other server-less runtimes where the available CPU is extremely limited when the pod is not serving a request. I wonder if your configuration does something similar.

@roy-t
Copy link
Author

roy-t commented Aug 27, 2024

I did some checking,

In a few instances there is clear throttling around the time the database connection issues start happening. For example, here the connection issues started happening at 11:30.
image

But I can't find a good correspondence between all instances where we had connection issues, and throttling. And I wonder if the throttling happens because the certificate refresh requires a lot of CPU (sounds strange) or because the high error rate/retries after the database connection issues start.

One of the services I personally monitor, so know the most about, has a really low request CPU (100m) and a higher limit (1500m). (As it doesn't do much most of the day). Would this cause problems, and would you recommend then using the 'lazy' workaround or to just set the CPU request higher and accept some waste?

@enocom enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed priority: p0 Highest priority. Critical issue. P0 implies highest priority. labels Aug 27, 2024
@roy-t
Copy link
Author

roy-t commented Sep 2, 2024

We've been running the 'lazy' workaround now for a week on two services. One with 2 pods and low CPU (100m request) and one on 4 pods with high CPU request (8000m). Both services again ran into the same problem. So, I assuming now that how much CPU is requested is not the issue.

I do think the behavior is slightly different now.

Since August 26th everything was running fine. On Friday the 29th the database connection was having problems intermittently for two hours, but it does recover. Today (again 3 days later) I again see intermittent database connection problems, but the app also recovers.

I did some more digging:

At first I regularly see a connection being made successfully.

DEBUG 2024-08-30T19:57:02.999Z [10.25.8.111] Connecting to instance.
DEBUG 2024-08-30T19:57:03.002Z [10.25.8.111] Connected to instance successfully.

After a while I might see one or two connections fail.

ERROR 2024-08-30T20:02:44.895Z An unhandled exception occurred for request: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection..

I then see that the certificate is refreshed.

DEBUG 2024-08-30T20:01:37.752Z [bolcom-pro-cloud-explorer-3d2:europe-west4:cloud-explorer-csql] Lazy Refresh Operation: Client certificate has expired. Starting next refresh operation immediately.
DEBUG 2024-08-30T20:01:37.753Z [bolcom-pro-cloud-explorer-3d2:europe-west4:cloud-explorer-csql] Lazy Refresh Operation: Starting refresh operation.
DEBUG 2024-08-30T20:01:37.831Z [bolcom-pro-cloud-explorer-3d2:europe-west4:cloud-explorer-csql] METADATA DONE
DEBUG 2024-08-30T20:01:37.894Z [bolcom-pro-cloud-explorer-3d2:europe-west4:cloud-explorer-csql 5449] CERT DONE
DEBUG 2024-08-30T20:01:37.915Z [bolcom-pro-cloud-explorer-3d2:europe-west4:cloud-explorer-csql 5449] SSL CONTEXT
DEBUG 2024-08-30T20:01:37.915Z [bolcom-pro-cloud-explorer-3d2:europe-west4:cloud-explorer-csql] INSTANCE DATA DONE
DEBUG 2024-08-30T20:01:37.915Z [bolcom-pro-cloud-explorer-3d2:europe-west4:cloud-explorer-csql] Lazy Refresh Operation: Completed refresh with new certificate expiration at 2024-08-30T20:55:01.441Z.

Soon after that all connections fail:

ERROR 2024-08-30T20:02:44.895Z An unhandled exception occurred for request: org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:84) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:653) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:723) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:748) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:804) at org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate.query(NamedParameterJdbcTemplate.java:218) at org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate.query(NamedParameterJdbcTemplate.java:225) at com.bol.cloudexplorer.consolidator.data.ProjectRepository.getProject(ProjectRepository.java:144) at com.bol.cloudexplorer.app.endpoints.ExplorerResource.getConnections(ExplorerResource.java:128) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:255) at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:188) at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118) at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:926) at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:831) at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87) at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1089) at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:979) at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1014) at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:903) at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:564) at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java…

After a few minutes the application recovers. Only after 30 minutes I see new connection errors. This pattern repeats a couple of times. After which the application stays stable for a couple of days.

I wonder if something special is going with connection pooling and the time at which certificates need to be refreshed? Are they maybe valid for a couple of days, and do I see then multiple threads in the pool, refresh, fail, recover?

@hessjcg
Copy link
Collaborator

hessjcg commented Sep 5, 2024

Hi @roy-t, I think we have figured out the issue. When you use Cloud Run, the IAM Auth token is valid for 30 minutes, while the Cloud SQL client certificate is valid for 1 hour. The connector attempts to refresh the token, it somehow puts an expired IAM Auth Token into the client certificate.

image

I was not able to reproduce exactly this behavior, but I was able to reproduce this situation. Where the connector refreshes the certificate 4 minutes before the auth token expires, but uses the same almost-expired auth token. This causes the refresh operation rapidly repeat until the IAM Auth token has fully expired.

image

I have updated the connector to force the refresh of the IAM auth token before it attempts a certificate refresh. See #2063.

image

This behavior is more sane, and should avoid the race conditions that cause the refresh operation to thrash and possibly lock up entirely.

@hessjcg
Copy link
Collaborator

hessjcg commented Sep 5, 2024

I have an additional hypothesis about what is happening behind the scenes. I would like to investigate. This could affect the both IAM AuthN and password authentication.

Perhaps Google Cloud Authentication library has a bug that affects Cloud Run, GKE, and maybe some other runtimes. The methods GoogleCredentials.getAccessToken() and GoogleCredentials.refresh() run their API calls asynchronously using an executor. If that executor were to get locked up for some reason, then these calls could block indefinitely. Then, because GoogleCredentials.getAccessToken() blocks, the connector would be unable to perform a refresh operation. In Background Refresh mode, the connector's executor could become blocked. In Lazy refresh mode, the Hikari connection pool would log an 'unable to acquire database connection, timeout' on every connection attempt after the client certificate expired.

@wleese
Copy link

wleese commented Sep 5, 2024

Perhaps Google Cloud Authentication library has a bug that affects only Cloud Run.

Note that in @roy-t ’s case, he’s running on GKE (kubernetes).

hessjcg added a commit that referenced this issue Sep 5, 2024
…2063)

Refresh tokens and certificates 4 minutes before they expire to avoid creating race condition that would allow the connector to create an ephemeral certificate with an expired auth token.

Now, IAM auth tokens are now refreshed 4 minutes before they token expire. Also, the Lazy Refresh Strategy will refresh the client certificate 4 minutes before the expiration of the certificate and the IAM auth token.

This should mitigate some of the strange certificate expiration errors commonly found in Cloud Run, see: #2059
@hessjcg
Copy link
Collaborator

hessjcg commented Sep 5, 2024

@wleese, thanks for pointing that out. The Auth client library could bug may affect more than just Cloud Run.

@hessjcg
Copy link
Collaborator

hessjcg commented Sep 5, 2024

The release is done. Please upgrade to v1.20.1 or higher.

@hessjcg hessjcg added priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Sep 5, 2024
@roy-t
Copy link
Author

roy-t commented Sep 9, 2024

Great to hear you might have found it. I've overwritten my pom to use v.1.20.1 and I'll report back:

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>com.google.cloud.sql</groupId>
                <artifactId>postgres-socket-factory</artifactId>
                <version>1.20.1</version>
            </dependency>
            <dependency>
                <groupId>com.google.cloud.sql</groupId>
                <artifactId>jdbc-socket-factory-core</artifactId>
                <version>1.20.1</version>
            </dependency>
        </dependencies>
    </dependencyManagement>

@jackwotherspoon
Copy link
Collaborator

@julie0005 Can you please open a separate issue on this repo? Your issue seems unrelated to the original one this thread was created for. This will make it easier for us to help individual issues and not cause confusion to future users who may reference these in the future to get help 😄

@roy-t
Copy link
Author

roy-t commented Sep 16, 2024

In the week from September 2 to September 9 we had 250 JDBC related error messages in our service. (jsonPayload.message=~"JDBC") Since the changes I made on September 9th we've only had 3 JDBC related error message.

When investigating the 3 JDBC message that still happened last week, I couldn't put them near any certificate expiration/refresh, might just be a weird network thing, or something else that . I'll continue monitoring!

So to me it looks like v1.20.1 does solve the issue!

@jackwotherspoon
Copy link
Collaborator

Going to close this out as it seems based on the recent comment this has likely been resolved.

If more related errors arise feel free to re-open the issue 😄

Thanks again for all the great insight @roy-t 👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

5 participants