
aws-lambda: Log retention gives rate exceeded error #31338

Open
1 task done
Exter-dg opened this issue Sep 6, 2024 · 2 comments · May be fixed by #31340
Labels
@aws-cdk/aws-lambda (Related to AWS Lambda) · bug (This issue is a bug.) · effort/medium (Medium work item – several days of effort) · p2

Comments


Exter-dg commented Sep 6, 2024

Describe the bug

Legacy log retention in Lambda gives a rate limit exceeded error.

We are in the process of upgrading our app from CDK v1 to v2. To test this, we created a new env in a new account and redeployed the configuration using CDK v1.

We are creating 70-80 lambdas with log retention enabled. The legacy log retention feature creates a custom resource Lambda to create the log group and set the retention policy. CDK v1 used to create Node 14 lambdas for this purpose (whose creation is now blocked in AWS). Hence, we disabled log retention, upgraded the stack to 2.151.0, and then re-enabled log retention.
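For reference, a minimal sketch of how we enable log retention on each function (runtime, handler, and asset path are illustrative, not our actual values):

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

export class ExampleStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Setting logRetention makes CDK deploy a LogRetention custom resource
    // Lambda that calls CreateLogGroup / PutRetentionPolicy at deploy time.
    new lambda.Function(this, 'Handler', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      logRetention: logs.RetentionDays.ONE_WEEK,
    });
  }
}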

While doing so, our stack is failing with the error:

Received response status [FAILED] from custom resource. Message returned: Out of attempts to change log group

Initially, we thought this was an issue with the “CreateLogGroup throttle limit in transactions per second” quota. We increased it from 10 to 80, but the issue persists.

On exploring the CloudWatch logs for the custom resource Lambda, we found:

2024-09-06T05:23:33.260Z	06a9833f-0ad3-4faf-8f94-aa78dd49d0ec	ERROR	{
  clientName: 'CloudWatchLogsClient',
  commandName: 'PutRetentionPolicyCommand',
  input: {
    logGroupName: '/aws/lambda/LogRetentionaae0aa3c5b4d-mE6Tt6xks1CB',
    retentionInDays: 1
  },
  error: ThrottlingException: Rate exceeded
      at de_ThrottlingExceptionRes (/var/runtime/node_modules/@aws-sdk/client-cloudwatch-logs/dist-cjs/index.js:2321:21)
      at de_CommandError (/var/runtime/node_modules/@aws-sdk/client-cloudwatch-logs/dist-cjs/index.js:2167:19)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-serde/dist-cjs/index.js:35:20
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/core/dist-cjs/index.js:165:18
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-retry/dist-cjs/index.js:320:38
      at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/index.js:34:22
      at async /var/task/index.js:1:1148
      at async /var/task/index.js:1:2728
      at async y (/var/task/index.js:1:1046) {
    '$fault': 'client',
    '$metadata': {
      httpStatusCode: 400,
      requestId: 'e247739c-8ebb-40d3-b85e-293802a87e24',
      extendedRequestId: undefined,
      cfId: undefined,
      attempts: 3,
      totalRetryDelay: 466
    },
    __type: 'ThrottlingException'
  },
  metadata: {
    httpStatusCode: 400,
    requestId: 'e247739c-8ebb-40d3-b85e-293802a87e24',
    extendedRequestId: undefined,
    cfId: undefined,
    attempts: 3,
    totalRetryDelay: 466
  }
}

Looks like an issue with the rate limit for PutRetentionPolicyCommand. The service quota for this cannot be changed. Our earlier implementation had one difference in how log retention was configured:
the base property was set to apply an exponential backoff (probably to handle such cases). This property is now deprecated, so we removed it during our upgrade from CDK v1 to v2. The documentation for LogRetentionRetryOptions says it was removed because retries are handled differently in AWS SDK v3. Is this what is causing the issue? Shouldn't CDK/SDK handle the backoff in this case?
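For context, a rough sketch of how the retry options were configured before the upgrade (values are illustrative; imports as in the sketch above, plus Duration from 'aws-cdk-lib'). In aws-cdk-lib v2 the base property still compiles but is documented as deprecated/unused since the move to AWS SDK v3, leaving maxRetries as the remaining knob:

new lambda.Function(this, 'Handler', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  logRetention: logs.RetentionDays.ONE_WEEK,
  logRetentionRetryOptions: {
    base: Duration.millis(200), // deprecated in v2; illustrative value
    maxRetries: 7,              // illustrative
  },
});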

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

1.204.0

Expected Behavior

Log retention backoff should be handled internally

Current Behavior

Creating legacy log retention for multiple lambdas together gives a rate limit exceeded error.

Reproduction Steps

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.108.1

Framework Version

No response

Node.js Version

v22.1.0

OS

macOS

Language

TypeScript

Language Version

No response

Other information

No response

@Exter-dg Exter-dg added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 6, 2024
@github-actions github-actions bot added @aws-cdk/aws-lambda Related to AWS Lambda potential-regression Marking this issue as a potential regression to be checked by team member labels Sep 6, 2024
@rix0rrr rix0rrr removed the potential-regression Marking this issue as a potential regression to be checked by team member label Sep 6, 2024
rix0rrr added a commit that referenced this issue Sep 6, 2024
When the Log Retention Lambda runs massively parallel (on 70+ Lambdas
at the same time), it can run into throttling problems and fail.

Raise the retry count and delays:

- Raise the default amount of retries from 5 -> 10
- Raise the sleep base from 100ms to 1s.
- Change the sleep calculation to apply the 10s limit *after* jitter instead
  of before (previously, we would take a fraction of 10s; now we take a
  fraction of the accumulated wait time, and after calculating that, cap
  it at 10s).

Fixes #31338.
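For illustration, a hedged sketch (not the actual handler code) of the delay calculation this commit describes, with the 10s cap applied after jitter:

// Assumed constants matching the commit description above.
const BASE_MS = 1_000; // sleep base, raised from 100 ms
const CAP_MS = 10_000; // per-attempt upper bound

// Full-jitter exponential backoff: take a random fraction of the accumulated
// exponential wait, then cap the result at 10 s (cap applied after jitter,
// not before).
function backoffMs(attempt: number): number {
  const exponential = BASE_MS * 2 ** attempt;
  const jittered = Math.random() * exponential;
  return Math.min(jittered, CAP_MS);
}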
@rix0rrr rix0rrr linked a pull request Sep 6, 2024 that will close this issue

Exter-dg commented Sep 6, 2024

@rix0rrr Is this related?
#24485

@khushail khushail added p2 effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Sep 6, 2024
Exter-dg (Author) commented:

We fixed it by increasing the value of maxRetries from 7 to 20.
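For anyone hitting the same throttle, this workaround amounts to something like the following (imports as in the earlier sketch; only maxRetries changes):

new lambda.Function(this, 'Handler', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  logRetention: logs.RetentionDays.ONE_WEEK,
  // More attempts for the LogRetention custom resource when many functions
  // deploy in parallel and PutRetentionPolicy gets throttled.
  logRetentionRetryOptions: { maxRetries: 20 },
});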
