Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CRASH][v4.4.3][ESP32-S3] "Timed out waiting for completion of AES Interrupt" (IDFGH-9264) #10647

Closed
3 tasks done
chipweinberger opened this issue Jan 30, 2023 · 15 comments
Closed
3 tasks done
Assignees
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally Type: Bug bugs in IDF

Comments

@chipweinberger
Copy link
Contributor

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v4.4.

Operating System used.

macOS

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-S3 Dev Kit C

Power Supply used.

USB

What is the expected behavior?

AES CBC should complete.

What is the actual behavior?

Hits AES timeout.

/* Wait for AES hardware block operation to complete */
static void esp_aes_dma_wait_complete(bool use_intr, lldesc_t *output_desc)
{
#if defined (CONFIG_MBEDTLS_AES_USE_INTERRUPT)
    if (use_intr) {
        if (!xSemaphoreTake(op_complete_sem, 2000 / portTICK_PERIOD_MS)) {
            /* indicates a fundamental problem with driver */
            ESP_LOGE("AES", "Timed out waiting for completion of AES Interrupt");
            abort();
        }
#ifdef CONFIG_PM_ENABLE
        esp_pm_lock_release(s_pm_cpu_lock);
        esp_pm_lock_release(s_pm_sleep_lock);
#endif  // CONFIG_PM_ENABLE
    }
#endif

Steps to reproduce.

I've never hit this before. Might be tempermental.

Debug Logs.

0x40376487: panic_abort at /esp-idf/components/esp_system/panic.c:429

0x40383581: esp_system_abort at /esp-idf/components/esp_system/esp_system.c:128

0x4038daab: abort at /esp-idf/components/newlib/abort.c:46

0x4211057e: esp_aes_dma_wait_complete at /esp-idf/components/mbedtls/port/aes/dma/esp_aes.c:215
 (inlined by) esp_aes_process_dma at /esp-idf/components/mbedtls/port/aes/dma/esp_aes.c:435

0x421109a3: esp_aes_crypt_cbc at /esp-idf/components/mbedtls/port/aes/dma/esp_aes.c:729

0x4207f297: pd_ota_session_write(unsigned long long, pd_buf_t, pd_error_t*) at /jamcorder-firmware/jamcorder_common/jcommon/ota/pd_ota_session.cpp:467
 (inlined by) pd_ota_session_write(unsigned long long, pd_buf_t, pd_error_t*) at /jamcorder-firmware/jamcorder_common/jcommon/ota/pd_ota_session.cpp:672

More Information.

No response

@chipweinberger chipweinberger added the Type: Bug bugs in IDF label Jan 30, 2023
@chipweinberger
Copy link
Contributor Author

chipweinberger commented Jan 30, 2023

This is really bad behavior. Crashing is not acceptable during an ota update (where I use AES CBC).

Can we not return an error? Should we increase this timeout? is 2 seconds really enough?

@chipweinberger chipweinberger changed the title [v4.4.3][ESP32-S3] "Timed out waiting for completion of AES Interrupt" [CRASH][v4.4.3][ESP32-S3] "Timed out waiting for completion of AES Interrupt" Jan 30, 2023
@espressif-bot espressif-bot added the Status: Opened Issue is new label Jan 30, 2023
@github-actions github-actions bot changed the title [CRASH][v4.4.3][ESP32-S3] "Timed out waiting for completion of AES Interrupt" [CRASH][v4.4.3][ESP32-S3] "Timed out waiting for completion of AES Interrupt" (IDFGH-9264) Jan 30, 2023
@chipweinberger
Copy link
Contributor Author

PR to do something more friendly: https://github.com/espressif/esp-idf/pull/10648/files

@ESP-Marius
Copy link
Collaborator

Hmm, I dont really see how this could happen. It would help a lot if you were able to find a way to reproduce it.

Adding an error return instead of abort seems reasonable, but I would still like to get to the bottom of how this happened.

@ESP-Marius
Copy link
Collaborator

@chipweinberger are you by any chance using the Digital signature peripheral as part of your OTA process? (or at all in your app)

@chipweinberger
Copy link
Contributor Author

No I don't use DS peripheral for OTA.

Wifi + aes.

Yes it would be great to repro. I've only hit it once.

Thanks for merging the band-aid fix.

@chipweinberger
Copy link
Contributor Author

chipweinberger commented Feb 8, 2023

Oh one thing! When I hit this I was debugging a big memory leak in my OTA code! I was allocating a freertos semaphore in internal ram ~20 times a second.

My OTA would fail due to out of memory half way into the process.

Maybe ota + aes + memory exhaustion.

I am also using PSRAM, not sure if related.

@mahavirj
Copy link
Member

@chipweinberger

Is it possible to create a small reproducible use-case for this issue? So far, we have not been able to recreate this scenario on our side (tried different configurations including SPIRAM).

@chipweinberger
Copy link
Contributor Author

chipweinberger commented Mar 20, 2023

I appreciate your effort!

I remember reproducing it a couple times at least, but I imagine a minimal example will take awhile for me to create.

Did you try creating a new mutex every time new ota data arrived?

I was doing the update over wifi.

@mahavirj
Copy link
Member

Did you try creating a new mutex every time new ota data arrived?

No, I didn't. Do you mean OTA with low internal memory could run into this situation?

@chipweinberger
Copy link
Contributor Author

chipweinberger commented Mar 20, 2023

yes. low internal memory, which was then fully exhausted during the ota update.

I started the update with 20KB internal ram, and due to a bug was allocating and acquiring a mutex every 4KB of the ota update.

@dsiganos
Copy link

I am hitting this problem every time. Increasing the main stack size to 6000 bytes makes it work. I do not know what is eating up the stack memory yet. However this likely means that the problem happens when the task performing the AES operations runs out of stack.

espressif-bot pushed a commit that referenced this issue Sep 8, 2023
For certain data lengths, the last input descriptor was not getting appended
correctly and hence the EOF flag in the DMA descriptor link list was
set at incorrect location. This was resulting in the peripheral being
stalled expecting more data and eventually the code used to timeout
waiting for the AES completion interrupt.

Required configs for this issue:

CONFIG_MBEDTLS_HARDWARE_AES
CONFIG_SOC_AES_SUPPORT_DMA

This observation is similar to the issue reported in:
#10647

To recreate this issue, start the AES-GCM DMA operation with data length
12280 bytes and this should stall the operation forever.

In this fix, we are tracing the entire descriptor list and then appending the
extra bytes descriptor at correct position (as the last node).
espressif-bot pushed a commit that referenced this issue Sep 9, 2023
For certain data lengths, the last input descriptor was not getting appended
correctly and hence the EOF flag in the DMA descriptor link list was
set at incorrect location. This was resulting in the peripheral being
stalled expecting more data and eventually the code used to timeout
waiting for the AES completion interrupt.

Required configs for this issue:

CONFIG_MBEDTLS_HARDWARE_AES
CONFIG_SOC_AES_SUPPORT_DMA

This observation is similar to the issue reported in:
#10647

To recreate this issue, start the AES-GCM DMA operation with data length
12280 bytes and this should stall the operation forever.

In this fix, we are tracing the entire descriptor list and then appending the
extra bytes descriptor at correct position (as the last node).
espressif-bot pushed a commit that referenced this issue Sep 9, 2023
For certain data lengths, the last input descriptor was not getting appended
correctly and hence the EOF flag in the DMA descriptor link list was
set at incorrect location. This was resulting in the peripheral being
stalled expecting more data and eventually the code used to timeout
waiting for the AES completion interrupt.

Required configs for this issue:

CONFIG_MBEDTLS_HARDWARE_AES
CONFIG_SOC_AES_SUPPORT_DMA

This observation is similar to the issue reported in:
#10647

To recreate this issue, start the AES-GCM DMA operation with data length
12280 bytes and this should stall the operation forever.

In this fix, we are tracing the entire descriptor list and then appending the
extra bytes descriptor at correct position (as the last node).
@mahavirj
Copy link
Member

Update:

We have added few fixes in the AES DMA port layer recently, following are pointers for v4.4 branch

First one is the fix for heap corruption issue that may occur with specific AES dma lengths and second is specific to incorrect DMA descriptor setup. Please have a look at them, they are relevant to this issue. If you could not recreate "AES timeout" issue on latest release/v4.4 branch then we can close this tracker.

Thanks.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Sep 21, 2023
CommanderRedYT pushed a commit to CommanderRedYT/esp-idf that referenced this issue Sep 24, 2023
For certain data lengths, the last input descriptor was not getting appended
correctly and hence the EOF flag in the DMA descriptor link list was
set at incorrect location. This was resulting in the peripheral being
stalled expecting more data and eventually the code used to timeout
waiting for the AES completion interrupt.

Required configs for this issue:

CONFIG_MBEDTLS_HARDWARE_AES
CONFIG_SOC_AES_SUPPORT_DMA

This observation is similar to the issue reported in:
espressif#10647

To recreate this issue, start the AES-GCM DMA operation with data length
12280 bytes and this should stall the operation forever.

In this fix, we are tracing the entire descriptor list and then appending the
extra bytes descriptor at correct position (as the last node).
espressif-bot pushed a commit that referenced this issue Sep 30, 2023
For certain data lengths, the last input descriptor was not getting appended
correctly and hence the EOF flag in the DMA descriptor link list was
set at incorrect location. This was resulting in the peripheral being
stalled expecting more data and eventually the code used to timeout
waiting for the AES completion interrupt.

Required configs for this issue:

CONFIG_MBEDTLS_HARDWARE_AES
CONFIG_SOC_AES_SUPPORT_DMA

This observation is similar to the issue reported in:
#10647

To recreate this issue, start the AES-GCM DMA operation with data length
12280 bytes and this should stall the operation forever.

In this fix, we are tracing the entire descriptor list and then appending the
extra bytes descriptor at correct position (as the last node).
@espressif-bot espressif-bot added Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: In Progress Work is in progress labels Oct 3, 2023
@mahavirj mahavirj closed this as completed Oct 3, 2023
@chipweinberger
Copy link
Contributor Author

im still seeing this error in v5.1.2

as a workaround im using CONFIG_MBEDTLS_AES_USE_INTERRUPT=n

irriden added a commit to stakwork/sphinx-key that referenced this issue Dec 2, 2023
we otherwise hit the error below, when parsing mainnet block 812548:

Timed out waiting for completion of AES Interrupt
espressif/esp-idf#10647

This measure does not incur a memory, or latency cost
@kel30a
Copy link

kel30a commented Jan 19, 2024

im still seeing this error in v5.1.2

as a workaround im using CONFIG_MBEDTLS_AES_USE_INTERRUPT=n

Same. It was happening on 5.0.2 for me and still on 5.1.2 after upgrading.

@Maldus512
Copy link

I'm facing the same issue while generating the private key for a Rainmaker application. In order to make it work I have to increase ESP_RMAKER_CLAIM_TASK_STACK_SIZE to at least 22 * 1024 bytes (up from 10 * 1024).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally Type: Bug bugs in IDF
Projects
None yet
Development

No branches or pull requests

7 participants