Support for deterministic ID strategy to simplify recovery from corrupted registry state? #4255

Closed
dmariassy opened this issue Jan 25, 2024 · 3 comments
Labels
type/enhancement New feature or request

Comments

@dmariassy

Feature or Problem Description

Imagine the following scenario:

  1. An operator accidentally TRUNCATEs all registry data in Postgres
  2. Kafka producers continue to run, and re-register the schemas during serialization
  3. The "newly" registered schemas will have different IDs than their predecessors, even if their content is identical to previously seen versions
  4. The error is detected, the registry is taken down for maintenance, the previous state of the registry is restored from backup
  5. The producers continue to add the pre-incident schema IDs to the records

Result: The data produced during the incident will become incomprehensible to consumers after recovery, because the IDs embedded in those records no longer point to the schemas they were written with. Where this is possible at all, affected producers will need to be rewound to a point in time before the incident started so that they can re-emit the records with correct schema IDs.
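
To make the failure mode concrete, here is a minimal sketch of the five steps against a toy in-memory registry. `SequentialRegistry` and its methods are hypothetical stand-ins, not the Apicurio API, and the sketch assumes the ID counter restarts after the truncate. If the counter kept counting instead, the incident ID would simply be unknown after the restore, which is just as bad for consumers.

```python
class SequentialRegistry:
    """Toy in-memory registry that assigns incrementing IDs (hypothetical)."""

    def __init__(self):
        self.schemas = {}  # schema ID -> schema content
        self.next_id = 1

    def register(self, content):
        # Return the existing ID if this exact content is already known.
        for schema_id, known in self.schemas.items():
            if known == content:
                return schema_id
        schema_id = self.next_id
        self.schemas[schema_id] = content
        self.next_id += 1
        return schema_id

    def backup(self):
        return dict(self.schemas), self.next_id

    def truncate(self):
        # Assumption: the ID counter is reset together with the data.
        self.schemas, self.next_id = {}, 1

    def restore(self, snapshot):
        schemas, next_id = snapshot
        self.schemas, self.next_id = dict(schemas), next_id


ORDERS = '{"type": "record", "name": "Order", "fields": []}'
PAYMENTS = '{"type": "record", "name": "Payment", "fields": []}'

registry = SequentialRegistry()
orders_id = registry.register(ORDERS)      # 1
payments_id = registry.register(PAYMENTS)  # 2
snapshot = registry.backup()

registry.truncate()                        # step 1: accidental TRUNCATE
incident_id = registry.register(PAYMENTS)  # steps 2-3: re-registered under a new ID
registry.restore(snapshot)                 # step 4: restore from backup

# A consumer reading a record written during the incident resolves the wrong schema.
resolved = registry.schemas[incident_id]
print(incident_id, resolved == PAYMENTS)   # 1 False
```

In this run the payment records written during the incident carry ID 1, which the restored registry maps to the order schema.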

Proposed Solution

If the registry could uniquely identify and store schemas based on their content (e.g. by using the schema hash as an ID), the IDs generated during the incident would match those that existed before the data corruption. As a result, the data produced and serialized during the incident would still make sense to consumers after recovery.
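
A rough sketch of the idea, assuming SHA-256 over a canonical JSON rendering of the schema. Both the canonicalization and the names `content_id` and `ContentAddressedRegistry` are my own illustration, not how the registry actually computes its identifiers.

```python
import hashlib
import json


def content_id(schema: dict) -> str:
    # Assumption: canonicalize by sorting keys and dropping whitespace,
    # so that equivalent JSON documents hash to the same value.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ContentAddressedRegistry:
    """Toy registry whose IDs depend only on schema content (hypothetical)."""

    def __init__(self):
        self.schemas = {}  # content hash -> schema

    def register(self, schema: dict) -> str:
        schema_id = content_id(schema)
        self.schemas[schema_id] = schema
        return schema_id


payment = {
    "type": "record",
    "name": "Payment",
    "fields": [{"name": "amount", "type": "long"}],
}

registry = ContentAddressedRegistry()
id_before = registry.register(payment)
snapshot = dict(registry.schemas)

registry.schemas.clear()                # accidental truncate
id_during = registry.register(payment)  # producer re-registers the same content
registry.schemas = dict(snapshot)       # restore from backup

# The ID written during the incident is still resolvable after the restore.
print(id_before == id_during, id_during in registry.schemas)  # True True
```

Because the ID is a pure function of the content, it does not matter whether the registry has seen the schema before or in which order schemas were registered.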

Additional Context

Is there an existing approach for managing the risks of a temporary corruption of registry state?

@dmariassy dmariassy added the type/enhancement New feature or request label Jan 25, 2024
@apicurio-bot

apicurio-bot bot commented Jan 25, 2024

Thank you for reporting an issue!

Pinging @carlesarnal to respond or triage.

@carlesarnal
Member

There is. Depending on what you want to achieve, one option is to use just the topic id strategy, so that even if a new schema is registered, the resolution will still use the correct one. We also have a strategy that uses a contentHash identifier, which is calculated purely from the content itself, so that would be another alternative. Looking at it now, it's not well documented, so let me make some improvements there and I'll add them here.

Thanks

@carlesarnal carlesarnal self-assigned this Apr 9, 2024
@carlesarnal
Member

Ok, we have pretty extensive documentation here on the different strategies for schema resolution.

Let me know if further clarification is needed.

This issue was closed.
Projects
Status: Done