Improving API Reliability by Decoupling from the Schema Registry

Andrii Vandych
4 min read · Jul 31, 2024

Note: Schema Registry is a very good product: it supports high-availability setups, and its client libraries do an excellent job of caching schemas. The potential issues discussed here should only be a concern if you are pushing reliability to the highest possible standard.

Modern systems often rely on a combination of synchronous APIs, event buses, and asynchronous consumers to handle complex workflows efficiently. These components frequently depend on a schema registry to ensure data consistency and format validation across different services.

In this architecture, the web service processes user requests and fetches schemas from the schema registry to ensure data is correctly formatted. The business data and events are stored together in a database within the same transaction to maintain consistency (the transactional outbox pattern). These events are then propagated to Kafka, which distributes them to various consumer services. Each consumer fetches the necessary schema from the registry to process the events correctly.
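
To make the write path concrete, here is a minimal sketch of the transactional outbox write in plain JDBC. The table and column names are hypothetical, and the event payload is assumed to be already serialized:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;
import javax.sql.DataSource;

// Minimal transactional-outbox sketch: the business row and the event row
// commit atomically; a separate CDC/outbox poller ships the event to Kafka.
void saveOrderWithEvent(DataSource dataSource, UUID orderId,
                        byte[] orderPayload, byte[] eventPayload) throws Exception {
    try (Connection conn = dataSource.getConnection()) {
        conn.setAutoCommit(false);
        try (PreparedStatement order = conn.prepareStatement(
                 "INSERT INTO orders (id, payload) VALUES (?, ?)");
             PreparedStatement event = conn.prepareStatement(
                 "INSERT INTO outbox (id, topic, payload) VALUES (?, ?, ?)")) {
            order.setObject(1, orderId);
            order.setBytes(2, orderPayload);
            order.executeUpdate();

            event.setObject(1, UUID.randomUUID());
            event.setString(2, "orders");
            event.setBytes(3, eventPayload); // already serialized; framing is discussed below
            event.executeUpdate();
            conn.commit(); // both rows or neither
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}
```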

What Happens if the Schema Registry Fails?

Asynchronous flows can tolerate the schema registry being unavailable. Events may not be processed immediately, but they can be queued and handled once the registry recovers; the system simply catches up on the backlog. Recovery is straightforward: when the schema registry is back online, the queued events are processed normally.

However, for synchronous APIs, the impact is much worse. The inability to validate or serialize data means that user requests cannot be served at all, leading to a complete halt in service. This makes it crucial to have strategies in place to mitigate the dependency on the schema registry for synchronous flows, ensuring higher availability and reliability for real-time operations.

Why Do We Need the Schema Registry on the Producer?

Why do we need a schema registry on the producer side at all, when we can always determine which schema to use by generating it from our POCO class or by bundling the schema as a resource in our application?

The reason is that, for consumers to work correctly, they need to receive the schema ID along with each message. This schema ID is assigned by the Schema Registry; without it, consumers wouldn't know which schema version to use for deserialization.
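
Concretely, the standard Confluent serializers prepend a small header to every Avro payload. The snippet below reconstructs that wire format by hand (variable names are illustrative):

```java
import java.nio.ByteBuffer;

// Confluent wire format: a zero "magic byte", a 4-byte big-endian schema ID
// assigned by the registry, then the Avro binary-encoded payload.
ByteBuffer frame = ByteBuffer.allocate(1 + 4 + avroPayload.length);
frame.put((byte) 0x0);   // magic byte
frame.putInt(schemaId);  // registry-assigned schema ID
frame.put(avroPayload);  // Avro binary payload
byte[] message = frame.array();
```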

You might wonder why we can’t just use schema fingerprints as per the AVRO specification. Unfortunately, the Schema Registry doesn’t implement this part of the AVRO specification. Instead, it generates IDs using an internal algorithm.
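
For reference, the Avro Java library does implement the spec's fingerprinting. Computing the 64-bit fingerprint of a schema looks like this:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

Schema schema = new Schema.Parser().parse(schemaJson); // schemaJson: the .avsc contents
long fingerprint = SchemaNormalization.parsingFingerprint64(schema); // CRC-64-AVRO over the canonical form
```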

This means producers must connect to the schema registry to obtain the schema ID, even if they already have the schema locally. This dependency is due to the current implementation of the schema registry, as highlighted in this issue.

What Can We Do?

What can we do to solve this problem? The answer is to decouple the producer from the schema registry. How do we do this? We move the dependency on the schema registry from the producer to our CDC/Outbox application.

By implementing schema fingerprinting, the producer can calculate the fingerprint of the schema it uses and include it along with the serialized message. The CDC/Outbox application then uses this fingerprint to match the schema from the schema registry (by generating a map of schema fingerprint to schema ID) before pushing the message to Kafka. This way, the event is written with the correct schema ID without the producer needing to contact the schema registry directly.
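
On the producer side this could look like the sketch below: the payload is framed with the schema's 8-byte fingerprint instead of a registry-assigned ID, so no registry call is needed at request time (serializeWithAvro is a hypothetical helper that does plain Avro binary encoding):

```java
import java.nio.ByteBuffer;
import org.apache.avro.SchemaNormalization;

long fingerprint = SchemaNormalization.parsingFingerprint64(schema); // computed once, at startup
byte[] payload = serializeWithAvro(record, schema); // hypothetical helper

// Frame the payload with the fingerprint; the CDC/Outbox application later
// swaps this 8-byte prefix for the registry-assigned schema ID.
ByteBuffer framed = ByteBuffer.allocate(8 + payload.length);
framed.putLong(fingerprint);
framed.put(payload);
byte[] outboxValue = framed.array(); // goes into the outbox table, no registry round-trip
```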

Here’s how it works:

  1. Producer: Calculates the fingerprint of the schema and includes it with the serialized message.
  2. CDC/Outbox: Uses the fingerprint to look up the schema ID from the schema registry.
  3. CDC/Outbox: Writes the event to Kafka with the correct schema ID.
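
Steps 2 and 3 hinge on a fingerprint-to-ID map. One way to build it, sketched with the Confluent client and the Avro library (the endpoint is a placeholder, exception handling is omitted, and real code should refresh the map and handle unknown fingerprints):

```java
import java.util.HashMap;
import java.util.Map;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

SchemaRegistryClient client = new CachedSchemaRegistryClient("http://schema-registry:8081", 100);

// Walk every subject/version in the registry and index each schema by its fingerprint.
Map<Long, Integer> fingerprintToId = new HashMap<>();
for (String subject : client.getAllSubjects()) {
    for (Integer version : client.getAllVersions(subject)) {
        SchemaMetadata meta = client.getSchemaMetadata(subject, version);
        Schema schema = new Schema.Parser().parse(meta.getSchema());
        fingerprintToId.put(SchemaNormalization.parsingFingerprint64(schema), meta.getId());
    }
}
```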

You can implement this change by writing a custom transformation for Debezium, a popular CDC tool, or by modifying your own outbox poller implementation if you have one (see the sketch below). This approach keeps the producer decoupled and able to operate without contacting the schema registry directly, improving overall system reliability and reducing potential points of failure.
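
For the Debezium route, the swap can live in a Kafka Connect single message transform (SMT). Below is a heavily simplified sketch, assuming the connector emits outbox payloads as raw bytes (ByteArrayConverter) and that FingerprintMap is a hypothetical helper that builds the map shown earlier:

```java
import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

public class FingerprintToIdTransform<R extends ConnectRecord<R>> implements Transformation<R> {
    private Map<Long, Integer> fingerprintToId; // fingerprint -> registry-assigned schema ID

    @Override
    public void configure(Map<String, ?> configs) {
        fingerprintToId = FingerprintMap.load(configs); // hypothetical helper
    }

    @Override
    public R apply(R record) {
        ByteBuffer in = ByteBuffer.wrap((byte[]) record.value());
        long fingerprint = in.getLong(); // the producer-written 8-byte prefix
        Integer schemaId = fingerprintToId.get(fingerprint);
        if (schemaId == null) {
            throw new IllegalStateException("Unknown schema fingerprint: " + fingerprint);
        }

        // Re-frame the payload in the standard Confluent wire format.
        ByteBuffer out = ByteBuffer.allocate(1 + 4 + in.remaining());
        out.put((byte) 0x0);
        out.putInt(schemaId);
        out.put(in);
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                null, out.array(), record.timestamp());
    }

    @Override public ConfigDef config() { return new ConfigDef(); }
    @Override public void close() { }
}
```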

What Would That Give Us?

By implementing such decoupling, a Schema Registry failure no longer prevents user requests from being handled correctly; the only effect is a delay in publishing messages to Kafka. For most systems, this is a significant improvement: the critical path of handling user requests remains unaffected, ensuring continuity of service.

It’s important to note that any schema used by the producer should be published to the Schema Registry beforehand. This can be done during the deployment process. Ensuring schemas are pre-registered is crucial to avoid runtime failures due to missing schemas.
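
Pre-registration can be a single call in a deploy step; a sketch with the Confluent client (the subject name and URL are placeholders):

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

SchemaRegistryClient client = new CachedSchemaRegistryClient("http://schema-registry:8081", 100);
// register() is idempotent: re-registering an identical schema returns the existing ID.
int schemaId = client.register("orders-value", new AvroSchema(schemaJson));
```

The same can also be done through the registry's REST API as part of a CI/CD pipeline.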

However, it’s worth repeating that these considerations are only necessary in extreme cases. The Schema Registry is a reliable product, supporting high availability setups and effective client library caching. Thus, decoupling should only be pursued if you are extremely focused on achieving the highest possible reliability.
