Creating a resilient Kafka producer & consumer environment
Kafka is a great tool for adopting events as the medium of communication between services in your system. An event-based approach lets your app process incoming messages somewhat “lazily”, because everything happens asynchronously. And by abstracting scalability behind its partition mechanism, Kafka makes scalability a first-class concern for whoever wants to utilize it.
To create a resilient app or service that utilizes Kafka, there are a couple of questions you need to answer:
- How does the sequence of events/messages affect your app?
- How does your app deal with duplicate messages?
- What should the app do when it fails to process a message?
Sequence Matters
Suppose the producer app is responsible for the state machine of an object; for instance, an Order in a shopping-cart app. The Order has a status attribute indicating its latest state (it can be placed, paid, refunded, or canceled), and the app also maintains which transitions are valid, for example whether an Order can move from placed to refunded. There are two things that heavily need to be present in the way the events are produced:
1. Having a “version” attribute in the event
This helps a consumer that only cares about the latest version of an Order; such a consumer doesn’t care about the sequence, it only cares about the “correctness” of the latest version of the Order it holds.
Imagine there is a placed event for an order, but the consumer fails to process it due to a local error such as a glitch in its DB connection. The consumer chooses to move on to the next event and puts the placed event in a DLQ. By the time the issue is gone, the next message is the paid event for the same order.
The consumer no longer needs the placed event to be processed, because paid is the later version of the Order status. So when the placed event is “resurrected” from the DLQ, the app can safely ignore it just by comparing the version attribute.
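To make this concrete, here is a minimal sketch of that version check on the consumer side. The OrderEvent schema, the VersionStore interface, and every name in it are hypothetical; the only assumption is that the consumer keeps the last version it applied for each order somewhere (a DB row, a cache, etc.):

```go
package orders

import "encoding/json"

// OrderEvent is a hypothetical event payload; the real schema
// depends on what your producer publishes.
type OrderEvent struct {
	OrderID string `json:"order_id"`
	Status  string `json:"status"`
	Version int64  `json:"version"` // monotonically increasing per order
}

// VersionStore is an assumed abstraction over wherever the consumer
// keeps the last version it applied for each order.
type VersionStore interface {
	LastVersion(orderID string) (int64, error)
	Save(e OrderEvent) error
}

// Apply processes an event only if it is newer than what we already
// hold, so a placed event "resurrected" from the DLQ is safely
// ignored once paid has been applied.
func Apply(store VersionStore, raw []byte) error {
	var e OrderEvent
	if err := json.Unmarshal(raw, &e); err != nil {
		return err
	}
	last, err := store.LastVersion(e.OrderID)
	if err != nil {
		return err
	}
	if e.Version <= last {
		return nil // stale event: we already applied a newer version
	}
	return store.Save(e)
}
```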
2. Use consistent hashing for the message key
The producer app guarantees it publishes the status changes of an Order in the correct sequence. But Kafka’s partition mechanism only guarantees the sequence of messages within a partition, not in the topic as a whole, so partitions can be consumed concurrently; this is where Kafka abstracts the scalability. Since partitions are consumed concurrently, the producer app is responsible for producing all the status changes of one Order record into the same partition over time. To make this happen, the producer needs to produce each message of one Order with the same key, and map that key consistently to one partition.
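Here is a rough sketch of that keying on the producer side, using the segmentio/kafka-go client; the broker address and topic name are placeholders. The Hash balancer maps equal keys to the same partition, which is what keeps one Order’s events in sequence:

```go
package orders

import (
	"context"

	"github.com/segmentio/kafka-go"
)

// NewOrderWriter builds a writer whose Hash balancer consistently
// maps a message key to one partition.
func NewOrderWriter() *kafka.Writer {
	return &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"), // placeholder broker
		Topic:    "order-status",              // placeholder topic
		Balancer: &kafka.Hash{},               // same key -> same partition
	}
}

// PublishStatusChange keys every event by the order ID, so all status
// changes of one Order land, in order, on the same partition.
func PublishStatusChange(ctx context.Context, w *kafka.Writer, orderID string, payload []byte) error {
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(orderID),
		Value: payload,
	})
}
```

Note that kafka-go’s default balancer is round-robin, which would scatter one Order’s events across partitions; picking a key-hashing balancer is the whole point here.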
Idempotency is the Key
Kafka tries hard to make exactly-once delivery possible when producing a message, but with the incremental offset-commit mechanism on the consumer side, that premise won’t help much in assuring the consumer receives a message exactly once. Messages are basically pulled by the consumer, not pushed by Kafka, so the consumer is responsible for tracking the checkpoint, i.e. the latest offset it has consumed, and this process is prone to error; Kafka won’t help you much in this area.
Ideally, committing should happen once we’ve finished processing the message, and Kafka gives us all the options for how to commit: in batches by size, in batches by time, automatically, manually per message, and also asynchronously.
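As an example, with the segmentio/kafka-go client you can opt out of auto-commit by fetching and committing explicitly, only after processing succeeds; the broker, group ID, and topic below are placeholders:

```go
package orders

import (
	"context"
	"fmt"

	"github.com/segmentio/kafka-go"
)

// ConsumeWithManualCommit commits an offset only after the message
// has been processed, trading duplicates on restart for never losing
// a message (at-least-once delivery).
func ConsumeWithManualCommit(ctx context.Context, process func([]byte) error) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // placeholder
		GroupID: "order-consumer",           // placeholder
		Topic:   "order-status",             // placeholder
	})
	defer r.Close()

	for {
		msg, err := r.FetchMessage(ctx) // fetch without committing
		if err != nil {
			return err
		}
		if err := process(msg.Value); err != nil {
			// Don't commit: after a restart or rebalance the group
			// resumes from the last committed offset and retries.
			return fmt.Errorf("process offset %d: %w", msg.Offset, err)
		}
		if err := r.CommitMessages(ctx, msg); err != nil {
			// A commit that fails after processing succeeded is
			// exactly how duplicate messages are born.
			return err
		}
	}
}
```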
When a system gives you this many options, it basically can’t guarantee you anything; you’re the one in full control.
With all these options for committing a message, consuming duplicate messages is inevitable, whether due to a network error while committing the offset, or simply because the offset wasn’t committed gracefully during application exit. That’s why processing has to be idempotent: handling the same message twice must leave your system in the same state as handling it once, which the version check above already gives you.
Failure is Inevitable
In an event-driven world, you will have to accept that failures are inevitable. There will be cases where your application fails to process a message or event. The application may crash, or there may be transient network issues that prevent the application from connecting to the Kafka cluster.
The strategy for handling these failures largely depends on the importance and nature of the event or message. There are a few strategies that can be employed in such cases:
Dead-Letter Queues (DLQ): The Dead-Letter Queue is a powerful concept where events that cannot be processed are redirected. These can be consumed by another application for further processing, potentially with more error handling logic or manual intervention.
Retry Policies: Implementing a retry policy can be a good strategy for transient errors. Exponential backoff and jitter can be used to avoid the thundering herd problem and make your system more resilient (a sketch combining retries with a DLQ fallback follows this list).
Circuit Breakers: A circuit breaker can be used to avoid calling a known failing service. If your application has dependencies on other services and those services are failing, a circuit breaker can halt requests to the failing service, preventing further resource consumption and potential cascading failures.
Error Metrics & Logging: Implementing robust error logging and tracking metrics is crucial. Not only does it help in debugging the issues when they occur, but also it helps to proactively monitor the system to prevent future failures. It’s also useful to have alerts on these metrics to notify when a certain threshold is exceeded.
Event Sourcing: In the case of catastrophic failures where the system’s state is lost, event sourcing can be used to rebuild the state from the event log.
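As an illustration of the first two strategies together, here is a rough sketch of retries with exponential backoff and full jitter, falling back to a dead-letter topic. The attempt limit, the initial backoff, and the dlq writer (assumed to be configured for a topic such as order-status-dlq) are all assumptions, not fixed recommendations:

```go
package orders

import (
	"context"
	"math/rand"
	"time"

	"github.com/segmentio/kafka-go"
)

const maxAttempts = 5

// HandleWithRetry retries transient failures with exponential backoff
// plus jitter; if the message still fails after maxAttempts, it is
// redirected to a dead-letter topic for later inspection.
func HandleWithRetry(ctx context.Context, dlq *kafka.Writer, msg kafka.Message, process func([]byte) error) error {
	backoff := 100 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := process(msg.Value); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // out of attempts; no point sleeping again
		}
		// Full jitter: sleep a random duration up to the current
		// backoff, so many consumers don't retry in lockstep.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
	// Hand the poison message over to the DLQ topic so the rest of
	// the partition can keep flowing.
	return dlq.WriteMessages(ctx, kafka.Message{Key: msg.Key, Value: msg.Value})
}
```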
In conclusion, building a resilient Kafka producer and consumer environment requires a lot more than just careful sequencing of messages and idempotency. It requires a thorough understanding of the domain and potential error scenarios, and implementing a robust error handling and recovery mechanism that suits your application’s needs.