
Modeling & Scaling Kafka Clusters for Availability and Fault Tolerance


More than once I've heard peers say that they consider Kafka not scalable, or unreliable, or simply unavailable "at our scale". This is curious to me, as I've had the exact opposite experience. I typically ask, "Why, what happened?" The responses vary, but the problem usually starts with a lack of precise knowledge of what we are asking the cluster to do, and a lack of the optics needed to verify cluster performance against that ask and to understand what the cluster is actually doing (or attempting to do).

The focus of this article is to reason through Kafka cluster sizing and provide a model for scaling Kafka clusters.

What is the Problem?

When sizing a Kafka cluster we have to decide:

  • How much data do we need the Kafka cluster to ingest, store, & consume?
  • How many broker instances to provision and what hardware type?
  • How will brokers be configured?
  • How much storage will each broker need?
  • What is the expected availability, reliability and scale of the cluster?

For now we are going to ignore scalability problems that are either upstream or downstream of the Kafka cluster. We will assume Producers can write data at the maximum ingest rate, and that Consumer Groups can consume data at the maximum read rate. In real life we would want to know the data rates of production and consumption, the latency of writes to Kafka, and how far Consumer Groups lag behind topic publishing.

Let's classify Kafka fault recovery into four main categories, all of which assume the cluster has one or more faults active:

  • Partition Leader election recovery
  • ISR replication recovery
  • Broker recovery
  • Rebalancing Partition Leaders/Followers that are not evenly distributed across all brokers

Constant State of Failure?

Let's begin by addressing the question: "Are we reasonably within cluster scale limits?" We will start with a rough model to work with:

  1. data ingested
  2. data replicated
  3. data read
  4. data being recovered

Wait, what, recovered? Yes, we need to assume the cluster will be in a constant state of failure, and account for the built-in cluster failure recovery being active in our capacity plan!
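To make those four flows concrete, here is a back-of-the-envelope sketch in Python. Every number in it is an assumption for illustration, not a recommendation, and the recovery term is modeled simply as reserved headroom.

```python
# Rough, illustrative model of aggregate cluster traffic (MB/s); all inputs are assumptions.
ingest_mb_s = 100          # producer writes entering the cluster
replication_factor = 3     # copies of each partition (ISR target)
consumer_groups = 2        # groups that each read the full stream

write_traffic = ingest_mb_s * replication_factor    # leaders receive each byte once,
                                                     # followers receive RF - 1 more copies
read_traffic = ingest_mb_s * consumer_groups         # every group reads everything
recovery_headroom = 0.25                              # reserve for re-replication after a fault

total_mb_s = (write_traffic + read_traffic) * (1 + recovery_headroom)
print(f"cluster must sustain roughly {total_mb_s:.0f} MB/s at peak")
```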

Data Flight Path

[Figure: data flight path]

Walking through the data flow, we have:

  1. Producers publish chunks of data to Topics.
  2. Topics are composed of many Partitions.
  3. A single copy of a Partition on a Broker is a replica; a replica that is caught up with the Leader is an ISR (in-sync replica).
  4. The copies of a Partition spread across multiple Brokers together make up that single Partition's replica set.
  5. The target number of ISRs is the replication factor.
  6. Records can optionally be routed to Partitions by a Partition Key for a given topic (see the sketch after this list).
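As a concrete example of point 6, here is a minimal producer sketch using the kafka-python client; the broker address, topic name, key, and payload are all illustrative assumptions. Records that share a key hash to the same partition (as long as the partition count does not change), which preserves their relative order.

```python
from kafka import KafkaProducer  # assumes the kafka-python package is installed

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # illustrative address

# All records with key b"customer-42" land on the same partition of "orders",
# so a consumer sees them in the order they were produced.
producer.send("orders", key=b"customer-42", value=b'{"item": "widget", "qty": 3}')
producer.flush()
```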

There are other resources that go through these relationships in detail, so I won't do that here.

A Broker is either a Leader (3) for a Partition, or a Follower (6). Only one broker can be the leader for a partition at any given time. If a Follower Broker fails, the Leader Broker for that Partition must find another Follower Broker to satisfy the Partition's replication policy and replicate the Partition to the new Broker. The Leader is responsible for receiving all Producer writes, Consumer (7) reads, and replication traffic for the Partition(s) it leads. If the Partition Leader node fails, the surviving Follower Brokers must hold a Leader election, and the winner takes on the role of Leader for that Partition. Typically, the strategy is to distribute partition leadership as evenly as possible across the entire Kafka cluster of brokers, and similarly to distribute partition followers across brokers in different fault zones, ideally one replica of a partition per availability zone. A Consumer can read from one or more partitions. To say this differently, you cannot have more consumers in a Consumer Group (8) than partitions; the extra consumers will have nothing to read!
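A quick sketch of that last constraint, with made-up numbers:

```python
def effective_parallelism(num_partitions: int, num_consumers: int) -> int:
    """A consumer group can use at most one consumer per partition."""
    return min(num_partitions, num_consumers)

# 8 consumers against a 6-partition topic: 2 consumers sit idle.
print(effective_parallelism(num_partitions=6, num_consumers=8))  # -> 6
```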

This system is incredibly fault tolerant, provided recovery completes faster than new failures arrive, i.e., the Mean Time To Recovery (MTTR) is shorter than the Mean Time Between Failures (MTBF). If the MTTR is longer than the MTBF, the cluster will undergo a cascade of failures. What makes this tricky is that peak traffic can lengthen MTTR significantly, and a single ill-timed failure could bring down the whole cluster!
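To get a feel for that race between failures and recovery, here is a rough estimate of how long re-replicating a lost broker's data might take. Both input figures are assumptions for illustration.

```python
# Rough estimate of broker recovery time; all inputs are assumptions.
data_per_broker_gb = 2_000        # data hosted on the failed broker
spare_bandwidth_mb_s = 250        # bandwidth left over after normal peak traffic

mttr_hours = (data_per_broker_gb * 1_000 / spare_bandwidth_mb_s) / 3_600
print(f"approximate MTTR: {mttr_hours:.1f} hours")

# If another broker tends to fail sooner than that (MTBF < MTTR),
# recoveries pile up and the cluster can cascade into unavailability.
```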

This is the whole point of capacity planning ... to ensure your infrastructure survives probable failures during peak traffic.

[Figure: consumer group]

Let's take a moment to introduce Kafka cluster model factors:

Committed Utilization : Estimates the utilization of Broker resources like: storage, network bandwidth, RAM and cost.

Desired Load : Here we input the amount of data we intend to write to Kafka, how many consumer groups we will have, and whether those consumer groups are streaming or batch in nature.

Desired Disaster Recovery : Here we input the desired DR-related configuration variables for Kafka, like: What is our maximum lag for a streaming consumer? How long do we retain data in a topic? How many ISR replicas do we have?

Kafka Configuration : Here we input things like the number of broker instances, partitions per topic, number of ISR replicas, and so on, to model resource utilization in the cluster.

Instance Capacity : Here we input factors per broker like: RAM, Network capacity, price, storage and storage performance characteristics.
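To make these factors concrete, here is a minimal sketch of how a few of them might be combined in code. The field names, input values, and the simple arithmetic tying them together are all assumptions for illustration; the spreadsheet model below covers more variables (RAM, cost, storage performance, and so on), and you would want to leave committed utilization well below 100% so recovery traffic has room to run.

```python
from dataclasses import dataclass

@dataclass
class ClusterPlan:
    ingest_mb_s: float         # Desired Load: average producer write rate
    consumer_groups: int       # Desired Load: groups reading the full stream
    retention_hours: float     # Desired Disaster Recovery: topic retention
    replication_factor: int    # Desired DR / Kafka Configuration: ISR target
    brokers: int               # Kafka Configuration: broker instance count
    storage_gb: float          # Instance Capacity: usable storage per broker
    network_mb_s: float        # Instance Capacity: usable bandwidth per broker

    def storage_utilization(self) -> float:
        stored_gb = (self.ingest_mb_s * 3_600 * self.retention_hours
                     * self.replication_factor) / 1_000
        return stored_gb / (self.brokers * self.storage_gb)

    def network_utilization(self) -> float:
        write = self.ingest_mb_s * self.replication_factor
        read = self.ingest_mb_s * self.consumer_groups
        return (write + read) / (self.brokers * self.network_mb_s)

plan = ClusterPlan(ingest_mb_s=100, consumer_groups=2, retention_hours=48,
                   replication_factor=3, brokers=6,
                   storage_gb=12_000, network_mb_s=1_250)
print(f"storage committed: {plan.storage_utilization():.0%}")   # ~72%
print(f"network committed: {plan.network_utilization():.0%}")   # ~7%
```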

Yes, there are a lot of variables to model, and for some time my challenge with this post was finding a way to express the model simply and make it useful for people trying to get something done. So here is the model; all you have to do is plug in your numbers! Hopefully the descriptions add enough context for this to be useful for you. The orange boxes are for input data. The boxes with orange text are intended to be read-only.

Kafka Capacity Model