
Analyze a Typical Data System


My previous post, Simplify Your Data System with Logs, discussed solving data reliability and consistency problems using logs. This post will walk through analyzing a typical data pipeline, and show how simplifying your data pipeline helps you avoid common issues and achieve greater availability, reliability, and scalability.

[Figure: simple flow]

In the example pipeline analyzed there are tens of millions of clients, 30 to 40 Front End Servers (FES), and 8 Flume Agents processing 17,000 events/second. The FES have many complexities that we will discuss in a future article for those who are interested. For now, we will just say that the FES receive, validate, and transform data, and produce multiple data feeds, which we will model here as a single feed. Data is aggregated from the FES to a Flume Agent, then moved to S3. The design tries to avoid CAP tradeoffs by buffering data locally before transmitting to the next component.

No distributed system is safe from network failures, thus network partitioning generally has to be tolerated. In the presence of a partition, one is then left with two options: **consistency** or **availability**. When choosing consistency over availability, the system will return an error or a time-out if particular information cannot be guaranteed to be up to date due to network partitioning. When choosing availability over consistency, the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee it is up to date due to network partitioning. In the absence of network failure – that is, when the distributed system is running normally – both availability and consistency can be satisfied. At a high level it seems like a good design, right?

What types of problems will we solve?

[Figure: puzzle]

Some of the problems encountered in the current design are: Availability, Latency, Consistency, Throughput, and Permanent Loss of Data if data fails transformation. Some transformations require knowledge of the session, requiring the load balancer to use “sticky sessions.” Data is batched, causing transient high load, and variance in data delivery breaks data SLAs. In this form, data transported by this pipeline is only suitable for latent analysis, and the pipeline is too slow for any real-time product features. Engineering Agility is affected because there is a massive area to regression test for the FES component, and most of the problems of data consistency and throughput are difficult to model in a test environment. Due to the lack of any synchronization, it is difficult to validate components by updating 1 of n instances and comparing. To be fair, some of these problems can be solved without significantly changing the architecture.

On with the Analysis

First, let’s talk about the analysis method. Let’s categorize the pipeline into functions. The importance of this is to understand the reliability features associated with the various pipeline functions. We might break this pipeline down into its internal components of significance, based on application processes, giving us this communication diagram.

[Figure: communication diagram]

To make this slightly more fun and less confusing, let me introduce some visual aids that will make the analysis easier. Credit to my friend Yongjia Wang who, to my knowledge, invented this method of visual representation.

[Figure: shape key]

There are Sources, Sinks, Push, Pull, and Push & Pull Connectors. The point of this exercise is to identify the separate components in the pipeline, determine where we need to measure consistency, and decide where to implement Checkpoints. Often this type of diagram exposes inefficiencies we might have previously overlooked. So now, with our secret decoder wheel, we can analyze our pipeline.
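To make the classification concrete, here is a minimal Python sketch of how we might tag each piece of the original pipeline with its role before deciding where Checkpoints belong. The component names and role assignments are my reading of the diagram described above, not an official model.

```python
from enum import Enum
from dataclasses import dataclass

class Role(Enum):
    SOURCE = "source"            # originates data
    ACTIVE_PUSH = "active_push"  # writer that sends data onward (Checkpoint candidate)
    ACTIVE_PULL = "active_pull"  # reader that fetches data (start of a transfer)
    SINK = "sink"                # persists or terminates data

@dataclass
class Component:
    name: str
    roles: tuple

# Assumed layout of the original pipeline, client to S3.
pipeline = [
    Component("client",        (Role.SOURCE, Role.ACTIVE_PUSH)),
    Component("load_balancer", ()),                              # stateless, transparent
    Component("fes",           (Role.ACTIVE_PUSH,)),             # receive, validate, transform
    Component("flume_client",  (Role.ACTIVE_PUSH,)),
    Component("flume_agent",   (Role.ACTIVE_PUSH, Role.SINK)),   # buffers to local disk
    Component("s3_loader",     (Role.ACTIVE_PULL, Role.ACTIVE_PUSH)),
    Component("s3",            (Role.SINK,)),
]

# Checkpoint candidates are the Active Push components.
checkpoint_candidates = [c.name for c in pipeline if Role.ACTIVE_PUSH in c.roles]
print(checkpoint_candidates)  # ['client', 'fes', 'flume_client', 'flume_agent', 's3_loader']
```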

What is a Checkpoint, and why do I care?

[Figure: checkpoint]

Consider a Checkpoint as a door to a building, where we want to know whether our data is outside or inside. When moving data, we need to consider what happens when computers crash, networks fail, and so on. Then it becomes really important to be on one side of the door or the other! Duplicates occur when we place the Checkpoint after the door, inside the building. We may actually push the data through the door, but fail to update the Checkpoint before a crash, causing us to send the data again on restart. Conversely, missing data happens when we place the Checkpoint before the door, outside the building, and we crash before the data actually goes through the door, causing us to think the data is already through the door when we restart.

It seems obvious, but when dealing with data in flight in multi-threaded, buffered, asynchronous, distributed systems, atomic operations like placing the Checkpoint in the door are generally not possible, and it can become difficult to determine the exact state when things die in unexpected ways, because data can be read, but not written and acknowledged. Enter the world of Checkpoints. The goal is simply to say we have acknowledgement that the data is safely through the door, and if the system dies a horrific death we know exactly where to restart, with the acceptable margin of error being either to produce a duplicate or to potentially miss a datum.
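To illustrate why Checkpoint placement decides whether you duplicate or drop data, here is a minimal Python sketch. The `send` and `save_checkpoint` functions are hypothetical stand-ins for a real transport and a real durable offset store; the point is only the two possible orderings and the failure window each one creates.

```python
# Hypothetical helpers: send(record) pushes data "through the door",
# save_checkpoint(offset) durably records how far we got.

def transfer_at_least_once(records, send, save_checkpoint, start_offset):
    """Checkpoint AFTER the send: a crash between the two steps
    means the record is re-sent on restart (duplicate, never lost)."""
    for offset in range(start_offset, len(records)):
        send(records[offset])          # data goes through the door
        save_checkpoint(offset + 1)    # ...then we remember that it did

def transfer_at_most_once(records, send, save_checkpoint, start_offset):
    """Checkpoint BEFORE the send: a crash between the two steps
    means the record is skipped on restart (lost, never duplicated)."""
    for offset in range(start_offset, len(records)):
        save_checkpoint(offset + 1)    # we claim the data is through the door
        send(records[offset])          # ...and only then push it
```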

Where are the Checkpoints?

So where do we place the Checkpoints, and how many do we need to ensure data reliability? There are some obvious places. Most folks think of Checkpointing at persistent Storage right away. But wait, can we actually Checkpoint Storage or a Sink? Only the writer (Active Push) can determine if the Sink contains what it expects it to. Some functions, like Active Pull ("Did we get the data?"), are not interesting Checkpoints because they are only the start of the transfer. When Active Pull and Push are combined, we only need to Checkpoint the Active Push ("Did we send the data?"). If our transfer contains a transformation, then we may want to monitor both what is read and what is written to ensure our transformation did not drop data.
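For the transformation case, one lightweight way to catch dropped data is to count what goes in and what comes out and reconcile the two. A minimal sketch, assuming the transform simply filters invalid records and that exact per-batch counts are available:

```python
def transform_and_send(batch, transform, is_valid, send):
    """Apply a transform and push the results, keeping read/rejected/written
    counters so the transfer can be reconciled afterwards."""
    counts = {"read": 0, "rejected": 0, "written": 0}
    for record in batch:
        counts["read"] += 1
        if not is_valid(record):
            counts["rejected"] += 1      # accounted for, not silently dropped
            continue
        send(transform(record))
        counts["written"] += 1
    return counts

def reconcile(counts):
    """Anything read must be either written or explicitly rejected."""
    missing = counts["read"] - counts["written"] - counts["rejected"]
    if missing != 0:
        raise RuntimeError(f"transformation dropped {missing} records")
```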

Let’s take a moment to discuss the Active Push positions in our above diagram. First, we have the client itself as the originator of data and the first Active Push. Depending on the client, we may or may not have the ability to provide a Checkpoint. For example, it could be a web client: as soon as the customer navigates away from the site, all the running code is terminated, and any data not yet sent may never be sent. So, checkpointing at the client is “it depends.” Next we have the load balancer, which typically is a stateless service designed to be transparent, which eliminates the need for, or value of, a Checkpoint. The FES is our first real Checkpoint opportunity for Active Push. So now, on top of the initial FES complexity of receive, validate, transform, Active Push, we add Checkpoint. The Flume Client is an Active Push to the Flume Agent, and similarly the Flume Agent is an Active Push to local disk. Finally, the S3 Loader is an Active Push to S3. So in summary, there are 5 Active Push Connectors that we can possibly Checkpoint, and 4 Sinks.

Can we do better?

When we look at the complexity of the model, we have to ask: is it really necessary, and/or good design, to have so many Active Push Connectors and Sinks? Let's try this simplified design, and introduce a Checkpoint log mechanism like Kafka that makes it easier for Active Push components to build Checkpoints and persist data reliably. Kafka is a special Sink because it will allow us to address some of our above-mentioned problems, but I'll get to that point later.

[Figure: simplified communication diagram]

Using the same method as above, we have 3 Active Push components vs. 5, and 1 Sink vs. 3. How significant is this? We have (3-5)/5, or 40% fewer Checkpoints, and (1-3)/3, or 66% fewer Sinks!

Let’s take a moment here to discuss why it is significant to reduce Checkpoints, in terms of scalability, availability, and reliability. Starting with availability, we model the pipeline as serially connected devices, each with some probability of failure expressed as a failure rate. For the sake of argument, let us assume all components have the same failure rate, lambda. In practice the components definitely do not have the same failure rate, but that is the next article on computing Reliability & Availability! To compute a comparative failure rate we can simply add the failure rate of each component.

[Figure: communication diagram with per-component lambda]
[Figure: simplified communication diagram with per-component lambda]
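Written out, the serial failure-rate model the diagrams illustrate is just a sum. Under the equal-lambda assumption, and counting 8 components in the original design against 4 in the simplified one (the counts implied by the diagrams above), it works out to:

```latex
\lambda_{\text{pipeline}} = \sum_{i=1}^{n} \lambda_i ,\qquad
\lambda_{\text{original}} = 8\lambda ,\qquad
\lambda_{\text{simplified}} = 4\lambda
```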

So given a choice, I would prefer to start with only 4x vs. 8x lambda failure rate!

Scalability is the ability of the system to handle more throughput, or data volume. Some components scale linearly, meaning that as instances are added the throughput increases according to some ratio of throughput per instance. Yep, nothing new there, but where things typically get interesting is what happens to availability and reliability when scaling. Enter the CAP Theorem, and related research. Some zones, like data persistence zones, need special attention, as there will be trade-offs between scale, availability, and consistency (or reliability), and it is important that we are aware of them and manage them accordingly!

What about our original gripes?

Let’s revisit our earlier gripes about our previous data pipeline, and see how many we have addressed by re-architecting the solution.

Data Availability is addressed because (1) the comparative failure rate drops from 8x to 4x lambda, and (2) Kafka is a distributed system allowing parallelism (more on the importance of that later!), producing better availability than that of a single instance.

Data Latency/consistency is addressed because there is only one persistence point in the pipe, Kafka, which is a special latency/reliability zone. Data is available by record within milliseconds of commit. Data reliability is configurable to meet your needs.

Data throughput, and consistency are now tunable factors based on the requirements and hardware available.

The problem of permanent data loss due to failed FES transformation still exists, but we can now decide to consume and transform raw data after it has landed in the Kafka Sink. Ooooh, options, that’s nice! We may decide to load Kafka partitions by session, so we can use the transfer of data to Kafka to sort our data into sessions.
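As a sketch of what that could look like, here is a minimal producer using the kafka-python client, keying each record by its session ID so that Kafka’s default partitioner keeps a session’s events together in one partition. The topic name, broker address, and event fields are made up for illustration.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# acks='all' makes the broker acknowledgement our Checkpoint signal:
# the record is only considered "through the door" once the in-sync
# replicas have it.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",      # assumed broker address
    acks="all",
    retries=5,
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_raw_event(event):
    # Keying by session ID routes all events for a session to the same
    # partition, which effectively sorts the data into sessions on the way in.
    future = producer.send("raw-events", key=event["session_id"], value=event)
    future.get(timeout=10)  # block until acknowledged; safe to advance our Checkpoint

publish_raw_event({"session_id": "abc123", "type": "click", "ts": 1700000000})
producer.flush()
```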

Batching of data is no longer necessary; we can perform all operations as a stream, reducing end-to-end latency.

Our data pipeline is now suitable for near-real-time product features to consume data, and we’ve gained reliability and consistency thanks to logs!

Engineering Agility

On Engineering Agility… in this design we have paved the way for de-constructing complex components, like the FES, by exposing raw, verified, and transformed data streams. Not only can these components be removed from the FES, but they can be tested independently by simply subscribing to a Kafka topic and running in parallel to current production. We can do performance testing by pointing the test variant at an earlier offset in the topic and letting it consume as fast as possible. This is HUGE, as it has all the benefits of testing in production, but none of the risk (assuming you have sufficient capacity in Kafka)!
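Here is a minimal sketch of that testing pattern with the kafka-python client; the topic name, broker address, and the transform under test are all assumed for illustration. The candidate transform runs in its own consumer group, starting from the earliest retained offset, so it replays production traffic without touching the production consumers.

```python
from kafka import KafkaConsumer  # pip install kafka-python

# A separate group_id gives the test variant its own offsets, so it can
# replay the topic in parallel with the production consumers.
consumer = KafkaConsumer(
    "raw-events",                            # assumed topic name
    bootstrap_servers="localhost:9092",      # assumed broker address
    group_id="fes-transform-candidate",
    auto_offset_reset="earliest",            # start from the oldest retained events
    enable_auto_commit=False,                # throwaway test run; no need to commit
)

def candidate_transform(raw_bytes):
    # Hypothetical stand-in for the transformation being validated.
    return raw_bytes.upper()

for message in consumer:
    transformed = candidate_transform(message.value)
    # Compare against production output, collect timings, etc.
```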

In Summary

Simplifying an existing system can yield significant benefits. Using a sink like Kafka, which treats data as atomic log entries, gives us a special data availability, reliability, scale, and testing zone, where we can tune the application and directly manage availability, consistency, scalability, and partitioning. In the next article we’ll go through a mathematical rationalization that I hope will make the availability claims clear, without digging too deeply into statistical science and modeling.

Thanks for reading, please do leave me feedback or contact me personally. I want to write about things folks find useful!

The views in this article are my own.