
Data Pipeline Availability and Reliability


So, it all starts with basics...

In distributed systems, many networked components work together to perform some useful end-to-end function. All components, including the network, utility grid, and physical environment, are subject to a Failure Rate. Given a large enough population of components, there is a high probability that some components in the system are always in a failed or failing state. This is nothing new, yet I've found there appears to be little real content, and little actual analysis, of some of the architectures we create.

In this article we'll work through a practical way to compute Availability and compare the availability of different architectures. We will categorize components and connection structures, transform that information into an availability computation, and then apply it to the example data pipeline we discussed in an earlier article.

This article is Part III in a series. In Part I, Simplify Your Data System with Logs, we discussed how logging infrastructures can bring data reliability and consistency to your data pipeline and web services. In Part II, Analyze a Typical Data Pipeline, we discussed a typical data pipeline and introduced a method to identify unnecessary components and critical checkpoints for reliability and availability, as well as hinted at how we want the architectural components to be connected. In this article we build on the previous two and demonstrate how solutions that appear similar can behave dramatically differently in failure, availability, and data reliability.

Continuing On!

There are many ways we could model failure and recovery in our architecture. We assert there is a Failure Rate associated with every component, and that given samples of the Failure Rate, we can create a Probability of Failure density function for each component, and then compute the probability of failure based on how the components are connected: in series, in parallel, and in combinations of both. Then we need to discuss the other side of the equation, which is recovery. Great, our system fails with some probability, but there is a huge difference between a failure lasting seconds vs. minutes vs. hours or days! Modeling this in the language of statistics is precise, but not necessarily practical to explain in a few hundred words.

Focus ...

IMHO, what we are really after is an evaluation technique to compare different architectures, and then a way to actually measure availability in some meaningful way. Let's start by classifying failure. In the simplest system, where we have n components serially connected, failure is any one of those components failing and preventing overall system operation.

Ohh, but what about degraded performance?

Great question! It depends. Let's say we consider our system to have failed when our customer's expectations are not being met, and let's assume that we had the foresight to create a Service Level Agreement (SLA). We will use our SLA to decide whether the system is in failure mode or not, and similarly we will extend the SLA to individual sub-components as appropriate, to assess whether they are in a state of failure. Now we have a working model for evaluating our system and classifying failure of the system and its sub-components.

So, at small scale the components are typically serially connected and this exercise is trivial, but what about medium and large scale? That's where we introduce series and parallel connections, and things get interesting! Note the difference in Failure Rate (Lambda) in the diagrams below: there is a huge difference between serial at (2)Lambda (2x the Failure Rate) and parallel at (2/3)Lambda (2/3x the Failure Rate)!

[Figure: Probability of failure for a series connection (courtesy of Alion Science and Technology)]

[Figure: Probability of failure for a parallel connection (courtesy of Alion Science and Technology)]
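For readers who want to see where the (2)Lambda and (2/3)Lambda figures come from, here is a minimal sketch of the math, assuming two identical, non-repairable components, each with a constant failure rate Lambda (λ); this is my own back-of-the-envelope derivation rather than a reproduction of the figures above:

```latex
% Series: either component failing takes the system down, so the rates add.
\lambda_{\text{series}} = \lambda + \lambda = 2\lambda

% Parallel: the pair survives until the second failure.
% Expected time to the first failure is 1/(2\lambda); the survivor then lasts 1/\lambda on average.
\mathrm{MTTF}_{\text{parallel}} = \frac{1}{2\lambda} + \frac{1}{\lambda} = \frac{3}{2\lambda}
\quad\Longrightarrow\quad
\lambda_{\text{eff}} = \frac{2\lambda}{3}
```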

I'm going to introduce a method of modeling using MTBF (Mean Time Between Failures), MDT (Mean Down Time), and MTTR (Mean Time To Recovery), because I think it is easier to understand. There are two states in the diagram below, up (running) and down (failed), with time represented on the horizontal axis. (Visuals courtesy of Wikipedia.)

[Diagram: up (running) and down (failed) periods of a component over time, used for the MTBF and MDT calculations]

Referring to the figure above, the MTBF of a component is the sum of the lengths of the operational periods divided by the number of observed failures:

MTBF = (sum of lengths of operational periods) / (number of observed failures)

In a similar manner, MDT can be defined as the sum of the lengths of the failure periods divided by the number of observed failures:

MDT = (sum of lengths of failure periods) / (number of observed failures)
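As a quick illustration of these two definitions, here is a minimal Python sketch; the uptime and downtime samples are made up purely for illustration:

```python
# Minimal sketch: compute MTBF and MDT from observed up/down periods.
# The sample periods below are made-up illustration data, not real measurements.

uptime_hours = [700.0, 650.0, 720.0]   # lengths of operational (up) periods
downtime_hours = [0.4, 0.6, 0.5]       # lengths of failure (down) periods

num_failures = len(downtime_hours)

mtbf = sum(uptime_hours) / num_failures    # sum of operational periods / observed failures
mdt = sum(downtime_hours) / num_failures   # sum of failure periods / observed failures

print(f"MTBF = {mtbf:.1f} hours, MDT = {mdt:.2f} hours")
```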

Let's take a look at our original diagram and annotate the different probabilities of failure, P(f), per connection. We'll assume P(f) of hardware instances is included in the process/thread P(f).

[Figure: the original communication diagram, annotated with P(f) per connection]

Wow, that is a lot of probabilities to compute. If all the communication were serial (failure of any one component causes system failure), then we could simply use the formulas below to compute the MTBF.

Serially Connected System

MTBF for a system where components c1;c2 are arranged in series.

MTBF(c1;c2) = 1 / (1/MTBF(c1) + 1/MTBF(c2))

MDT for a system where components c1;c2 are arranged in series.

MDT(c1;c2) = (MTBF(c1) × MDT(c2) + MTBF(c2) × MDT(c1)) / (MTBF(c1) + MTBF(c2))

These formulas for the MTBF and MDT of two components c1;c2 arranged in series (for instance hard drives, servers, processes, etc.) are courtesy of Wikipedia [4][5], which offers the following:

Through successive application of these four formulae, the MTBF and MDT of any network of repairable components can be computed, provided that the MTBF and MDT are known for each component. In the special but all-important case of several serial components, the MTBF calculation can be easily generalized into

MTBF(c1;...;cn) = (1/MTBF(c1) + 1/MTBF(c2) + ... + 1/MTBF(cn))^-1

*which can be shown by induction [6], and likewise, since the formula for the MDT of two components in parallel is identical to that of the MTBF for two components in series:*

MDT(c1||...||cn) = (1/MDT(c1) + 1/MDT(c2) + ... + 1/MDT(cn))^-1

We'll put this latter part about mdt(c1 || ... || cn) for parallel systems on our notepad for later reference.
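To make these formulas concrete, here is a minimal Python sketch of the series combinators (plus the parallel MDT generalization we just put on the notepad). The function names and the (MTBF, MDT)-pair representation are my own shorthand, not from Wikipedia:

```python
# Sketch of the formulas above. Each component is an (mtbf, mdt) pair in hours.

def mtbf_series(*components):
    """MTBF of n components in series: reciprocal of the summed reciprocal MTBFs."""
    return 1.0 / sum(1.0 / mtbf for mtbf, _ in components)

def mdt_series(c1, c2):
    """MDT of two components in series: an MTBF-weighted average of the two MDTs."""
    (mtbf1, mdt1), (mtbf2, mdt2) = c1, c2
    return (mtbf1 * mdt2 + mtbf2 * mdt1) / (mtbf1 + mtbf2)

def mdt_parallel(*components):
    """MDT of n components in parallel: same form as the series MTBF, per the note above."""
    return 1.0 / sum(1.0 / mdt for _, mdt in components)

# Two identical components with MTBF = 26,280 hours and MDT = 0.5 hours:
c = (26_280.0, 0.5)
print(mtbf_series(c, c))    # 13140.0 hours
print(mdt_series(c, c))     # 0.5 hours
print(mdt_parallel(c, c))   # 0.25 hours
```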

So we can compute the MTBF of the system by computing each link as series or parallel and summing the reciprocal MTBFs. The MDT of identical components in series is the same as that of any one component. Let's say our MTBF is 3 years, or 26,280 hours, and our MDT is 0.5 hours, so MTBF(c1;c2) = 13,140 hours and MDT(c1;c2) = 0.5 hours. For the system in the diagram we get MTBF(system) = 1,643 hours and MDT(system) = 0.5 hours.

Wow, each component has an MTBF of 3 years; how did we end up with a failure every 68 days?! That would be the magic of serially connected systems!
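Here is a quick arithmetic check of those numbers in Python. The count of 16 serially connected elements is inferred by working backwards from the quoted 1,643 hours (the original diagram isn't reproduced here), so treat it as an assumption:

```python
# Back-of-the-envelope check of the series numbers above.
# Assumes 16 identical serially connected elements, each with MTBF = 26,280 hours;
# the count of 16 is inferred from the quoted ~1,643-hour result, not read off the diagram.

component_mtbf_hours = 3 * 8_760.0   # 3 years = 26,280 hours
n_serial = 16

system_mtbf = 1.0 / (n_serial / component_mtbf_hours)
print(f"System MTBF ~ {system_mtbf:.1f} hours ~ {system_mtbf / 24:.0f} days")
# -> 1642.5 hours, i.e. about 68 days (the article rounds to 1,643)
```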

Hey wait! My system is not all Series connected!

In our system, after using our secret decoder wheel from Part II, Analyze a Typical Data Pipeline, we have zones that are serially connected and zones that are parallel connected, which makes a significant difference in the overall MTBF and MDT. Consider the diagram below, where series zones (A, C, E) and parallel zones (B, D, F) are marked accordingly so we can see the differences. Arguably, Zone D is a series zone, as Flume did not dynamically reconfigure if the configured Agent failed, but we will ignore that and compute Zone D as parallel for now.

[Figure: the original system marked into series zones (A, C, E) and parallel zones (B, D, F)]

So let’s model our pipelines.

Parallel Connected Systems

MTBF for a system where components c1||c2 are arranged in parallel.

MTBF(c1||c2) = 1 / ((1/MTBF(c1)) × PF(c2, MDT(c1)) + (1/MTBF(c2)) × PF(c1, MDT(c2)))

MDT for a system where components c1||c2 are arranged in parallel.

MDT(c1||c2) = (MDT(c1) × MDT(c2)) / (MDT(c1) + MDT(c2))

where c1||c2 is the network in which the components are arranged in parallel, and PF(c,t) is the probability of failure of component c during the "vulnerability window" t [4][5]. For a window t that is short relative to the MTBF, PF(c,t) can be approximated as t/MTBF(c), which is the approximation used in the numbers below.
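A matching Python sketch for the parallel case, using the PF(c,t) ≈ t/MTBF(c) approximation; as before, the function names and the (MTBF, MDT)-pair representation are my own shorthand:

```python
# Sketch of the parallel formulas above. Each component is an (mtbf, mdt) pair in hours.
# Uses the approximation PF(c, t) ~= t / MTBF(c) for a short vulnerability window t.

def pf(mtbf, window):
    """Approximate probability that a component fails during a short window."""
    return window / mtbf

def mtbf_parallel(c1, c2):
    """MTBF of two components in parallel."""
    (mtbf1, mdt1), (mtbf2, mdt2) = c1, c2
    return 1.0 / ((1.0 / mtbf1) * pf(mtbf2, mdt1) + (1.0 / mtbf2) * pf(mtbf1, mdt2))

def mdt_parallel(c1, c2):
    """MDT of two components in parallel."""
    (_, mdt1), (_, mdt2) = c1, c2
    return (mdt1 * mdt2) / (mdt1 + mdt2)

# Two identical components with MTBF = 26,280 hours and MDT = 0.5 hours:
c = (26_280.0, 0.5)
print(f"{mtbf_parallel(c, c):,.0f} hours")  # 690,638,400 hours
print(f"{mdt_parallel(c, c)} hours")        # 0.25 hours
```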

So for our new system, MTBF(system) = 2,628 hours and MDT(system) = 0.45 hours. By adding a few parallel components we have increased the MTBF by 60% and reduced the MDT by 10%!

Now I promised to compare this to our simplified system presented in the article Analyze a Typical Data Pipeline. Recall the below diagram.

[Figure: the simplified data pipeline from Analyze a Typical Data Pipeline]

We’ll break this into serial and parallel zones using our secret decoder wheel, just as we did previously.

[Figure: the simplified data pipeline marked into series and parallel zones]

Using the same process, we'll compute the segments. Again, for the sake of argument, we will assume the same MTBF of 3 years (26,280 hours) and MDT of 0.5 hours for each component instance as in the previous example.

The MTBF(c1;c2) for each Series connection is: 13,140 hours

The MTBF(c1||c2) for each Parallel connection is: 690,638,400 hours!

Combining the MTBF of each connection gives us ((1/13,140) + 4(1/690,638,400))^-1 = 13,139 hours.

You can see that our MTBF is limited by the first Series connection, and not by the Parallel connected components, since 1/13,140 >> 4/690,638,400.

Similarly, the MDT is 0.5 hours for the series connection and 0.25 hours for the parallel connections.

Comparing our original system (MTBF(system) = 1,643 hours, MDT(system) = 0.5 hours) to our simplified system (MTBF(system) = 13,139 hours, MDT(system) = 0.5 hours), we see an 8x improvement in MTBF, and correspondingly roughly an 8x reduction in unavailability.
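The same comparison in a few lines of Python, treating the simplified pipeline as one series segment plus four parallel segments combined in series, exactly as in the sum above:

```python
# Rough comparison of the two pipelines, using the per-segment values computed above.

serial_segment_mtbf = 13_140.0          # MTBF(c1;c2) for one series connection, in hours
parallel_segment_mtbf = 690_638_400.0   # MTBF(c1||c2) for one parallel connection, in hours

# Simplified pipeline: one series segment plus four parallel segments, combined in series,
# i.e. ((1/13,140) + 4(1/690,638,400))^-1.
simplified_mtbf = 1.0 / (1.0 / serial_segment_mtbf + 4.0 / parallel_segment_mtbf)

original_mtbf = 1_643.0   # the original pipeline's MTBF from earlier in the article

print(f"Simplified MTBF ~ {simplified_mtbf:,.0f} hours")        # ~13,139 hours
print(f"Improvement ~ {simplified_mtbf / original_mtbf:.1f}x")  # ~8.0x
```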

So obviously, we prefer truly parallel connected components with checkpoints when possible!

Thanks for reading! I hope you found this helpful. If you have any questions, reach out to me on LinkedIn.

~Miles

The views in this article are my own, based on my experiences over the last 10 years.