Siddharth Ram
QuickBooks Engineering
8 min read · Apr 14, 2019


Root Cause Analysis: An Essential Systems Skill

The April 2019 headlines have been dominated by the Boeing 737 MAX crashes, which claimed so many lives. Superficial newspaper coverage talked about software failures and how Boeing made a mistake by relying on just one sensor. But why did Boeing do that? To understand how they got there, you really have to dig deep. The excellent analysis by Trevor Sumner and Dave Kammeyer shows how deep-rooted cultural and economic issues were behind the failure. A culture problem led to a cascade of bad decisions, and, as is often the case with shallow analysis, a part ended up being blamed.

Engineering is supposed to be about precision and exactness. When a defect is uncovered and a fix is applied, engineers need to ask ‘what caused this error?’ at a deeper level than ‘fixed a conditional clause’. Thinking deeply about the problem is a must in any engineering system with more than a passing level of complexity. Skipping the deeper analysis is a missed opportunity to prevent the same kind of problem from recurring.

Engineering systems are more than software. Complexity comes from process, culture, team dynamics, market pressures and system design. The root causes of problems are often related to culture and process, yet we do not spend enough time thinking about them.

Every defect fix has the potential to wipe out a whole class of defects, if we think about it deeply enough. At Intuit, thinking long and hard about defects, especially those discovered as production escapes, is mandatory. This document shares how we approach RCAs and why they are critical in complex engineering systems.

Drift into Failure

Sidney Dekker’s book is a must-read for everyone who deals with complex systems, engineering or not.

The more complex a system (and, by extension, the more complex its control structure), the more difficult it can become to map out the reverberations of changes (even carefully considered ones) throughout the rest of the system… Small changes somewhere in the system, or small variations in the initial state of a process, can lead to large consequences elsewhere.

Decisions make good local sense given the limited knowledge available to people in that part of the complex system. But invisible and unacknowledged reverberations of those decisions penetrate the complex system.

We can try to design a simple system, but complexity arises nevertheless, often from the interactions of many simple parts. Each part of the organization sees only its own simple piece; few ever see the entire complex landscape, including market pressures, revenue and growth goals, and a company’s attitude towards its customers, all of which drive decisions that add further complexity.

The Humble RCA: A Panacea?

If complexity is inevitable, so are failures. When a failure does happen, the worst outcome would be to ‘fix the part’ and ignore the system that allowed the failure to happen. Here is a poorly written outcome from an RCA I reviewed:

There was a bug in the script that triggers an automated JVM restart when the AZ fails over which prevented some of the JVM instances from reconnecting promptly to the new RDS instance.

It does not address: What allowed the bug to exist? Was there undue pressure on the team to release early? Is the organization accepting of ‘release now, worry later’? Did the engineers not get support from management to delay the release? Was inadequate testing done? If so, why was it inadequate?

Answering these questions would help wipe out a class of defects, rather than just fixing a bug. It does not solve the complexity problem, but it helps manage it better by always going back to ‘what was behind what caused this problem?’

At Intuit, we have a framework for doing RCAs. The rest of this document explores how we approach them.

RCA Principles

  • Be vocally self-critical. Every incident is a learning opportunity. No team, process or cultural attribute is above criticism.
  • An ounce of prevention is worth a pound of cure. Heroics are disproportionately rewarded: those who burn the midnight oil to fix production issues are seen as heroes. The real heroes are those who build resiliency into the system and never have to burn the midnight oil.
  • Broken systems, not broken components. A broken part is never the answer. A broken part manifests because of a flaw in the system that allowed it. Always look for the flaws in the system.

The 5 Whys

The 5 Whys framework is used to uncover root causes. Each ‘why’ forms the basis of the next question. The framework comes from Toyota Motor Corporation, which developed it as part of its manufacturing methodologies. Five is a suggested depth; you may need to go deeper (or stay shallower) to get to the root cause. The example above could have used the 5 Whys to go deeper:

There was a bug in the script that triggers an automated JVM restart when the AZ fails over which prevented some of the JVM instances from reconnecting promptly to the new RDS instance.

  • Why was there a bug in the script that triggered the automated JVM restart?

During release 2019.21, a code review was skipped because the release was going out the same day.

  • Why was the code review skipped?

There was a release going out the same day. If that was missed, there would be a one-week delay in releasing.

  • Why would there be a 1 week delay?

The software process allows releases only once a week.

  • Why does the process allow releases only once a week?

(…) Continuing this way gets to a root cause: maybe the build system was never a priority for the organization, even though it should have been. A further check would uncover why it was not a priority.

To fix the bug in the script, you really had to understand why the build system was not a priority for the organization. Fixing that root cause, and understanding the mindset behind it, would prevent a whole class of defects from happening again.
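
The chain itself is simple enough to capture as data and keep next to the incident record. Here is a minimal Python sketch of the walkthrough above; the `Why` record and the final, still-hypothetical answer are illustrative and not part of any Intuit tooling.

```python
from dataclasses import dataclass

@dataclass
class Why:
    question: str  # the 'why' asked at this level
    answer: str    # what the team found when it dug in

# The chain from the walkthrough above, captured as data. The last two entries
# follow the article's hypothetical continuation and are not established facts.
five_whys = [
    Why("Why was there a bug in the script that triggered the automated JVM restart?",
        "A code review was skipped during release 2019.21 because the release shipped the same day."),
    Why("Why was the code review skipped?",
        "Missing that day's release would have meant a one-week delay."),
    Why("Why would there be a one-week delay?",
        "The software process allows releases only once a week."),
    Why("Why does the process allow releases only once a week?",
        "Perhaps the build system was never a priority for the organization."),
    Why("Why was the build system not a priority?",
        "(to be established with the team and its leadership)"),
]

def print_chain(chain):
    """Print each level of the chain; the final answer is the candidate root cause."""
    for depth, why in enumerate(chain, start=1):
        print(f"{depth}. {why.question}\n   -> {why.answer}")

print_chain(five_whys)
```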

Note that there are often multiple underlying causes. Capturing them is done well with Ishikawa diagrams (also called fishbone diagrams).

The Swiss Cheese Model and Its Flaws

There are often several contributing factors to things that did or could go wrong. One way to think about them is via the Swiss cheese model. Wikipedia describes the Swiss cheese model as follows:

The Swiss cheese model of accident causation illustrates that, although many layers of defense lie between hazards and accidents, there are flaws in each layer that, if aligned, can allow the accident to occur.

Our goal is to understand the slices (systems, software, processes, mindsets, and so on), where the holes exist (e.g. software bugs), and how, when, and where new holes can emerge (e.g. software upgrades, hardware failures). Then we can decide how to most effectively add new slices, reinforce existing slices, or plug existing holes at the lowest cost and highest impact.
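
To make the ‘aligned holes’ idea concrete, here is a small, purely illustrative Python sketch (not a model we use in practice): each layer of defense misses a hazard with some probability, and an incident occurs only when every layer misses at once.

```python
import random

# Hypothetical layers of defense and the probability that each one misses a hazard.
# Layer names and numbers are illustrative only.
miss_probability = {
    "code review":       0.10,
    "automated tests":   0.05,
    "canary deployment": 0.20,
    "monitoring/alerts": 0.30,
}

def hazard_becomes_incident(rng: random.Random) -> bool:
    """An incident occurs only if the hazard slips through a hole in every layer."""
    return all(rng.random() < p for p in miss_probability.values())

rng = random.Random(42)
trials = 100_000
incidents = sum(hazard_becomes_incident(rng) for _ in range(trials))

print(f"Simulated incident rate: {incidents / trials:.5f}")
print(f"Analytic rate (product of misses): {0.10 * 0.05 * 0.20 * 0.30:.5f}")
```

Note that this toy model treats the holes as fixed and independent, which is already a big simplification.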

Dekker cautions that the Swiss cheese model is flawed. It does not account for the relationships between the holes, nor does it explain:

  • where the holes are or what they consist of,
  • why the holes are there in the first place,
  • why the holes change over time, both in size and location,
  • how the holes get to line up to produce an accident.

Premortems and FMEA

Postmortems are great for learning from failures. But we can also prevent failures by conducting premortems. Gary Klein describes a premortem in the Harvard Business Review as follows:

A premortem is the hypothetical opposite of a postmortem. A postmortem in a medical setting allows health professionals and the family to learn what caused a patient’s death. Everyone benefits except, of course, the patient. A premortem in a business setting comes at the beginning of a project rather than the end, so that the project can be improved rather than autopsied. Unlike a typical critiquing session, in which project team members are asked what might go wrong, the premortem operates on the assumption that the “patient” has died, and so asks what did go wrong. The team members’ task is to generate plausible reasons for the project’s failure.

A similar technique is the failure mode and effects analysis (FMEA). Wikipedia describes FMEA as follows:

Failure mode and effects analysis (FMEA) — also “failure modes” in many publications — was one of the first systematic techniques for failure analysis. It was developed by reliability engineers in the late 1950s to study problems that might arise from malfunctions of military systems. A FMEA is often the first step of a system reliability study. It involves reviewing as many components, assemblies, and subsystems as possible to identify failure modes, and their causes and effects.
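
A common way to run an FMEA, though not the only one, is to score each failure mode for severity, likelihood of occurrence and difficulty of detection, and rank by their product, the Risk Priority Number (RPN). The Python sketch below shows that scoring; the components, failure modes and scores are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str
    failure: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (caught easily) .. 10 (hard to detect)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection; higher means address sooner."""
        return self.severity * self.occurrence * self.detection

# Hypothetical failure modes for a service moving to the public cloud.
modes = [
    FailureMode("restart script", "JVMs do not reconnect after an AZ failover", 8, 4, 6),
    FailureMode("database", "RDS failover exceeds the connection timeout", 7, 3, 4),
    FailureMode("deploy pipeline", "a release ships without code review", 6, 5, 7),
]

for mode in sorted(modes, key=lambda m: m.rpn, reverse=True):
    print(f"RPN {mode.rpn:3d}  {mode.component}: {mode.failure}")
```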

For projects of significance (e.g., a component moving to the public cloud), a premortem is performed. What are the potential ways a failure could happen? What is our playbook if the failure does happen?

A system architecture diagram is useful for premortems. For each interface, think about the failures that could happen, what could cause them, and how recovery would take place. Mature organizations move from this theoretical exercise to practicing Chaos Engineering, but that is exceedingly rare.
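
One lightweight way to record that walk through the diagram is a table with the feared failure, its likely trigger and the recovery playbook for each interface. A minimal sketch follows, with entirely hypothetical interfaces and entries.

```python
# Hypothetical premortem notes for each interface in a system diagram.
# Every interface name, trigger and playbook entry below is made up for illustration.
premortem = [
    ("app -> RDS",
     "connections are not re-established after a failover",
     "stale connection-pool settings",
     "restart the JVMs from the runbook and verify pool health"),
    ("app -> auth service",
     "elevated latency causes request timeouts",
     "downstream deploy or throttling",
     "enable the cached-token fallback and page the auth on-call"),
    ("CDN -> app",
     "origin errors are served to customers",
     "a bad release reaching all instances",
     "roll back through the canary pipeline and serve a maintenance page"),
]

for interface, failure, trigger, recovery in premortem:
    print(f"{interface}\n  failure:  {failure}\n  trigger:  {trigger}\n  recovery: {recovery}\n")
```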

RCAs are for Systems

RCAs at Intuit are not just about software; they are for systems. If any complex undertaking, a conference for example, fails to go as expected, an RCA is often done to get to the root cause. These are often reviewed at the executive level, including by the CEO.

Commitment from Senior Leadership

RCAs uncover cultural issues, process issues and system architecture issues. For an RCA to be effective, senior leaders need to be present so that decisions can be made on reprioritization, changes in policy and redeclaration of goals. At Intuit, Engineering Directors take responsibility for leading and reviewing RCAs. In addition, RCAs of production incidents are shared with the entire engineering team for review and reflection.

In conclusion, RCAs are an important part of product development. They help teams be more thoughtful, deliver insights and manage the complex systems we work in.

Acknowledgement

Luu Tran (now at Amazon) helped shape the thinking around doing RCAs at Intuit and getting to root cause.
