This is part two in what I believe will be a five-part series of posts on human error. This post is based heavily on Chapter 3 of “To Err is Human” by the Institute of Medicine. The book reviews the current understanding of why medical mistakes happen and shows how that understanding applies to other high-hazard industries as well. A key theme is that legitimate liability and accountability concerns discourage reporting and deeper analysis of errors--which raises the question, "How can we learn from our mistakes?" This post covers why errors happen and distinguishes between active and latent errors.
Introduction
Human beings, in all lines of work, make errors. Errors can be prevented or reduced by designing systems that make it hard for people to do the wrong thing and easy for people to do the right thing. For example, cars are designed so that the driver cannot start the engine while the transmission is in reverse, which eliminates one class of accidents outright.
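As a rough software analogue of that interlock (a sketch only; the Car, Gear, and InterlockError names are hypothetical and not from the book), the idea is that the unsafe action is made impossible by the design of the system rather than left to the operator's memory or vigilance:

```python
from enum import Enum, auto

class Gear(Enum):
    PARK = auto()
    REVERSE = auto()
    NEUTRAL = auto()
    DRIVE = auto()

class InterlockError(Exception):
    """Raised when an operation is attempted in an unsafe state."""

class Car:
    def __init__(self) -> None:
        self.gear = Gear.PARK
        self.engine_running = False

    def start_engine(self) -> None:
        # The constraint lives in the system, not in a rule the driver
        # must remember: starting is refused unless the transmission
        # is in PARK or NEUTRAL.
        if self.gear not in (Gear.PARK, Gear.NEUTRAL):
            raise InterlockError("Cannot start the engine while in gear")
        self.engine_running = True

car = Car()
car.gear = Gear.REVERSE
try:
    car.start_engine()   # the wrong action is blocked by design
except InterlockError as e:
    print(e)
```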
Process safety can be thought of as a process producing only the results that were expected. Error is defined as the failure of a planned action to be completed as intended or the use of a wrong plan to achieve an aim. According to noted expert James Reason in “Human Error,” errors depend on two kinds of failures: either the correct action does not proceed as intended (an error of execution) or the original intended action is not correct (an error of planning). Not all errors result in harm, and not all are easily discovered when they occur.
Much can be learned from the analysis of errors, if you take the right approach. Care must be taken not to make processes too complex or cumbersome in the pursuit of better results, because that only leads to different errors. Errors that do not result in harm also represent an important opportunity to identify system improvements with the potential to prevent adverse events. Preventing errors means improving system design at all levels to make it safer. Building safety into processes is a more effective way to reduce errors than blaming individuals (some experts, such as Deming, believe improving processes is the only way to improve quality15). To make significant progress in safety, the focus of error investigations must shift from blaming individuals for past errors to preventing future errors by designing safety into the system. This does not mean that individuals can be careless or should not be held accountable. People must still be vigilant and held responsible for their actions. But when an error occurs, blaming an individual does little to make the system safer or to prevent someone else from committing the same error.
The goal of error analysis and reporting systems is to analyze the information they gather and identify ways to prevent future errors from occurring. The goal is not data collection. Collecting reports and not doing anything with the information serves no useful purpose. Adequate resources and other support must be provided for analysis and response to critical issues.
Key points
1. Some systems are more prone to accidents than others because of the way their components are tied together. Complex, technology-intensive operations such as health care, ship repair, and other high-hazard industries are especially prone to accidents.
2. Much can be done to make systems more reliable and safe. When large systems fail, it is almost always due to multiple faults that occur together rather than the actions of a single bad actor or a small group of them.
3. One of the greatest contributors to accidents in any industry is human error. However, saying that an accident is due to human error is not the same as assigning blame because most human errors are induced by system failures. Humans commit errors for a variety of known and complicated reasons.
4. Latent errors or system failures pose the greatest threat to safety in a complex system because they can lead to or exacerbate operator errors. They are failures built into the system, present long before the active error. Latent errors are difficult for the people working in the system to see, since they may be hidden in computers or in layers of management, and people become accustomed to working around the problem.
5. Many organizational responses to errors tend to focus on the active errors. Although this may sometimes be appropriate, in many cases it is not an effective way to make systems safer. If latent failures remain unaddressed, their accumulation actually makes the system more prone to future failure. Discovering and fixing latent failures and decreasing their duration are likely to have a greater effect on building safer systems than efforts to minimize active errors at the point at which they occur.
6. The application of human factors in other industries has successfully reduced errors.
Why Do Errors Happen?
The common initial reaction when errors occur is to find someone to blame. However, even apparently single events or errors are due most often to the convergence of multiple contributing factors. Blaming an individual does not change these factors, and the same error is likely to recur. Preventing errors and improving safety require a systems approach in order to modify the conditions that contribute to errors. People working in nuclear power plants, health care, aviation, and other high-hazard industries are among the most educated and dedicated workers in any field. The problem is not bad people; the problem is that the systems need to be made safer.
Accidents are a form of information about a system.3 They represent places in which the system failed and the breakdown resulted in harm. Charles Perrow's analysis of the accident at Three Mile Island identified how system design can cause or prevent accidents.4 James Reason extended the thinking by analyzing multiple accidents to examine the role of systems and the human contribution to accidents.5 "A system is a set of interdependent elements interacting to achieve a common aim. The elements may be both human and non-human (equipment, technologies, etc.)."
When large systems fail, it is often due to multiple faults that occur together in an unanticipated interaction,6 creating a chain of events in which the faults grow and evolve.7 Their accumulation results in an accident. "An accident is an event that involves damage to a defined system that disrupts the ongoing or future output of that system."8
The Challenger failed because of a combination of brittle O-ring seals, unexpectedly cold weather, reliance on the seals in the design of the boosters, and a change in the roles of the contractor and NASA. Individually, no one factor caused the event, but when they came together, disaster struck. What most people don't realize about the Challenger failure is that NASA and its solid rocket booster prime contractor, Morton Thiokol, knew about the deficient O-ring seal performance for years before the breakup of the Challenger (which was, strictly speaking, a fire and structural breakup, not an explosion). Why an organization as committed to safety and high performance as NASA continued to fly shuttle missions in the face of mounting evidence of O-ring failure cannot be explained away by evil managers focused on schedule to the detriment of safety. Reading Diane Vaughan's The Challenger Launch Decision is a must for anyone seeking to understand human error and organizational failure.
The complex coincidences that cause systems to fail could rarely have been foreseen by the people involved. As a result, they are reviewed only in hindsight; however, knowing the outcome of an event influences how we assess past events.10 Hindsight bias means that things that were not seen or understood at the time of the accident seem obvious in retrospect. Hindsight bias also misleads a reviewer into simplifying the causes of an accident, highlighting a single element as the cause and overlooking multiple contributing factors.
In any industry, one of the greatest contributors to accidents is human error. Perrow has estimated that, on average, 60–80 percent of accidents involve human error. Even when equipment failure occurs, it can be exacerbated by human error.13 However, saying that an accident is due to human error is not the same as assigning blame.
Latent and Active Errors
In considering how humans contribute to error, it is important to distinguish between active and latent errors.16 Active errors occur at the level of the frontline operator, and their effects are felt almost immediately. This is sometimes called the sharp end.17 Latent errors tend to be removed from the direct control of the operator and include things such as poor design, incorrect installation, faulty maintenance, bad management decisions, and poorly structured organizations. These are called the blunt end. For example, the active error is that the pilot crashed the plane; the latent error is a previously undiscovered design flaw that caused the plane to roll unexpectedly, in a way the pilot could not control.
Latent errors pose the greatest threat to safety in a complex system because they are often unrecognized and have the capacity to result in multiple types of active errors. Analysis of the Challenger accident traced contributing events back nine years. In the Three Mile Island accident, latent errors were traced back two years.18 Latent errors can be difficult for the people working in the system to notice since the errors may be hidden in the design of routine processes in computer programs or in the structure or management of the organization. People also become accustomed to design defects and learn to work around them, so they are often not recognized.
In her book about the Challenger accident, Vaughan describes the "normalization of deviance," in which small changes in behavior became the norm and expanded the boundaries so that additional deviations became acceptable.19 When deviant events become acceptable, the potential for errors is created because signals are overlooked or misinterpreted and accumulate without being noticed.
Current responses to errors tend to focus on the active errors: punishing individuals (e.g., firing or suing them), retraining, or other responses aimed at preventing recurrence of the active error. Although a punitive response may be appropriate in some cases (e.g., deliberate malfeasance), it is not an effective way to prevent recurrence. Because large system failures represent latent failures coming together in unexpected ways, they appear to be unique in retrospect. Since the same mix of factors is unlikely to occur again, efforts to prevent specific active errors are not likely to make the system any safer.
James Reason notes:
Rather than being the main instigators of an accident, operators tend to be inheritors of system defects created by poor design, incorrect installation, faulty maintenance, and bad management decisions. Their part is usually that of adding a final garnish to a lethal brew whose ingredients have already been long in the cooking.
For Further Reading
Understanding adverse events: human factors, James Reason
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1055294/pdf/qualhc00016-0008.pdf