All Outages Like British Air are ALWAYS Human Error!

6/6/2017

Finger-pointing is a natural consequence of breaches and system interruptions. Power failure? Component shutdown? Programming glitch? The biggest lesson organizations should learn from incidents like the British Air disruption is that they are all, at their root, caused by human error. Hardware and software are not infallible, but ‘wetware’ is the ultimate cause of every – every – disruption. Mitigating this factor will yield dividends for any organization seeking to reduce Risk.

Why are humans the prime point of failure? Here are four reasons:

Humans are the Cog in Risk Management and Due Diligence
Risk Management by its very nature is non-deterministic and heuristic. It can’t be programmed and it can’t be learned from a book. It’s not a science; it has elements of art and practice to it. People must acquire the knowledge and apply it. Systems and bots can be programmed to react to information and to collect, analyze and report it, but the world is constantly changing, every situation is different, and judgment and assessment will always be essential… and never perfect.

Humans conflate Availability with Contingency
Many outages are caused or exacerbated because ‘fail-proof’ systems failed. Many data centers incorporate High Availability – redundancy, hardening, segregated graceful failover – and assume that because “It Can Never Fail” there is no need for Disaster Recovery or Business Continuity. When they do plan, they skip over or go light on essential elements.

Here’s a real-life example: a major global corporation hosted its production environment in an isolated data center in the US Southwest. The area was seismically inactive, free of hazards and politically benign, with calm weather. The data center itself had dual power grids, redundant generators, redundant telecom, redundant water. High security, compartmentalized access, biometrics, the works. It was Uptime Institute Tier 4: everything, down to the power into the racks, was High Availability. The only way the data center could go dark was if two technicians pulled two master switches, which were located far enough apart that one person could not do it but close enough to be within eyeshot of each other.

Can you guess what happened? One tech pulled one switch while the other tech was working on the other. Whoops! They turned the power back on. Then they said, “Wait, we should not have done that, it’s not procedure” and shut it back off. The power bouncing up and down caused further equipment disruption. Long story short, it took the organization the better part of a day to restore the environment. Why didn’t they activate their Disaster Recovery Plan? Because they did not have confidence in it, particularly the ‘return to normal’ part of the plan (a commonly overlooked area). Even if you think that your Plan A is 100%, you still need a Plan B.

Humans direct their own preparedness (or not)
Hardware and software can self-detect anomalies and self-schedule maintenance. What they can’t do is refuse to take training, avoid practice or assume their own superiority. Machines do not have hubris. People do. In the outage described above, the IT organization’s response was delayed by almost two hours and was initially sluggish. Why? The organization had a mechanism for instantaneous multi-channel communications to spin up teams, but it was not used, and the teams had to fall back on manual call trees. Why wasn’t the automated system used? The Incident Manager on duty (who was on a site visit to the data center at the time) had decided that they did not need the training on how to activate it. They knew how to send the initiation by company email, so why waste time taking the training module? Well, when the environment went dark, so did company email, and the manager had never learned the alternate procedures for launch. Another factor impeding the recovery was the lack of practice in hands-on recovery work under actual conditions. The teams on site were hindered by small things such as inefficient coordination and bad acoustics that made directions hard to hear, issues that could have been identified by exercising and mitigated before the crisis hit.

Humans have cognitive biases
People by nature are hard-wired not to understand Risk. We have built-in psychological flaws that impede our understanding: Confirmation Bias – the tendency to focus on data that corroborates our decisions; Zero Risk Bias – preferring small eliminations of risk over larger incremental reductions; Anchoring – fixating on the last threat, not future threats (hey TSA, can we have our pen knives and knitting needles back yet?); Normalcy Bias – under-estimating risks that are not part of our everyday understanding; Availability Bias – over-estimating risks that are cognitively or emotionally present in our minds; Base Rate Bias – miscalculating probability due to ignorance of total data; and the Texas Sharpshooter Fallacy – engineering data to support a post-facto assumption. All of these conspire against good judgment and decision-making on Risk. Machines and bots do not have cognitive biases… but they are developed, programmed and maintained by us imperfect humans who do.

In defense of humans, people do have one trait that machines and bots do not: Humans are heroic. Hardware and software can give 100% to a task, but only people can reach beyond themselves and give 110%. In the outage mentioned above, the success of the recovery was due to people climbing out of their beds in the middle of the night, driving, carpooling, hitching a ride or pedaling bicycles to the office and working feverishly doing yeoman duty to get ‘their’ systems back.

Conclusion
Your organization’s human capital can be your biggest Risk liability… or your biggest asset. You can build the controls and practices to mitigate the deficiencies above. It all depends on how much you wish to invest in recruiting the best, motivating them, providing state-of-the-art training and development (yes, these are two different things), giving them the tools they need (including other people), and most importantly, cultivating a culture of Resiliency in order to provide the Risk Management that your stakeholders, customers, and brand deserve.


From the Managing Principal

Thought leadership, observations and more
