For The Want of A Nail, The Kingdom Was Lost

For want of a nail the shoe was lost,
for want of a shoe the horse was lost,
for want of a horse the knight was lost,
for want of a knight the battle was lost,
for want of a battle the kingdom was lost.
So a kingdom was lost—all for want of a nail.

For a 2019 rewriting of the classic aphorism you could replace ‘nail’ with ‘sensor’ and ‘kingdom’ with ‘Boeing’.

On July 20, 2011, American Airlines announced an order for 460 jets, including 130 of the recently launched A320neos (new engine option), breaking a monopoly Boeing had enjoyed with the airline. This appears to have prompted Boeing to sanction the re-engineering of its 737, equipping it with more fuel-efficient engines (as Airbus had done with the A320neo) as opposed to developing a completely new offering, a decision which would save the company several $bn and years of pre-launch development.

However, grafting a new and heavier engine onto an existing frame created technical challenges. The 737 MAX’s larger CFM LEAP-1B engines had to be fitted further forward and higher up than in previous models, changing its aerodynamic response. The aircraft tends to pitch up at high angles of attack (AOA), making it especially vulnerable to stalling at take-off. To counter this threat, Boeing adapted its Manoeuvring Characteristics Augmentation System (MCAS), which its KC-46 Air Force tanker already used. When this software-based safety system detects, via a sensor, that the aircraft is operating at a high angle of attack, it adjusts the horizontal stabilizer trim to force the nose down, thus reducing the stalling risk. As the MCAS compensation mimics the pitching behaviour of previous models, it reduced the need for significant pilot retraining.

Contrary to Boeing’s traditional safety system architecture, there was no built-in redundancy: although MCAS accessed two onboard computers, they were used on alternate flights and didn’t communicate with each other. Readings from a single faulty AOA sensor could therefore cause MCAS to pitch the nose downward and force the aircraft into a dive. However, Boeing VP Mike Sinnett stated that this didn’t mean that MCAS was vulnerable to a single-point failure “because the pilots themselves are the backup”. Indeed, this rationale appears to have influenced both Boeing and the Federal Aviation Administration (FAA). Following the Lion Air 737 MAX crash in late 2018 (all 189 on board killed), in which MCAS was suspected of being a contributory cause, they both predicted that the roughly 4,800 MAX aircraft in service over the next 45 years would suffer 15 crashes if MCAS was left unchanged. A key assumption in the calculation was that pilots would react appropriately to any MCAS failure 99 times out of 100 (a two order of magnitude [2OOM] risk reduction). These numbers were part of the successful argument to keep flying the 737 MAX.

In March 2019, another 737 MAX, Ethiopian Airlines Flight 302, crashed in similar circumstances (all 157 on board killed), prompting a worldwide grounding of the 737 MAX fleet. Subsequently, MIT professor Arnold Barnett, based on the loss of two aircraft out of only 400 delivered, estimated there would be 24 crashes per year for a fleet size of 4,800. On that estimate, the FAA (and Boeing) underestimated the risk by a factor of 72 (24 per year against 15 in 45 years), which is almost 2OOM. Coincidence? Given the lack of training and other factors (stress, critical take-off period, limited time to react), it isn’t surprising that the pilots’ robustness as back-ups to a single point of failure was far lower than the FAA/Boeing assumption. Indeed, a former professor at Embry-Riddle Aeronautical University, Andrew Kornecki, who is an expert in redundancy systems, said operating with one or two sensors “would be fine if all the pilots were sufficiently trained in how to assess and handle the plane in the event of a problem”. But he would much prefer building the plane with three sensors, as Airbus does.
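The arithmetic behind the factor of 72 can be sketched in a few lines. This is only a back-of-envelope check using the figures quoted above. It reads the FAA/Boeing prediction as 15 crashes spread evenly over 45 years (the reading that reproduces the factor of 72), and the function and variable names are illustrative, not taken from any FAA, Boeing or MIT analysis.

```python
import math

# Figures quoted in the text
SERVICE_YEARS = 45              # period covered by the FAA/Boeing prediction
PREDICTED_CRASHES = 15          # predicted crashes over that period (assumed reading)
BARNETT_CRASHES_PER_YEAR = 24   # Barnett's estimate for a 4,800-aircraft fleet


def crashes_per_year(total_crashes, years):
    """Average annual crash rate, assuming crashes are spread evenly over the period."""
    return total_crashes / years


def underestimate_factor(revised_rate, predicted_rate):
    """How many times larger the revised rate is than the predicted one."""
    return revised_rate / predicted_rate


predicted_rate = crashes_per_year(PREDICTED_CRASHES, SERVICE_YEARS)
factor = underestimate_factor(BARNETT_CRASHES_PER_YEAR, predicted_rate)
orders_of_magnitude = math.log10(factor)

print(f"Predicted rate: {predicted_rate:.2f} crashes per year")
print(f"Underestimate factor: {factor:.0f}")
print(f"Orders of magnitude: {orders_of_magnitude:.2f}")
```

The logarithm comes out a little under 2, which is why the gap sits so close to the two orders of magnitude credited to the pilots.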

In many of the High Hazard Processing Plant Risk Assessments which I have chaired, mostly Hazard and Operability Studies (HAZOP), we have also undertaken a Layer of Protection Analysis (LOPA). This is because, whereas HAZOP is entirely qualitative (reliant on the judgement of the review team), LOPA offers a degree of quantification. It is usually applied when the scenario is complex and/or the ultimate consequence is severe (fatalities). Accordingly, I am confident that the stakeholders used something similar (in fact, probably more rigorous) when reviewing the 737 MAX. Based on what has been written thus far, it would be reasonable to conclude that the 2OOM risk reduction ascribed to the pilots in the event of an MCAS malfunction was wildly optimistic and perhaps reckless.

However, I think I can empathise with the teams that made that call. Remember that the work was undertaken following the Lion Air crash, where MCAS was strongly suspected as a contributory factor. At that time, Boeing had taken over 4,000 orders for the 737 MAX, making it by far the most successful commercial airliner venture ever and, according to Goldman Sachs, it was expected to make up 1/3 of Boeing’s revenues for the next 5 years, a cool $150bn. So, imagine the pressure on the review team as they worked the numbers in the review. I would be surprised if they weren’t aware of the fatal accident frequency threshold associated with the 737 MAX being deemed fit to continue to fly. I could also imagine that, on first pass, the numbers didn’t enable that threshold to be attained. What to do? Be the ones to imperil massive future revenue, profits and jobs? Or re-evaluate the numbers, being more cavalier with the assumptions? As LOPA is semi-quantitative and only delves as far as OOM granularity, I too might have been persuaded to reduce the probability of the pilots failing to act effectively from 1 in 10 to 1 in 100. Another few tweaks like that and the numbers may have enabled the threshold to be reached. Job done.
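To see how much hangs on that single order-of-magnitude choice, here is a minimal LOPA-style sketch. It uses the standard LOPA relationship (mitigated frequency equals the initiating event frequency multiplied by the probability of failure on demand, PFD, of each independent protection layer). Only the 1-in-10 and 1-in-100 pilot figures come from the text. The initiating frequency, the target frequency and all names are hypothetical values chosen purely for illustration, not numbers from any Boeing or FAA review.

```python
import math

# Hypothetical inputs, chosen only to illustrate the sensitivity (not real data)
INITIATING_FREQUENCY = 1e-3   # assumed erroneous MCAS activations per aircraft-year
TARGET_FREQUENCY = 5e-5       # assumed tolerable fatal accidents per aircraft-year


def mitigated_frequency(initiating_frequency, layer_pfds):
    """Initiating frequency multiplied by the PFD of each independent protection layer."""
    return initiating_frequency * math.prod(layer_pfds)


def meets_target(frequency, target):
    """True when the mitigated frequency is at or below the tolerable threshold."""
    return frequency <= target


# The pilots are the only protection layer credited in this sketch
for pilot_pfd in (0.1, 0.01):   # 1 in 10 versus 1 in 100, as discussed in the text
    frequency = mitigated_frequency(INITIATING_FREQUENCY, [pilot_pfd])
    verdict = "meets" if meets_target(frequency, TARGET_FREQUENCY) else "misses"
    print(f"Pilot PFD {pilot_pfd}: {frequency:.1e} per aircraft-year, {verdict} the target")
```

With these assumed inputs, crediting the pilots at 1 in 10 misses the target and crediting them at 1 in 100 meets it, so the verdict flips on one judgement call that LOPA cannot resolve any more finely than an order of magnitude.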

There is learning here for me and my peers. Firstly, resist starting with the end in mind, that is, pre-determining the safeguards needed to counter credible scenarios. One technology provider I work with does this and, for the most part, it is effective as well as efficient because they are providing established and tested facilities. Nevertheless, even for a replica design, the context will always be unique (client sensibilities, geography, ambient conditions etc). It is then understandable if a review team goes along with the calculations a competent designer presents. Recently, one of their clients refused to accept this approach, instead insisting we undertake the analysis from first principles using the client’s template. This proved challenging as, at least initially, the work was slow and laborious and engagement from the team fell away. However, as the scribe and I became more familiar with the procedure, things improved such that we were able to generate an answer that reflected the client’s operational experience, and only then cross-referenced it with the technology provider’s pre-determination. In most cases there was agreement. However, where there wasn’t, we tended to veer towards the more conservative outcome.

Ultimately, whether you’re reviewing the putative safety of an Ethylene Cracker or a series of commercial aircraft, the numbers matter, but perhaps not as much as the behaviour of the people who have to decide which numbers they will choose.

Oh, and by the way, at this time Boeing has only lost its king, not yet the kingdom.