Tuesday, February 19, 2019

249: The Other Half of the Battle


You may have read in the news recently of the death of Roger Boisjoly, one of the engineers who was involved in the development and launch of the space shuttle Challenger. That shuttle exploded in midair back in 1986, killing seven astronauts and irreparably damaging the U.S. space program. Most likely, the article you read talked about how Boisjoly and his colleagues predicted that the “O-ring” joint on the shuttle would fail due to the cold temperatures, and desperately tried to convince their management to cancel the shuttle launch, only to be overridden and forced to helplessly watch the mission fail. Often this is seen as a parable about noble science and math geeks defeated by greedy and self-interested managers, who simply aren’t as smart, are too motivated by selfish concerns, or have cavalier attitudes towards sacrificing other people’s lives. But, as is often the case in life, the story isn’t really that simple. In particular, in the analysis by well-known data scientist Edward Tufte, this was a case where the math was valid, but poor communication of the math was ultimately at fault.

To review the basic outline of the event, the space shuttle was launched on a cold day in January 1986. Since cold weather was predicted, Boisjoly and his colleagues did an analysis of the failure rate of the O-ring joints in relation to the local temperature. There had never been a launch in temperatures as low as those predicted that day, in the low 30s Fahrenheit. The engineers predicted that there would be a significant risk of O-ring failure, so, as Boisjoly wrote, they “fought like Hell to stop that launch.” They met with their local managers at Morton Thiokol, who agreed there was some concern and quickly faxed 13 charts illustrating the data to their contacts at NASA, along with a recommendation not to launch. This was Thiokol’s first no-launch recommendation in 12 years. NASA pushed back, saying they were “appalled” by the recommendation, and managed to convince the Thiokol managers that the risk was acceptable, so Thiokol reversed its recommendation. Then, soon after launch, the shuttle blew up.

Tufte’s analysis focused on those 13 charts that the engineers sent to NASA. While the data was accurate, were the charts convincing, and was the accurate data clear enough for managers to interpret? Essentially, many of the charts were just columns of numbers, full of details that weren’t important for the discussion at hand. For example, one chart lists historical levels of damage measured in O-rings from returned shuttles, without relating it to the temperatures, which are listed elsewhere. Rockets are referred to by different names in different places (NASA ID numbers, Thiokol ID numbers, and launch dates), making it really hard to cross-reference data. Possible damage is broken down into six types, without consolidated information on total O-ring damage from each cause. And while they point out in one chart that the lowest-temperature launch had an unacceptable amount of damage, they don’t clearly relate temperatures to damage in a general sense, leaving a single anecdote as their most critical argument.

Tufte points out what he believes would have been the most effective way to communicate the concerns: a direct plot of O-ring damage vs temperature. When such a graph is drawn, with correct proportional spacing between the temperatures listed, a clear curve that slopes rapidly upwards towards the left end, where the temperatures are lowest, becomes visible. From such a plot, you can infer at a glance that the risk of launching in 30-degree temperatures would be astronomical. Yet this simple, direct argument was not included in those critical 13 Thiokol charts; it was theoretically implied by the totality of the data, but buried in the details.
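
For listeners who like to tinker, here is a rough Python sketch of the "proportional spacing" idea. To be clear about the assumptions: the numbers in `SAMPLE` are made up purely for illustration and are not the actual flight measurements, and the names `column_for` and `text_scatter` are just hypothetical helpers invented for this example. The point it tries to show is that each temperature gets a horizontal position proportional to its value, and that the axis is stretched down to the forecast range so the empty region on the left is visible.

```python
# Made-up (temperature in deg F, total damage score) pairs, for illustration only.
# These are NOT the real Challenger-era measurements.
SAMPLE = [(50, 9), (58, 4), (64, 2), (68, 1), (72, 0), (76, 0), (80, 0)]


def column_for(temp, t_min, t_max, width):
    """Map a temperature to a column so equal temperature gaps get equal spacing."""
    return round((temp - t_min) / (t_max - t_min) * (width - 1))


def text_scatter(points, t_min=25, t_max=85, width=61):
    """Build a text plot: damage on the vertical axis, temperature on the horizontal."""
    max_damage = max(damage for _, damage in points)
    rows = []
    for level in range(max_damage, -1, -1):
        cells = [" "] * width
        for temp, damage in points:
            if damage == level:
                cells[column_for(temp, t_min, t_max, width)] = "*"
        rows.append(f"{level:>3} |" + "".join(cells).rstrip())
    rows.append("    +" + "-" * width)
    # Label only the two ends of the temperature axis.
    rows.append("     " + str(t_min).ljust(width - len(str(t_max))) + str(t_max))
    return rows


if __name__ == "__main__":
    print("\n".join(text_scatter(SAMPLE)))
```

The axis deliberately starts at 25 rather than at the coldest sample point, so a reader can see how far a launch in the low 30s would sit from anything previously observed.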

Tufte points out three major sins in data communication illustrated by this incident:
  1. Chartjunk: as Tufte puts it, “Good design brings absolute attention to data”. Elements that are not relevant to the data you are trying to communicate, such as the breakdown of types for each piece of damage, or little pictures of rocket ships to make the graph more visually entertaining, only hurt the arguments the engineers were trying to make.
  2. Unclear Cause and Effect: We are naturally adapted for quickly understanding graphs with a cause on the X axis and effect on the Y axis, as in Tufte’s proposed temperature vs damage plot. By trying to include various other types of information, and not clearly focusing on the most important cause and effect, the engineers ultimately undermined their own argument.
  3. Poor data ordering: In some of the critical charts, the flights were listed by date, which obscured the ultimate effect they were trying to illustrate, and made it very hard to see the relation between temperature and damage.
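
The first and third points can be sketched in a few lines of Python. Again, treat this as a toy under stated assumptions: the `FLIGHTS` records, the flight labels, and the two damage types are all invented for illustration (the real charts broke damage into six types), and `consolidate` is a hypothetical helper name. It collapses the per-type breakdown into one total per flight and lists flights by temperature instead of by launch order.

```python
# Invented records for illustration only; not real flight data.
FLIGHTS = [
    {"flight": "A", "launch_order": 1, "temp_f": 66, "damage": {"erosion": 0, "blow_by": 0}},
    {"flight": "B", "launch_order": 2, "temp_f": 52, "damage": {"erosion": 3, "blow_by": 2}},
    {"flight": "C", "launch_order": 3, "temp_f": 75, "damage": {"erosion": 0, "blow_by": 0}},
    {"flight": "D", "launch_order": 4, "temp_f": 58, "damage": {"erosion": 1, "blow_by": 1}},
    {"flight": "E", "launch_order": 5, "temp_f": 70, "damage": {"erosion": 1, "blow_by": 0}},
]


def consolidate(flights):
    """Return (temperature, total damage, flight) tuples, coldest launch first."""
    rows = [
        (f["temp_f"], sum(f["damage"].values()), f["flight"])
        for f in flights
    ]
    # Tuples sort by their first element, so this orders by temperature.
    return sorted(rows)


if __name__ == "__main__":
    for temp, total, flight in consolidate(FLIGHTS):
        print(f"{temp:>3} F  total damage {total}  (flight {flight})")
```

Listed in launch order, the damaged flights in this toy data are scattered through the table; listed by temperature, they cluster at the cold end.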

Ultimately, this incident ended up portrayed in the media as a case of boneheaded managers messing up after being presented with perfectly reasonable data.   Famous physicist Richard Feynman summarized it as “For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled”.   But as we have seen, this is a gross oversimplification, and we have to assign some responsibility to those engineers who failed to properly communicate the mathematics.   Tufte’s summary adds a bit of nuanced insight to Feynman’s:  “Visual representations of evidence should be governed by principles of reasoning about quantitative evidence…  Clear and precise seeing becomes as one with clear and precise thinking.”

We should also mention that if you search online, you will find some who dispute Tufte’s analysis of this incident.  They claim that there are many other factors in the data that should have been considered, and it’s only with 20/20 hindsight that we can reproduce the precise temperature-vs-damage graph that seems so convincing now.   But it’s clear that the principles of data communication that Tufte points out are still valid in general.   If you are ever in a situation where you need to make an argument based on numerical data, think hard about issues like chartjunk, data ordering, and cause-and-effect, to reduce the chance that one of your own projects will explode in midair.  

And this has been your math mutation for today.