Digital Economy Dispatch #114 -- Virgin Orbit’s Launch Failure and the Management of Risk

15th January 2023

It was one of those days that you never forget: 28th January 1986. I was a PhD student in the Computer Lab at the University of Newcastle studying the problems organizations face in large-scale software engineering. Walking onto the campus that day, I was stopped by a colleague:

“Did you hear what happened? The space shuttle exploded just after it took off!”

The Challenger space shuttle had exploded just 73 seconds into its flight. It was a massive shock and a tragedy for all the people affected. However, I am embarrassed to admit that my first response to him was not one of sympathy for the crew and their families. It was a simple question:

“Was it a software error?”

Misguided as it was, that reaction reflected where my studies had taken me at the time. I’d spent months crawling through the growing literature on large-scale system failures and examining case studies detailing cost overruns and delays in deploying large government systems. As a result, it was becoming clear that we were increasingly relying on outdated processes for creating software embedded in mission-critical systems. As our dependence on software as a key component of these systems grew, so did the challenges of building, deploying, and managing them.

The World Runs on Software

Even as a young postgraduate student, I could see that errors and failures were commonplace in the vast number of complex digital solutions being deployed to run companies’ back-office functions, drive internet-based commerce, and manage critical parts of our national infrastructure. From energy production to aerospace and defence, a growing sense of concern was being voiced. Were there adequate checks on the processes by which these software-intensive systems were designed, constructed, and delivered? How could we ensure that such complex software and hardware solutions had been appropriately tested? What techniques should be used to evolve them, both to fix errors and to respond to new needs? What vulnerabilities do they expose to those motivated to exploit them for financial, political, or other reasons? How can we protect ourselves from the many kinds of risks we face in deploying such systems?

It is now more than 30 years since that terrible event. In that time, we have seen widespread digital transformation across many aspects of our lives, businesses, and social infrastructure, to the point at which it is said that “software is eating the world”. Yet recent events make us reflect on how far we have travelled in managing the challenges of deploying these large-scale digital systems, and what lessons we can learn about the road ahead.

A Failure to Launch

On 9th January 2023, almost 37 years after the Challenger disaster, we received another reminder of the risks of developing and delivering large-scale systems. With great excitement, the UK was hoping to become the first nation to launch satellites into orbit from western European soil when Virgin Orbit flew a modified jumbo jet from Cornwall, carrying a rocket with 9 small satellites to be deployed into low Earth orbit.

It ended in disappointment when an “anomaly” occurred that prevented the rocket from reaching the required altitude. Rather than placing the satellites 555km above the Earth, a malfunction stranded the rocket much lower in the sky. The rocket and satellites were lost, and many of the people involved were left bitterly disappointed. Immediately after the failed launch, confidence in Virgin Orbit’s ability to deliver on its plans for future launches eroded, and shares in Virgin Orbit dropped by up to 22%, wiping over $150M off its market capitalization.

Thoughts quickly turned to what went wrong. The investigations into the failure have only just begun and will inevitably take some time to reach their conclusions. Many different factors will be taken into account before the problems are identified and specific recommendations can be made. Often the culprit in these gigantic engineering efforts can be pinned down to a relatively small component (whether hardware or software). For example, the deep investigations into the space shuttle disaster in 1986 ultimately placed the blame on the ineffective operation of a seal between two segments of the rocket booster.

However, just as importantly, these reviews consider the processes, practices, and policies that allowed these component failures to occur, avoid detection, and remain as unmitigated failure points in the system. For instance, in the case of the shuttle disaster, the 248-page final report concluded that there was a “broken safety culture” that enabled the poor design of the seal and allowed no redundancy if a problem arose. The risk was increased by external pressures to go ahead with the launch in spite of the very cold temperatures that morning.

With the failed Virgin Orbit launch it is far too soon to jump to conclusions, and it would be foolish to do so here. Detailed reviews by experts will now take place, involving many months of work to bring insight into the technical aspects of what happened. However, such failures also provide an opportunity to consider the difficulties inherent in the processes and practices of delivering large-scale systems engineering activities. Based on personal observations and experiences working on a number of large-scale software and systems engineering projects over many years, I would suggest that three broad factors are important to consider in managing the risks in such projects: cost, complexity, and coherence.

Three Key Management Risk Factors: Cost, Complexity, and Coherence

In every large system, managing costs will inevitably play a major role in how the project operates. Cost concerns directly and indirectly influence every key step in decision making. For commercial ventures such as Virgin Orbit this is especially important. Founders, investors, shareholders, and employees all depend on the financial viability of the activities taking place. Plans and strategies depend on successful execution to build confidence and grow the customer base.

Unfortunately, advanced engineering projects in domains such as aerospace contain many unknowns. These are exacerbated when combined with rapidly changing economic, political, and social landscapes as we emerge from a global pandemic. The resulting “VUCA” considerations (Volatility, Uncertainty, Complexity, and Ambiguity) create a huge headache for the financial planning that underpins such efforts and can lead to intense debates between engineers and accountants about what is right and what is cost effective. As a result, in any review the trade-offs that have been made must always be examined to understand how those discussions have been handled and the criteria by which decisions were reached.

An important consequence of this context is that major programmes involving complex engineering can easily become overwhelmed by the difficulties of managing thousands of components, people, activities, processes, regulations, and so on. Project planning tools are filled with work breakdown structures, risk registers, and calendars of events. Walls are covered in Gantt charts and process flow diagrams. Offices are littered with post-it notes and task boards. It can often feel as if project management and reporting responsibilities have expanded to occupy more and more of the resources available.

In several large programmes in which I participated, the rising project management overhead became a severe source of tension across the teams. Paradoxically, rather than clarifying processes and responsibilities, this labyrinth of activities added to the complexity and slowed down progress. While everyone understood the importance of managing a vast array of artifacts and tasks throughout the lifetime of the systems under development, the difficulties of navigating the established procedures meant facing a seemingly endless barrage of decision gates, reviews, assessments, and reporting schedules. Mistakes and misinterpretations were inevitable. Even worse, taken to extremes, a dangerous perception emerges in which people feel torn between doing “what is right” and “what is pragmatic”.

Such concerns are amplified when the systems being developed must integrate a variety of components from many different teams, produced across a wide set of disciplines over extended periods of time. The challenge is not only how to coordinate such a large group of stakeholders. It is also essential to recognize that they may represent quite distinct companies, organizational domains, working traditions, and engineering disciplines. Each will exhibit a culture that informs its way of working and shapes the values by which it makes decisions. As a result, maintaining consistency and coherence is an important consideration in many large-scale engineering projects.

Although these concerns about cultural differences can be overplayed, there are important issues that have been highlighted in the past. Seminal accounts of large-scale systems engineering such as Fred Brooks’s “The Mythical Man-Month” and Tracy Kidder’s “The Soul of a New Machine” make particular reference to the negative impacts of clashes between hardware, software, and mechanical engineers. They describe how their project teams formed distinct and separate communities with their own specific ways of working. This is something that many would still recognize today. More recently, such conflicts have evolved to include differences over development practices (e.g., waterfall versus agile), management styles (e.g., hierarchical versus distributed), partnering approaches (e.g., outsourced versus insourced), and so on.

The consequences of this lack of coherence are much more than philosophical or linguistic. They have a tangible practical impact. A good example, seen in many large systems development projects, occurs in system testing. In all such projects, the testing activities form a critical backbone connecting teams across many different disciplines. Without a common understanding, misunderstandings, delays, and gaps in testing strategy can emerge. Clearly, a coherent and consistent approach to testing is vital for the success of complex large-scale systems. Yet many of the differences in culture across teams are exposed only during final system testing, leading to a variety of disputes about how to execute appropriate testing practices.

To Infinity and Beyond

Large-scale systems engineering projects in highly complex domains such as aerospace and defence are some of the most difficult undertakings attempted by any organization. The approaches they adopt and the paths they follow represent an important bellwether, with far-reaching impact on many others.

Virgin Orbit’s recent attempt to launch 9 new satellites should have been a cause of great celebration for the UK space industry. By mastering the complex software and systems engineering needed to fire a rocket carrying the satellites into orbit, it should also have demonstrated that we’ve reached another milestone in our digital transformation journey. Unfortunately, it did neither. Instead, the “anomaly” that thwarted its success reminds us that designing and delivering large-scale systems is fraught with risk. Difficult trade-offs must be made to balance three main concerns: cost, complexity, and coherence.