A Discussion on Antifragile Software

by Martin Monperrus

Questions by Russ Miller and Bett Correa. Answers by Martin Monperrus, based on the paper “Principles of Antifragile Software”

Please explain what is meant by Antifragile by Taleb?
Antifragile, by Taleb, offers a radically new perspective on errors. The classical thinking about errors is that:

  1. they are bad;
  2. they should be predicted;
  3. they should be removed – the utopia of error-free systems.

Taleb claims the opposite: 1) errors are good, 2) they should not be predicted, 3) they should be encouraged. He presents many arguments and examples to support his claim.

First, for “errors are good”, consider the immune system: if it is not continuously exercised with microbes (the errors), it becomes weak.

Second, for “errors should not be predicted”, the argument is that predictions are always eventually broken, because we live in a highly complex and chaotic world. When a prediction turns out to be false, such as the predicted maximum height of tsunamis on Japan’s coasts, the result is a catastrophe (Fukushima).

Third, on encouraging errors, Taleb says that there is an intrinsic value in errors. For instance, all dead Silicon Valley startups can be considered failures. However, beyond their failure, they all contribute something very good to the world: they prepare a market for their followers, they leave behind an idea or a technology that will later flourish, etc.

These three points, that errors are good, that they should not be predicted, and that they should be encouraged, have never before been put together, and so crisply. This gives the simplest definition of antifragility: “the antifragile loves errors”.

What about this book/concept inspired you to write the paper?
Antifragile contains examples from many different domains. For instance, I already cited the example of the immune system. Taleb is fond of discussing finance, politics, and medicine. However, much of what he says applies to engineering, and many readers have noticed this.

I’m a software engineer who has been fascinated by bugs, that is, “errors”, for years; it is my field of research. When I read this discussion about the nature of errors, the classical perspective against the antifragile perspective, it was something of a shock. I started to consider that “the antifragile loves errors” can be interpreted as an engineering principle. I revisited the classical perspective on software errors, looked for papers and books along this new line of thought, found very few of them, and decided that I had to write this paper and start researching in this direction.

Don’t we already have this concept? (e.g. high availability)?
No. Classical reliability engineering is about building robust software systems, and high availability follows this line. The word “resilience” is currently in vogue; it may mean robustness against catastrophic errors, but it is not “loving errors”. So no, there is no well-known equivalent concept. However, I found closely related ideas in some early papers. In a 1975 paper, Yau and Cheung from Northwestern University suggested inserting ghost planes into air traffic control systems: if a ghost plane lands safely, it says something about the ability of the system, the operators, and the other planes to resist perturbations. Inserting a ghost plane is rather close to artificially creating an error. However, this is a very isolated case in an engineering landscape where 1) errors are considered bad, 2) they should be predicted, 3) they should be removed.

You talk about the triad from Taleb’s book, fragile, robust, and antifragile – can you give examples of each with respect to software to help map the concept more clearly to software?
Of course! One day last December, my software development environment suddenly broke: there was no way to start it; the program crashed just after booting, with no clear error message. It took me weeks to understand the problem and to be able to work with that program again. The cause was a plugin I had installed in my browser for chatting with my colleagues in Norway. Through a complex chain of dependencies between my browser and my development environment, and because of a stack of fragile software, this chat plugin broke my development environment. This is pure fragility. Indeed, everybody has a good story about software fragility, for example the famous “blue screen of death” of the ’80s and ’90s. There is a great imbalance between fragility and robustness: we always remember a single failure, but it is hard to recall an example of a software application that works properly in all circumstances, a robust one. One kind of robustness specific to software concerns time. Software does not physically decay. However, over time it is used in many different environments and for different usages; if a piece of software is used for decades, it is robust over time and change, and this is very impressive. The TeX typesetting system, for instance, is one such system.

How does antifragility apply to software?
Only the future will give us the answer. In my paper, I distinguish between antifragility of the product (the code and the execution infrastructure) and antifragility of the process. On the first one, the product, I think that applying antifragility to software means adding much more randomization at all levels of the software stack (operating systems, virtual machines, libraries, …) and engineering a new generation of fault injection systems (“the antifragile loves errors”, again).
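As a toy illustration (not from the paper, and with purely hypothetical function names), library-level randomization could mean picking one of several equivalent implementations at random on each call, so that hidden dependencies on any single implementation surface early instead of silently accumulating:

```python
import random

def randomized(alternatives):
    """Decorator: on each call, run either the decorated function or a
    randomly chosen equivalent alternative (illustrative sketch)."""
    def decorator(default):
        impls = list(alternatives) + [default]
        def wrapper(*args, **kwargs):
            return random.choice(impls)(*args, **kwargs)
        return wrapper
    return decorator

# Two implementations that should be observably equivalent.
def sort_stable(xs):
    return sorted(xs)

@randomized([sort_stable])
def sort(xs):
    return list(reversed(sorted(xs, reverse=True)))

# Whichever implementation actually ran, callers must get the same answer.
assert sort([3, 1, 2]) == [1, 2, 3]
```

Any caller that accidentally relied on implementation details of one variant would fail intermittently, which is exactly the kind of early, cheap error the antifragile perspective welcomes.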

On the process side, I hypothesize that antifragile development processes are better at producing antifragile software systems. What would an antifragile development process look like? Many blog posts draw a link between antifragility and agile development. I agree. The antifragile is not afraid of errors: strong continuous testing and continuous deployment go in that direction. Like agile, the antifragile is not a top-down approach: short iterations provide a feedback loop from the user or the market. And a self-organized development team is completely an “organic process”, in a sense that Taleb would probably appreciate. However, this is not enough. “Fault injection in the process” is the next step; I discuss it in the next question.

What are some techniques for making software antifragile?   (architecture, design, tools, process)
Technically, I currently think that the best technique for making software antifragile is fault injection in production. This means, for instance, randomly crashing servers, as done by the well-known Simian Army (a fault injection framework popularized by Netflix). There are many kinds of fault injection in production, from flipping random bits in memory, to dropping network requests and messages, to shutting down complete datacenters (“Chaos Gorilla”). We are only at the beginning of engineering fault injection systems. The key idea behind fault injection in production is that it constantly exercises the recovery code, which then does not rot or decay. It also strongly encourages engineers to write good error-handling code.
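A minimal sketch of this idea in Python, assuming a hypothetical `Server` handle (the Simian Army itself is a far richer framework than this):

```python
import random

class Server:
    """Stand-in for a real server handle (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.alive = True
    def terminate(self):
        self.alive = False

def chaos_step(fleet, kill_probability, rng=random):
    """One round of Simian-Army-style fault injection: each live server
    is killed with a small probability, so that recovery code is
    exercised regularly instead of rotting."""
    victims = [s for s in fleet if s.alive and rng.random() < kill_probability]
    for server in victims:
        server.terminate()
    return [server.name for server in victims]

# A small fleet; in production the kill probability would be far lower.
fleet = [Server(f"web-{i}") for i in range(10)]
killed = chaos_step(fleet, kill_probability=0.1)
```

Running such a step periodically turns recovery from a rarely executed code path into a routinely tested one, which is the whole point of “loving errors”.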

Then, on the process side, I am a firm believer in the truck factor from agile development: your project must survive even if a truck runs over half of your team. This means that knowledge must be shared, as well as code ownership. One way to ensure this is to “simulate the truck” by making sure that people move often across projects. “Simulating the truck” is fault injection in the process. There are other paths: restaffing key people, recruiting a very bad developer, randomly modifying the documentation, etc.
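A crude sketch of “simulating the truck”, with hypothetical developer and project names: developers are periodically reshuffled across projects at random, so knowledge and code ownership never concentrate in one head:

```python
import random

def simulate_the_truck(team, projects, rng=random):
    """Randomly reassign each developer to a project, spreading
    knowledge and code ownership (illustrative sketch)."""
    shuffled = list(team)
    rng.shuffle(shuffled)
    return {dev: projects[i % len(projects)]
            for i, dev in enumerate(shuffled)}

rotation = simulate_the_truck(["ada", "ben", "cyd"], ["billing", "search"])
```

In practice the rotation would be tempered by skills and ongoing commitments, but the random element is what makes it fault injection rather than planning.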

Taleb talks about iatrogenics, not making the patient any sicker: is this a concept that maps to software?
Absolutely. Fault injection in production must be appropriately dosed. No company can afford 80% of its servers crashing every hour. Indeed, one must characterize the dependability losses (due to injected failures) and the dependability gains (due to software improvements) that result from using fault injection in production. This is probably the key part of antifragile engineering.
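As a back-of-envelope illustration (the numbers below are assumptions, not measurements), the trade-off can be framed as comparing the user-visible downtime that injection causes against the outage time that hardened recovery code prevents:

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def availability(downtime_hours_per_year):
    """Fraction of the year a service is up, given its total downtime."""
    return 1 - downtime_hours_per_year / HOURS_PER_YEAR

# Assumed figures: redundancy masks most injected crashes, so users see
# only ~0.5 h/year of injection-induced downtime, while better-exercised
# recovery code avoids one 4-hour outage per year.
loss = availability(0) - availability(0.5)  # cost of injected failures
gain = availability(0) - availability(4.0)  # avoided outage

assert gain > loss  # injection pays off under these assumptions
```

When the inequality flips, the treatment is making the patient sicker, which is exactly the iatrogenic situation to avoid.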

Any recent trends that you see leading to more antifragile software (e.g. microservices)?
Today, there is a clear trend towards advanced robustness. The “design for failure” motto of the cloud community really goes in that direction. But this is not antifragility per se.

The main trend towards antifragility is certainly the Simian Army. It is very visible and helps propagate the disruptive idea of fault injection in production. I see no other trend that is as strong and as directly related to antifragile software. The microservice is an interesting and rich concept. It is certainly directly related to robustness: many small services with asynchronous communication are always better at avoiding the global crashes that happen in monolithic applications. But this is not yet antifragility. Microservices tend to come with a strong monitoring infrastructure, and this sense of self is good for adaptive and self-learning fault tolerance. But the key relation between microservices and antifragility is probably that they provide a natural granularity for fault injection: one simply shuts down microservices at random.
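That granularity can be sketched in a few lines of Python, assuming a hypothetical registry mapping service names to their running state:

```python
import random

def pick_victim(registry, rng=random):
    """Choose one running microservice at random to shut down; the
    monitoring infrastructure should then observe the system recover
    (illustrative sketch)."""
    running = [name for name, up in registry.items() if up]
    return rng.choice(running) if running else None

registry = {"cart": True, "search": True, "billing": False}
victim = pick_victim(registry)
if victim is not None:
    registry[victim] = False  # stand-in for an actual shutdown call
```

Because each microservice is small and independently restartable, killing one is a contained, frequently repeatable error, which is precisely what fault injection needs.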

More on http://www.monperrus.net/martin/antifragile-software
