Questions by Russ Miller and Bett Correa. Answers by Martin Monperrus, based on the paper “Principles of Antifragile Software”
Please explain what is meant by Antifragile by
Taleb?
Antifragile by Taleb is a radically new perspective on errors. The
classical thinking on errors is that:
- they are bad;
- they should be predicted;
- they should be removed – the utopia of error-free systems.
Taleb claims the opposite: 1) errors are good, 2) they should not be predicted, 3) they should be encouraged. He presents many arguments and examples to support his claim.
First, for “errors are good”, consider the immune system: if it is not continuously exercised by microbes (the errors), it becomes weak.
Second, for “errors should not be predicted”, the argument is that predictions are always eventually broken, because we live in a highly complex and chaotic world. When a prediction turns out to be false, such as the predicted maximum height of tsunamis on the Japanese coast, the result is a catastrophe (Fukushima).
Third, on encouraging errors, Taleb says that there is an intrinsic value in errors. For instance, all dead Silicon Valley startups can be considered failures. Yet beyond their failures, they all contribute something very good to the world: they prepare a market for those who follow, they leave behind an idea or a technology that will later flourish, and so on.
These three points – errors are good, they should not be predicted, they should be encouraged – had never been put together so crisply. And this yields the simplest definition of antifragility: “the antifragile loves errors”.
What about this book/concept inspired you to write the
paper?
Antifragile contains examples from many different domains. For instance,
I already cited the example of the immune system. Taleb is fond of
discussing finance, politics, and medicine. However, much of what he
says applies to engineering, and many readers have noticed this.
I’m a software engineer who has been fascinated by bugs – that is, “errors” – for years; it is my field of research. When I read all this discussion about the nature of errors, the classical perspective against the antifragile perspective, it was something of a shock. I started to consider that “the antifragile loves errors” could be interpreted as an engineering principle. I revisited the classical perspective on software errors, looked for papers and books along this new line of thought, found very few of them, and decided that I had to write this paper and start researching in this direction.
Don’t we already have this concept? (e.g. high
availability)?
No. Classical reliability engineering is about building robust
software systems, and high availability is along this line. The word
resilience is in vogue; it may mean robustness against catastrophic
errors, but it is not “loving errors”. So no, there is no well-known
equivalent concept. However, I found closely related ideas in some early
papers. In a 1975 paper, Yau and Cheung from Northwestern University
suggested inserting ghost planes into air traffic control systems: if a
ghost plane lands safely, it says something about the ability of the
system, the operators, and the other planes to resist perturbations.
Inserting a ghost plane is rather close to artificially creating an
error. However, this is a very isolated case in an engineering
landscape where 1) errors are considered bad, 2) they should be
predicted, and 3) they should be removed.
You talk about the triad from Taleb’s book, fragile, robust,
and antifragile – can you give examples of each with respect to software
to help map the concept more clearly to software?
Of course! One day last December, my software development environment
was suddenly broken: there was no way to start it; the program crashed
just after booting, with no clear error message. It took me weeks to
understand the problem and be able to work with that program again. The
reason was that I had installed in my browser a plugin for chatting with
my colleagues in Norway. Through a complex chain of dependencies between
my browser and my development environment, and because of a stack of
fragile software, this chat plugin broke my development environment.
This is pure fragility. Indeed, everybody has a good story about
software fragility – in the 80s and 90s, for example, the famous
“blue screen of death”. There is a great imbalance between fragility and
robustness: we always remember a single failure, but it’s hard to recall
an example of a software application that works properly in all
circumstances, a robust one. One kind of robustness specific to software
is robustness with respect to time. Software does not physically decay.
However, over time it is used in a number of different environments and
for different purposes; if software is used for decades, it is robust
over time and change, and this is very impressive. The TeX typesetting
system, for instance, is one such system.
How does antifragility apply to software?
Only the future will give us the answer. In my paper, I distinguish
between antifragility of the product (the code and the
execution infrastructure) and antifragility of the process. On the
first one, the product, I think that applying antifragility to software
means adding much more randomization at all levels of the software
stack (operating systems, virtual machines, libraries, …) and
engineering a new generation of fault injection systems (“the
antifragile loves errors”, again).
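As a minimal sketch of what randomization at the library level could look like (the class and its names are hypothetical, not from the paper), consider a collection that deliberately shuffles its iteration order, in the spirit of Go's randomized map iteration, so that client code silently depending on ordering fails early rather than in production:

```python
import random

class ShuffledSet:
    """A set-like collection that iterates in a random order on purpose.

    Deliberate nondeterminism flushes out client code that silently
    depends on iteration order -- a small dose of injected "error".
    """

    def __init__(self, items=()):
        self._items = list(dict.fromkeys(items))  # deduplicate, keep one copy each

    def add(self, item):
        if item not in self._items:
            self._items.append(item)

    def __iter__(self):
        order = self._items[:]
        random.shuffle(order)  # a different order on every traversal
        return iter(order)

    def __len__(self):
        return len(self._items)

s = ShuffledSet([1, 2, 3, 4])
assert sorted(s) == [1, 2, 3, 4]  # contents are stable; only traversal order varies
```

A test suite that passes against such a collection is evidence that the client code does not secretly rely on ordering.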
On the process side, I hypothesize that antifragile development processes are better at producing antifragile software systems. What would an antifragile development process look like? Many blog posts draw a link between antifragility and agile development. I agree. The antifragile is not afraid of errors: strong continuous testing and continuous deployment go in that direction. Like agile, the antifragile is not a top-down approach: short iterations create a feedback loop from the user or the market. And a self-organized development team is an entirely “organic process”, in a sense that Taleb would probably appreciate. However, this is not enough. “Fault injection in the process” is the next step; I’ll discuss it in the next question.
What are some techniques for making software antifragile?
(architecture, design, tools, process)
Technically, I currently think that the best technique for making
software antifragile is fault injection in production. This means for
instance randomly crashing servers, as done by the well-known Simian
Army (a fault injection framework popularized by Netflix). There are many
kinds of fault injection in production, from flipping random bits in
memory to dropping network requests and messages, to shutting down
complete datacenters (“Chaos Gorilla”). We are only at the beginning of
engineering fault injection systems. The key idea behind fault injection
in production is that it constantly exercises the recovery code, which
then does not rot or decay. And it also strongly encourages the
engineers to write good error-handling code.
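To make the idea concrete, here is a toy sketch (all names hypothetical, not the Simian Army's actual API) of fault injection at the function-call level: a decorator makes a fraction of calls fail at random, so the caller's recovery path is exercised continuously instead of rotting:

```python
import random

def chaos(failure_rate=0.1, exc=ConnectionError):
    """Decorator that makes a call fail at random, chaos-engineering style."""
    def wrap(fn):
        def wrapped(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc(f"chaos: injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapped
    return wrap

@chaos(failure_rate=0.3)
def fetch_profile(user_id):
    return {"id": user_id, "name": "alice"}  # stand-in for a real remote call

def fetch_profile_with_retry(user_id, attempts=10):
    # The recovery code that constant fault injection keeps honest.
    for _ in range(attempts):
        try:
            return fetch_profile(user_id)
        except ConnectionError:
            continue
    return {"id": user_id, "name": "unknown"}  # degraded fallback

result = fetch_profile_with_retry(42)
```

Because roughly every third call fails, any regression in the retry or fallback logic surfaces within minutes, not during the next real outage.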
Then, on the process side, I’m a firm believer in the truck factor from agile development: your project must survive even if a truck runs over half of your team. This means that knowledge must be shared, as well as code ownership. One way to ensure this is to “simulate the truck”, by making sure that people move often across projects. “Simulating the truck” means fault injection in the process. There are other paths: restaffing key people, recruiting a very bad developer, randomly modifying the documentation, etc.
Taleb talks about iatrogenics, and not making the patient any
sicker – is this a concept that maps to software?
Absolutely. Fault injection in production must be appropriate. No
company can afford that 80% of its servers crash every hour. Indeed, one
must characterize the dependability losses (due to injected system
failures) and the dependability gains (due to software improvements)
that result from using fault injection in production. This is probably
the key part of antifragile engineering.
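A toy back-of-the-envelope model (all numbers hypothetical, purely illustrative) shows the trade-off being characterized: injected failures cost some downtime, but if constantly exercised recovery code shortens repairs enough, the gains outweigh the losses:

```python
HOURS = 30 * 24  # one month of operation

def availability(outages, mttr_hours):
    """Fraction of the month the system is up, given outage count and mean time to repair."""
    downtime = outages * mttr_hours
    return 1 - downtime / HOURS

# Without fault injection: rare outages, but rusty recovery -> long repairs.
baseline = availability(outages=2, mttr_hours=6.0)      # 12 h down
# With fault injection: many extra injected outages, but well-drilled recovery.
with_chaos = availability(outages=2 + 20, mttr_hours=0.2)  # 4.4 h down

assert with_chaos > baseline  # only then is the injection dose appropriate
```

If the inequality flips (too many injected failures, or repairs that do not actually get faster), the treatment is making the patient sicker and the dose must be reduced.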
Any recent trends that you see leading to more antifragile
software (e.g. microservices)?
Today, there is a clear trend towards advanced robustness. The “design
for failure” motto of the cloud community really goes in that direction.
But this is not antifragility per se.
The main trend towards antifragility is certainly the Simian Army. It is very visible and helps propagate the disruptive idea of fault injection in production. I see no other trend that is so strong and so directly related to antifragile software. Microservices are an interesting and rich concept. They are certainly directly related to robustness: many small services communicating asynchronously are better at avoiding the global crashes that monolithic applications suffer. But this is not yet antifragility. Microservices tend to come with a strong monitoring infrastructure, and this sense of self is good for adaptive and self-learning fault tolerance. But the key relation between microservices and antifragility is probably that they provide a natural granularity for fault injection: one simply shuts down microservices at random.
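That natural granularity can be sketched in a few lines (an in-process toy, with hypothetical service names, standing in for independently deployed services):

```python
import random

# Hypothetical service registry: name -> is the service up?
services = {"users": True, "billing": True, "search": True}

def chaos_step(registry):
    """Randomly shut down one service: microservices as the unit of fault injection."""
    victim = random.choice(list(registry))
    registry[victim] = False
    return victim

def handle_request(registry):
    # A robust caller degrades gracefully when a dependency is down.
    return {name: ("ok" if up else "fallback") for name, up in registry.items()}

down = chaos_step(services)
response = handle_request(services)
assert response[down] == "fallback"  # the killed service triggered its fallback
```

In a real deployment the same loop runs against orchestrator APIs rather than a dictionary, but the principle is identical: the service boundary defines what gets killed, and the monitoring infrastructure reports whether the rest of the system coped.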
More on http://www.monperrus.net/martin/antifragile-software