Should your application ‘Forgive!’ or ‘Fail fast!’ in case of a runtime error?

Whenever a run-time error occurs in an application there are two opposite error-handling approaches:

  1. The Forgive! approach: the application continues the execution and tries to do its best to circumvent the error;
  2. The Fail fast! approach: the application stops immediately and reports an error.

Which approach should you apply in your application?

To answer this question, let’s first look at a simple example.

Suppose we have to write a (rudimentary) web application that displays a warning message near a fountain to warn people that the water is polluted.

The following HTML code does the job:

<html>
   <body>
      <h2 style="color:red;">Important!</h2>
      <p>Please <b>DO NOT</b> drink this water!</p>
   </body>
</html>

The browser renders this as a red heading “Important!” followed by the sentence “Please DO NOT drink this water!”, with DO NOT in bold.

Now let’s insert a small error into the HTML code. Instead of </b> we write <b> after DO NOT, as shown below:

<p>Please <b>DO NOT<b> drink this water!</p>

Two interesting questions arise:

  1. What should happen?
  2. What will happen?

The second question is easy to answer. We just have to feed the buggy HTML code to our browser. Firefox 4.0, Google Chrome 11.0 and Internet Explorer 7.0 all display the same result.

Obviously, the Forgive! approach has been applied by the browsers because the application continued and did not report an error. The only difference to note is that more text is now displayed in bold. But the message as a whole is still displayed correctly and people are warned. There is nothing to worry too much about!

We can conclude: The Forgive! approach is good.

Let’s try another bug. Instead of <b> we write <b (without the closing >) before DO NOT, as shown below:

<p>Please <b DO NOT</b> drink this water!</p>

This is the result (again, as displayed in the browsers mentioned before): the words DO NOT are no longer shown, so the message reads “Please drink this water!”

Panic! Now the program does exactly the opposite of what it is supposed to do. The consequences are terrible. Our life-saving application has mutated into a killer-application (but not the kind of killer-application we all dream of writing some day).

We can conclude: The Forgive! approach is not good.
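For comparison, here is a minimal sketch of what a Fail fast! parser does with the same three snippets. It uses the strict XML parser from Python's standard library as a stand-in; this is an assumption made for illustration only, since browsers parse HTML with forgiving rules and do not behave this way. The names SNIPPETS and check are hypothetical.

```python
import xml.etree.ElementTree as ET

# The correct paragraph and the two buggy variants from the text above.
SNIPPETS = {
    "correct": "<p>Please <b>DO NOT</b> drink this water!</p>",
    "first bug": "<p>Please <b>DO NOT<b> drink this water!</p>",
    "second bug": "<p>Please <b DO NOT</b> drink this water!</p>",
}


def check(markup: str) -> str:
    """Return "ok" for well-formed markup, else a description of the error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        # Fail fast!: the parser refuses to guess what the author meant.
        return f"rejected: {error}"
    return "ok"


for name, markup in SNIPPETS.items():
    print(f"{name}: {check(markup)}")
```

With a strict parser, both buggy variants are rejected before anything is displayed, so the harmless bug and the dangerous bug are treated alike.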

Remark: It is important to note that the above example is not just a theoretical, exaggerated example. There are numerous real-life examples of ‘little bugs’ with catastrophic consequences, such as the Mariner 1 spacecraft, which was destroyed shortly after lift-off, a failure commonly attributed to a ‘missing hyphen’. For more examples, see: List of software bugs.

As we can see from the above example, the consequences of applying the Forgive! approach vary widely and can range from completely harmless to extremely harmful. So, what is the correct answer to the important question “What should happen?”

As is so often the case, it depends on the situation. There are, however, some general rules to apply.

The first rule is:

During development you should always apply the Fail fast! approach.

The rationale behind this rule is twofold and easy to understand:

  1. The Fail fast! approach helps in debugging. As soon as something goes wrong, the application stops and the error message helps to detect, diagnose and correct the error. Therefore the Fail fast! approach helps to write more reliable programs. As a result, this considerably reduces development and maintenance costs and prevents frustrations and catastrophes that would otherwise risk appearing in production mode.
  2. Unlike in production mode, bugs that appear in development mode do no real harm. The customer doesn’t complain, money doesn’t go to the wrong account, and rockets don’t explode.

The situation changes radically when the application runs in production mode. Unfortunately, there is no one-size-fits-all rule. Practice shows, however, that it is generally better to apply the Fail fast! approach by default here too. The final damage caused by an application that ignores an error and simply continues is generally worse than the damage caused by an application that stops suddenly (see also Murphy’s law). If an accounting application suddenly stops, the user is angry. If it continues and produces wrong results, the user is very angry. ‘Angry’ is better than ‘very angry’. Therefore, in this case the Fail fast! approach is better.
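The accounting case can be sketched as follows. This is only an illustration under assumed names: the two parse functions and the invoice_lines data are hypothetical, and the second amount deliberately contains the letter O instead of a zero.

```python
from decimal import Decimal, InvalidOperation


def parse_amount_forgiving(text: str) -> Decimal:
    """Forgive!: silently treat an unreadable amount as zero."""
    try:
        return Decimal(text)
    except InvalidOperation:
        return Decimal("0")


def parse_amount_fail_fast(text: str) -> Decimal:
    """Fail fast!: refuse to continue with an unreadable amount."""
    try:
        return Decimal(text)
    except InvalidOperation as error:
        raise ValueError(f"invalid amount: {text!r}") from error


# Hypothetical input; "25O.00" contains the letter O, not the digit 0.
invoice_lines = ["100.00", "25O.00", "40.00"]

# The forgiving version finishes, but the total is silently too low.
forgiving_total = sum(parse_amount_forgiving(line) for line in invoice_lines)
print(forgiving_total)

# The fail-fast version stops at the first bad amount and names it.
try:
    sum(parse_amount_fail_fast(line) for line in invoice_lines)
except ValueError as error:
    print(f"stopped: {error}")
```

The forgiving total looks perfectly plausible, which is exactly why the wrong result is likely to go unnoticed until much later.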

There are exceptions, however, and different situations must be studied carefully. This is especially true when the potential damage is so great that each case must be examined thoroughly, such as in medical applications, money transfer applications or space invader applications. For example, applying the Fail fast! rule is obviously the right approach as long as a rocket to Mars has not taken off. But as soon as the rocket has lifted off, stopping the application is no longer an option. Now the Forgive! approach must be applied and combined with ‘do the best you can do’ behavior.

Sometimes a good option is to fail fast, but minimize the damage. For example, if a run-time error occurs in a text editor application, the application should first automatically save the current text in a temporary file, then display a meaningful message to the user (“Sorry, … but your data is saved in file abc.tmp”), optionally send an error report to the developers, and then stop.
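The text editor scenario might look roughly like this. It is a sketch under assumptions, not a real editor: save_emergency_copy, run_editor and the edit_session callback are hypothetical names, and the wording of the message is invented.

```python
import sys
import tempfile


def save_emergency_copy(text: str) -> str:
    """Write the unsaved text to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".tmp", delete=False, encoding="utf-8"
    ) as backup:
        backup.write(text)
        return backup.name


def run_editor(edit_session, current_text: str) -> None:
    """Run a (hypothetical) editing session; on any error: save, report, stop."""
    try:
        edit_session(current_text)
    except Exception as error:
        # Step 1: minimize the damage by saving the user's text.
        path = save_emergency_copy(current_text)
        # Step 2: tell the user what happened and where the data is.
        print(
            f"Sorry, an unexpected error occurred ({error}), "
            f"but your data is saved in file {path}",
            file=sys.stderr,
        )
        # Step 3 (optional): an error report could be sent to the developers here.
        # Step 4: stop instead of continuing in an unknown state.
        raise SystemExit(1) from error
```

The application still fails fast; the only work done after the error is the work that protects the user's data.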

We can conclude:

  • In development mode you should always apply the Fail fast! approach.
  • In production mode:
    • You should generally favor the Fail fast! approach by default.
    • Critical applications that risk causing severe damage in case of a malfunction need customized, context-specific and damage-eliminating (or at least damage-reducing) behavior. Forgive and react appropriately! approaches must be applied in well-defined cases.

In this context it is also good to remember commandment number 6 of The Ten Commandments for C Programmers, written in archaic English:

“If a function be advertised to return an error code in the event of difficulties, thou shalt check for that code, yea, even though the checks triple the size of thy code and produce aches in thy typing fingers, for if thou thinkest ‘it cannot happen to me’, the gods shall surely punish thee for thy arrogance.”
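The commandment is not limited to C. As a small illustration in Python, str.find reports ‘not found’ by returning -1, which behaves like an error code. The two helper functions below are hypothetical and only show what happens when that code is or is not checked.

```python
def value_after_equals_unchecked(line: str) -> str:
    """Forgive!: use the result of find() without checking it."""
    position = line.find("=")  # returns -1 when "=" is missing
    # With -1 the slice starts at index 0 and returns the whole line.
    return line[position + 1:]


def value_after_equals_checked(line: str) -> str:
    """Fail fast!: check the error code before using it."""
    position = line.find("=")
    if position == -1:
        raise ValueError(f"missing '=' in {line!r}")
    return line[position + 1:]


print(value_after_equals_unchecked("timeout=30"))   # the value, as intended
print(value_after_equals_unchecked("timeout 30"))   # the whole line, silently wrong

try:
    value_after_equals_checked("timeout 30")
except ValueError as error:
    print(f"stopped: {error}")
```

The unchecked version never crashes, it just hands a wrong value to the rest of the program.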

In any case, your best friend is always a development environment that supports the Fail fast! principle. For example, a compiled language supports the Fail fast! rule because compilers can immediately report a whole plethora of bugs. Here is an example of a stupid bug that easily escapes the human eye, but which is caught immediately and with 100% certainty by the compiler:

var table_row_index = 1
...
table_row_indx = table_row_index + 1
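In a language without such a compile-time check, the same slip can go unnoticed. Plain Python, for example, would simply create a second variable. The sketch below shows one way to get at least an early run-time failure, using the standard __slots__ mechanism to forbid unknown attribute names. The TableCursor class is a hypothetical example, and this is a run-time check, not an equivalent of the compiler check described above.

```python
class TableCursor:
    """A cursor whose only permitted attribute is row_index."""

    __slots__ = ("row_index",)

    def __init__(self) -> None:
        self.row_index = 1


cursor = TableCursor()

try:
    # Misspelled attribute name: row_indx instead of row_index.
    cursor.row_indx = cursor.row_index + 1
except AttributeError as error:
    # The typo is reported at the faulty line instead of being ignored.
    print(f"caught at run time: {error}")

print(cursor.row_index)  # unchanged, because the faulty assignment was refused
```

The error is still found later than a compiler would find it, because the faulty line has to be executed first.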

Static typing is another example of a Fail fast! feature. This is because type incompatibility errors are all detected and must be corrected before the application can be executed.

Personally, I had to learn the importance of the Fail fast! rule the hard way. When I started to develop ERP software many years ago, I simply ignored the Fail fast! rule. Instead, I applied a ‘Forgive and hope that everything will be fine!’ rule. The (unanticipated) consequences were inevitable: angry customers and very angry customers, frustration and loss of time and revenue. I survived, but after long sessions of pondering I finally decided to develop a programming language with the main goal of embedding the important Fail fast! rule directly into the language itself, and to apply it consistently, not only in the language, but also in the libraries. The result is Obix. Obix incorporates over 20 Fail fast! features which are all realized by applying the following principle:

Errors should preferably be automatically detected at compile-time, or else as early as possible at run-time.

Obix is still in beta, but for more information you can read the FAQ or the chapter More reliability in the article Why Obix?

Comments, ideas and interesting links regarding the subject of this article are very welcome. Thank you.
