Wednesday, May 29, 2013

The Vindication of Joel Spolsky

A long time ago, my friends and I would often frequent bars and clubs with one express purpose: getting completely black-out drunk and going home with some easily-duped girl.  As it is when you are a group of twenty-something engineers going to a club, usually one of two things happens: you don't get admitted at all, or, if you do, you are escorted out before the end of the night due to an unacceptable B.A.C.  Ah, it was the blurst of times...

Around the same time, in one of the more notable posts by one of the more notable software bloggers, Joel Spolsky explained his policy to "never throw an exception of [his] own".  Those words were almost immediately pounced upon by the programming world at large, some going so far as to muse that perhaps his site had been hacked or that he was under the influence of some sort of narcotic while writing.  He compared exceptions unfavorably to a goto statement, and they compared him to an out-of-date stick-in-the-mud.  And the war between exceptions and status codes raged on.

This is sad, because it shows just how completely mystified software developers are when it comes to error handling.  No, not just Mr. Spolsky, and no, not just the blogosphere, but pretty much everybody on both sides of the war.  After reading that post and the reactions to it, it's no wonder PC software is mostly garbage.  Sadly, both groups are dead wrong about everything.  The correct answer when someone asks you whether you prefer exceptions or status codes is neither.

Ironically, had all programmers involved spent more time indulging in vice instead of waxing pedantic about error handling mechanisms, the world would be a much better place.  Fewer bugs, more features, happier customers, happier programmers, happier managers, and more regretful young women trying to sneak out the next morning.  Everybody wins, except of course the regretful young women.

If you go all the way back to 1975 and read the seminal paper on exception handling [Exception Handling: Issues and a Proposed Notation], it groups the ways in which a subprogram can fail into two basic categories: Domain failures and Range failures.  A domain failure occurs when "an operation's inputs fail to pass certain tests of acceptability", whereas a range failure occurs when "an operation either finds it is unable to satisfy its output assertion, or decides it may not ever be able to satisfy its output assertion".  In the modern lexicon, we normally call them preconditions (conditions that must be met for the subprogram to start) and postconditions (things that must be true for the subprogram to have completed successfully).

Those are sort of abstract definitions for our purposes, so let's put things in more relatable terms.  For me and my reprobate friends, the operation is, of course, Get_Laid.  The precondition is Admitted_To_Club, and the postcondition is P_In_V.  The body would look something like this pseudo-code:

begin
   while Is_Sober loop
      Consume_Martini;
   end loop;

   while Rejected loop
      Talk_To_Girl;
      Consume_Martini;
   end loop;

   Take_Girl_Home;
end;

Easy enough, right?  But anyone who's tried this particular approach knows it's an idealized version of a very complex task.  As mentioned before, we often either get stopped by the bouncer at the door, or tossed out mid-subprogram due to loud, uncouth, or generally lecherous and lascivious behavior.  If we never get into the club at all, a precondition has failed and the operation never starts.  If we get escorted out halfway through, a postcondition cannot be met and the operation ends early.

But what brought death into our programming World, and all our woe, is that the difference between domain failures and range failures is in the eye of the beholder.  If we don't get admitted to the club, then clearly we are not getting laid tonight, the output assertion cannot be met, and as such it could be considered a range failure.  Similarly, we might say that staying reasonably sober is a precondition to going out, and that someone getting ejected means a domain failure has occurred.

Take, for instance, the venerable strtol() function, which converts a given string to its integer representation.  This has the obvious requirement that the supplied string represent a valid in-range integer for the subprogram to convert.  So, supposing that the client supplies a string that doesn't represent an integer, do we consider this a domain failure or a range failure?
  
On the one hand, you can take the overly-narrow view that an interface contract is something defined exclusively by the language.  The function declaration requires a const char *, and that's all there is to it (or, as a coworker once argued, "anything else is a comment, not a contract").  So long as the client supplies a valid pointer to an array of characters, the interface has been pedantically met, and the input assertions have passed.  If the string cannot be converted, you've got yourself a range failure.

On the other hand, you could take the overly-broad view that an interface contract is something conceptual, defined by the developer.  The domain of the function is not simply an array of characters, but characters representing a valid integer.  A language that is not expressive enough to establish this condition (apart from a comment in the header) is a failure of the language, not the code.  If the string cannot be converted, it's because a precondition wasn't met, and you've got yourself a domain failure.
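
To make the two views concrete, here is a rough sketch in Ada terms (the package and every name in it are invented for illustration; this is not any real library's interface, and the bodies are omitted):

package Conversions is

   --  The narrow view: any String satisfies the contract, so an
   --  unconvertible string is a range failure, reported back to the
   --  caller through Success.
   procedure To_Integer
     (Source  : String;
      Value   : out Integer;
      Success : out Boolean);

   --  The broad view: only strings that actually represent an integer
   --  are in the domain, so an unconvertible string is a domain
   --  failure, i.e. a bug in the caller.  (Ada 2012 aspect syntax.)
   function Represents_Integer (Source : String) return Boolean;

   function To_Integer (Source : String) return Integer
     with Pre => Represents_Integer (Source);

end Conversions;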

But don't think for a moment that this is just a semantic debate.  Depending on your point of view, the same exact failure has completely opposite ramifications.
  
The consequences of a domain failure are quite simple: there is a programming error in the calling code.  It was the client programmer's responsibility to ensure the condition was met before the call, but that wasn't done, and so the calling subprogram is at fault and must be fixed.  Simply put, there is a bug in the calling subprogram.  Unless your code has become self-aware, the only thing to do about it is crack open the source, fix the flawed subprogram, recompile it, and redistribute it.  This means reporting the issue back to the caller is, generally speaking, useless.  After all, we know the code cannot be trusted to properly verify its inputs, so we certainly can't expect it to properly verify the outputs (though you would be shocked how often programmers expect this...).
  
The consequence of a range failure is a horse of a much different color: the subprogram failed because of some condition that could not be detected until the middle of the call itself.  In this case, the caller doesn't know the condition exists yet, and so it must be reported back so they can evaluate the failure and resolve it.  This is not necessarily a bug in the code, since presumably the calling subprogram can (potentially) fix the problem and try the operation again.  We are essentially returning a status back to the caller, which may be good, bad, or indifferent: the onus is on them to decide what to do.  This status can take the form of a return value, an exception, an ON ERROR handler, a 'long jump', or any number of other mechanisms.  The point is that we must alert the caller somehow so he may take corrective action.

And this is where things get very, very complex.  Suppose, as before, that someone passes a string to strtol that doesn't represent an integer.  Is that a bug in the calling code?  Or is that a condition that has to be reported back to them so they can fix it?  Is the program executing as expected, or do we have undefined behavior?  Are things ok or not?  How is it that the same condition can mean two totally opposite and opposing things?!

So we start to see two competing methodologies emerge.  We've got the pessimistic "assert style" domain failures, that should never occur in a properly written program and, if they do, indicate a static programming error that can never be handled at runtime ('handled' in the sense of 'fixed').  Then we've got the optimistic "status style" range errors, which involve returning information to the client (in some unspecified form) so that he can take corrective action and attempt the operation again.  And back in the primitive days of C, this was how programs were written: assert macros checked for domain errors and status codes indicated range errors.
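
Translated from C into Ada terms, a minimal, self-contained sketch of that split might look like this (every name in it is made up for this post):

with Ada.Text_IO;

procedure Assert_Versus_Status is

   type Status_Type is (Ok, Not_A_Number);

   --  Status style: an unconvertible string is a range error, handed
   --  back to the caller to evaluate.
   procedure To_Natural
     (Source : String;
      Value  : out Natural;
      Status : out Status_Type) is
   begin
      Value  := 0;
      Status := Ok;
      for C of Source loop
         if C not in '0' .. '9' then
            Status := Not_A_Number;
            return;
         end if;
         Value := Value * 10 + (Character'Pos (C) - Character'Pos ('0'));
      end loop;
   end To_Natural;

   --  Assert style: an empty string here means the caller is broken,
   --  so there is nothing to report and nobody worth reporting it to.
   procedure Process (Source : String) is
      Value  : Natural;
      Status : Status_Type;
   begin
      pragma Assert (Source'Length > 0);
      To_Natural (Source, Value, Status);
      if Status = Ok then
         Ada.Text_IO.Put_Line ("Converted:" & Natural'Image (Value));
      else
         Ada.Text_IO.Put_Line ("Could not convert: " & Source);
      end if;
   end Process;

begin
   Process ("42");
   Process ("blurst");
end Assert_Versus_Status;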

But assert macros had the unpleasant side effect of dumping the entire program, all at once.  No warning, no chance to clean up, no ability to continue other tasks, log errors, continue in a reduced-functionality mode, or even present an apologetic message to the user.  It just ends.  And while it's true that a domain error can never be fixed at runtime, we still might want to carefully avoid the error and keep other isolated portions of the program running, log some sort of error, continue with partial functionality, or take some other casualty action.

So if you are writing some general-purpose, reusable library or framework, you can't very well assert on anything since, after all, you don't know what action the application programmer wants to take.  If they are using your math library for keeping an aircraft in the air, an assert is a very bad idea.  So in the interest of generality and usability, most libraries and frameworks avoided asserts in favor of status codes.  Or, in other words, they turned all domain errors into range errors.  The intention wasn't that the client should always be able to "fix" the problem; it was to let the client decide the appropriate action which, in most cases, is just to assert.

But if the road to hell is paved with good intentions, it's also lined with status codes.  For whatever reason, most programmers just followed the precedent set by the libraries, and kept returning status codes within their own programs.  Maybe it was from hasty, unplanned attempts at component-based development.  Maybe it was just cargo-cult mimicking of what other software already did.  Maybe it really was an honest attempt to write reliable code.  But in any case, the idea of a "domain error" became all but extinct.

So when C++ added exceptions, most just figured they were a better way to return a status code.  After all, one of the annoying parts of using status codes is that they use up your one return value.  And lo and behold, an exception lets you return both a status and a result, with a minimum of fuss.

But they were wrong.  The proper use of an exception is not to return range errors; it is a structured mechanism for reporting domain errors.  It's like an assert with some semblance of control.  It's one (small) step up from a crash.  It's a better way to end a malfunctioning program, emphasis on malfunctioning.  An exception is just a way to shut down a broken program.

The key to creating safe, bug-free programs that work is to reverse this trend: turn all range errors back into domain errors.  Looking for a better way to return status information to the client misses the forest for the trees: eliminate the need to return the information. The tricky part about this, however, is that it's not just some syntax icing to smear over your already-written subprogram: it normally involves a complete redesign of the program in question, if not the entire architecture of the system.

Let's revisit the code for our night of debauchery from above, and refactor it a little bit.  Currently, we are considering "getting thrown out of the club for being drunk and disorderly" as a range error.  The question is not "how do we report this?", but instead "how do we stop this from happening?"  The obvious way is to have a designated sober person keep the alcohol intake of the rest of the group in check, so we can simply add a precondition check to the start of the subprogram:

if not Group_Has_Designated_Chaperone then
    raise NO_CHAPERONE;
end if;

(Note, of course, these days you ought to use the 'Pre' aspect, but I write it like this to accommodate those still unfortunate enough to deal with C++.)
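
For reference, the aspect form would look roughly like this (a sketch of the declaration only, reusing the made-up names from this post; the Boolean functions behind them are left to the imagination):

procedure Get_Laid
  with Pre  => Admitted_To_Club and Group_Has_Designated_Chaperone,
       Post => P_In_V;

A failed Pre here raises Assertion_Error on its own (assuming assertion checks are enabled), which amounts to the same thing as the explicit raise above.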

Note that we are still raising an exception, but it means something entirely different.  One of the preconditions for getting laid is to make sure the group has a designated sober person before we go out.  So long as this is true, there is no chance of us getting kicked out for being drunk, and everyone gets laid.

And no matter how many reasons you can come up with for not getting laid, there is always a way to avoid the situation.  Maybe the girls don't like our clothes?  Let's add a precondition of Dressed_To_The_Nines.  Maybe we can't find girls to talk to?  We have a precondition of Club_Has_Fat_Chicks.  Just keep adding preconditions until there is no possible way that the subprogram can fail to produce a result.  Our postcondition (P_In_V) is always met!

Or, in other words, write a subprogram that can't ever possibly fail, so long as its preconditions are met.  Error handling is a proactive process, not a reactive one.  Engineering, software or otherwise, is about not having exceptional conditions.  It's about the system doing exactly what you expect, and nothing else.  A properly coded program should raise an exception at every possible point, but never ever have it happen.

This makes the error-handling debate a moot point, since there won't be any more errors to handle.  Once the client calls Get_Laid (with the appropriate conditions met), there is never any chance of it failing.  We can just call the subprogram, confident we will always get laid, and forget all about checking to make sure it did.  Moreover, if it does fail, recall from above that this indicates a static programming error, either in the client or the server, and so the only appropriate thing to do is clean up and end the program (or task, etc), which an unhandled exception will do for us automatically.

For a more practical example, consider the following code abortion:

begin
   loop
      read_next_byte;
      process_byte;
   end loop;
exception
   when End_Of_File => Put_Line ("Success!");
end;

I suppose this is a valid way to code, but any project that tries it is likely doomed.  If we want to read a file to its end, then reaching the end is clearly not an exceptional condition!  It's what's supposed to happen!  It's a required condition!  Contrast it with the following:

while not Is_EOF loop
   read_next_byte;
   process_byte;
end loop;
Put_Line("Success!");

Remember the difference between domain errors and range errors?  In the first case, reading past the end of the file is an ambiguous range error; it could mean the successful completion of the program, or it could mean we are stuck in an infinite loop because of some coding error.  In the second case, however, reading past EOF is always a bug.  The task needs to shut down immediately, because the program is broken, and exceptions not only shut things down but also tell you the line number of the offense.  It's as if the compiler is doing the debugging for you!

It's at this point that someone always tries to point out a situation where this isn't possible, and where you just have to use an exception.  These people are wrong.  That's not to say there aren't situations where it's more convenient to use an exception, or where you can rationalize using one, or where you are forced to use one to interface with some other component, but that's a far cry from it being impossible.

A common one is a heap allocation. You use new to allocate a chunk of memory, but if there isn't enough room, you get a Storage_Error.  There's no way to see if there is enough room on the heap before you make the call, so apparently we have to return this to the caller to let him deal with it after the fact.

But that's not really true, is it?  The default heap has no mechanism to ascertain the remaining amount of space, but who says we have to use the default heap?  Is it impossible to write our own user-defined storage pool and add a Remaining_Bytes subprogram so that clients can ensure there is enough room before trying the allocation?  Not exactly easy, but then again who said writing safe code was easy?

Another supposed example is multithreading.  If we want to have multiple threads access our fancy new custom heap, adding the requirement to have enough room creates a nasty race condition: two threads both pass the precondition, and then one of them gets the last remaining few bytes, while the other fails.  (Note, however, that this is exactly what is supposed to happen: the exception alerts us to a static programming error as soon as it happens, instead of leaving us to diagnose the race condition by hand.)  But this simply means our fancy heap must include Lock_Heap and Unlock_Heap subprograms.
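
For the sake of argument, the spec of such a pool might look something like this (a sketch only: the extra operations mirror the names used below, the body and the actual bookkeeping are omitted, and none of this is a standard library API):

with System.Storage_Elements;  use System.Storage_Elements;
with System.Storage_Pools;

package Checked_Heap is

   type Pool_Type is new System.Storage_Pools.Root_Storage_Pool with private;

   overriding procedure Allocate
     (Pool                     : in out Pool_Type;
      Storage_Address          : out System.Address;
      Size_In_Storage_Elements : Storage_Count;
      Alignment                : Storage_Count);

   overriding procedure Deallocate
     (Pool                     : in out Pool_Type;
      Storage_Address          : System.Address;
      Size_In_Storage_Elements : Storage_Count;
      Alignment                : Storage_Count);

   overriding function Storage_Size (Pool : Pool_Type) return Storage_Count;

   --  The additions that turn Storage_Error from a surprise into a
   --  checkable precondition.
   function Remaining_Bytes return Storage_Count;
   procedure Lock_Heap;
   procedure Unlock_Heap;

private

   type Pool_Type is new System.Storage_Pools.Root_Storage_Pool with record
      Free : Storage_Count := 0;  --  bookkeeping goes here
   end record;

end Checked_Heap;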

So yes, all this means that instead of writing this:

x := new Integer'(42);

it balloons into this:

Lock_Heap;
if Remaining_Bytes > Integer'Size/8 then
   x := new Integer'(42);
else
   Do_Something_Else;
end if;
Unlock_Heap;

Like before, an exception in the first version is ambiguous.  Does a Storage_Error indicate a misbehaving memory leak that should never happen, or is it a "successful" detection of a low memory condition?  There's no way to tell.  Sometimes it's success, sometimes it's failure.  But it's a moot point in the second version, because it never happens.  We don't react to low memory conditions, we prevent them before they happen.

So was Joel Spolsky right?  Far from it.  He's just a pragmatic programmer, fed up with decades of C++ programmers crying wolf, trying to fake success by misusing a mechanism designed to indicate failure.  But like he said, you do need to be able to read the code.  Exceptions are invisible, they do obfuscate the control flow, and they are a precarious way to return information to the caller.  They are just like a goto, but that's okay, because the only place they're supposed to go is directly to the end.

So yes, use exceptions.  Throw everywhere you can, but don't ever catch.  You'll be surprised not only by how reliable your programs start to become, but also by how much more successful you are at picking up chicks.  And really, isn't that what it's all about?