Sunday, May 24, 2015

Elaboration Station

One of the great niceties of programming is that, fundamentally, programs execute one statement at a time, plodding along in a nice, neat, well-defined order.  It begins at the first instruction and ends at the last, and nothing out of the ordinary (threads and interrupts notwithstanding) messes that up.

Well, almost nothing.  Programs have but a single entry point, but most languages, Ada included, have the ability to declare objects at a scope just above the main subprogram.  We normally call these global variables, or library-level objects, or module level objects, or by whatever other name your particular flavor of language calls them.  But by any name, these objects get created before the main program starts, and live (just) past where the program ends.

(Note: I will talk exclusively about initializing 'global' variables in regular units, regardless of whether they are in the spec or a body.  As always, this is an over-simplification, and ignores things like child/parent packages, and other elaboration time code like library level tasks or packages containing initialization code.  However, the rules are essentially the same).

This opens up an interesting loophole in the "start at the beginning" philosophy, because this means there are actually TWO beginnings to any program.  The program has to go through and initialize all these global variables to whatever their initial values are supposed to be, before the main function begins in earnest, so that when it comes time for the 'real' beginning, everything is as it ought to be.

But in any modestly complex language, this is a much more subtle and difficult problem than it may at first appear.  Usually we just initialize constants to values, such as "pi : constant Float := 3.14", and there are no particular concerns, nor for that matter anything the program actually has to do at runtime (the compiler can normally handle this sort of static object itself, as a hardcoded value in the object file).  But, as always, general-purpose languages have general-purpose problems.

We have to consider the unpleasant fact that these global objects can also be initialized from other expressions apart from numeric literals; particularly, we might initialize them from other objects in other packages, or from the results of function calls (among others). For example:

with P1, P2;

package X is
  R : constant Float := P1.Calculate_Radius;
  C : constant Float := 2.0 * P2.pi * R;
end X;

Here, we initialize R from the result of a function in another package (P1), and then initialize C via a calculation using both the previous object R and our declaration of pi in a third package (P2).  We can quickly see this start to spiral out of control, because what can be said about the Calculate_Radius subprogram?  It might execute any arbitrary code, perhaps prompting the user for a value, or communicating with a satellite to retrieve a value from the Mars rover.  In general, it might execute any code at all.
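For concreteness, the two supporting packages might look something like this.  (A sketch: P1, P2, and Calculate_Radius come from the example above, but the function body shown is a hypothetical placeholder.)

package P2 is
  -- A static constant: the compiler can handle this with no runtime code.
  pi : constant Float := 3.14159;
end P2;

package P1 is
  -- Using this function to initialize a library-level object means it
  -- runs at elaboration time, before the main program starts.
  function Calculate_Radius return Float;
end P1;

with Ada.Text_IO;
package body P1 is
  function Calculate_Radius return Float is
  begin
    -- Arbitrary code can run here: prompt a user, query a rover...
    Ada.Text_IO.Put_Line ("Calculating radius...");
    return 5.0;
  end Calculate_Radius;
end P1;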

The issue at hand is that we've got all these packages with all these objects that need initialization, and no good way to figure out the order in which to do it.  For instance, in the above example, we need to ensure that P2 (with the 'pi' constant) is initialized before we try to use it to calculate C (or else the program is nonsense).  Of course, if the calculation of pi depends on some other unit, we have to do that unit before P2, before X, and so on and so on.

To help untangle this mess of "elaboration order", Ada has basically two rules governing the order in which the compiler should go through and initialize all the units:

  1. A package spec must be initialized before its corresponding package body.
  2. The package spec of any 'withed' unit must be initialized before the unit that 'withs' it.

But the perhaps unexpected thing about these rules is not just that you end up with several different potential elaboration orders, but that not all of them are necessarily correct.  To account for this, Ada also mandates that if the order the compiler selects is wrong, an exception should be raised when you start the program and stop things dead in their tracks.

This raises two interesting questions.  First, why don't we have rules that make sure the order is correct in the first place, and second, why do we get a run-time exception instead of a compile-time failure?  After all, at first blush, these rules seem pretty good.  A global variable can't be initialized from anywhere except either something in its spec, which is guaranteed to be there by rule #1, or another unit that is 'withed in', which is handled by rule #2.  So what gives?

If you go back and re-read rule #1 like a language lawyer, you might note a conspicuous omission.  It requires that a unit's spec must be initialized before its body, but not immediately before, only at some point before.  The compiler is allowed to try and elaborate any number of other units between a spec and its body, and this creates interesting loopholes.

For instance, suppose a global variable in the body of A is initialized to the result of a function declared in B.  Ada's built-in rules require the following:
  1. <A> is elaborated before [A] (rule 1)
  2. <B> is elaborated before [B] (rule 1)
  3. <B> is elaborated before [A] (rule 2)
 
Note that both the following orders are legal in accordance with our rules:
  • <B> - <A> - [B] - [A]
  • <A> - <B> - [A] - [B]
but the latter one would result in the dreaded "access before elaboration" (in Ada terms, a Program_Error).  We elaborate the body of A, including the global variable, which executes code from [B] that has NOT been elaborated yet, and we have a flawed program with undefined values.  In the first order, however, [A] is not elaborated until after [B], and everything is acceptable.
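To make this concrete, here is a minimal sketch of such a pair of packages (all names hypothetical):

package B is
  function Initial_Value return Integer;
end B;

package body B is
  Table : Integer;  -- given a meaningful value during [B]'s elaboration

  function Initial_Value return Integer is
  begin
    return Table;  -- garbage unless [B] has already elaborated
  end Initial_Value;
begin
  Table := 42;  -- the statement part of [B], run at elaboration time
end B;

package A is
  procedure Run;  -- forces A to have a body
end A;

with B;
package body A is
  -- Elaborating [A] calls Initial_Value; under the order
  -- <A> - <B> - [A] - [B], Table has not been set yet.
  X : constant Integer := B.Initial_Value;

  procedure Run is null;
end A;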

Why then can't we just tweak the LRM to specify that a body should be elaborated DIRECTLY after its spec?  It would appear that such a modification would be a good thing, since it would narrow down our choices to just a single, correct order:
  • <B> - [B] - <A> - [A]
The problem is that, as we said before, general-purpose languages have general-purpose problems.  Such a modification would be far too restrictive, since it would impose a strict hierarchical structure on a program.  We could not, for instance, have two mutually recursive functions in different packages, or finite state machines, or many other interesting and fun software solutions that involve code in two different packages calling into each other.

If we were to add such a 'direct' elaboration rule, these examples would become illegal, despite not having any actual elaboration problems.  Consequently, we have to leave some ambiguity in the order so that legal, albeit niche, programs are still possible.
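As a sketch of such a legal-but-circular structure, consider two mutually recursive functions split across packages (hypothetical names):

package Evens is
  function Is_Even (N : Natural) return Boolean;
end Evens;

package Odds is
  function Is_Odd (N : Natural) return Boolean;
end Odds;

with Odds;
package body Evens is
  function Is_Even (N : Natural) return Boolean is
  begin
    return (if N = 0 then True else Odds.Is_Odd (N - 1));
  end Is_Even;
end Evens;

with Evens;
package body Odds is
  function Is_Odd (N : Natural) return Boolean is
  begin
    return (if N = 0 then False else Evens.Is_Even (N - 1));
  end Is_Odd;
end Odds;

Rule #2 requires <Odds> before [Evens] and <Evens> before [Odds], so neither body can come directly after its spec; yet nothing here runs at elaboration time, so there is no actual elaboration problem to worry about.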

But while this answers the question of why the LRM cannot specify rules that work for every program in every situation, it doesn't really answer why they can't put the onus on the compiler to find an order that works on a program-by-program basis.  The compiler has to analyze the source code, right?  Why can't we program it to be smart enough to check for elaboration dependencies as it goes, to narrow down the potential list to one that's correct?

For example, let's take the same example from above.  The compiler identifies the global variable in the body of [A], identifies it as being initialized from a call into <B>, and can in theory prune down the list of potential elaboration orders so that only those with [B] ahead of [A] are acceptable.  If we have some sort of untenable situation, such as global variables in both [A] and [B] that are initialized from functions in <B> and <A> respectively, then we can spit out an error indicating that no correct order is ever possible, and thus prevent bad programs from ever being run at all.

Again, at first glance this seems like an acceptable rule, and in fact early drafts of Ada actually did require these sorts of checks.  But at the risk of repeating myself, general-purpose languages have general-purpose problems.  Don't forget, *we can run any arbitrary code* during elaboration, and nothing says that code has to work the same way every time.  It can contain any manner of if/else blocks, which means the behavior (and success or failure) of the initialization can differ drastically from run to run.

Suppose we write the following function, which perhaps questionably puts the code for gladly paying people Tuesday for hamburgers today into the calculation of how many hamburger liabilities we currently have:

function Hamburger_Debt return Integer is
begin
  if Day_Of_Week = Tuesday then
     Hamburger_Database.Gladly_Pay_All_Lenders;
     return 0;
  else
     return Current_Liabilities.Hamburgers;
  end if;
end Hamburger_Debt;
 
For better or worse, this code is legal.  But if somewhere in another unit we have a global constant (perhaps "Starting_Hamburger_Debt") initialized to the return value of this function, the function gets executed during elaboration, and the real fun begins.

Now we are in real trouble, because the elaboration order now depends on what day of the week it is when we start the program.  The compiler needs to elaborate Hamburger_Database prior to whatever unit our function is in, but only on Tuesdays.  On any other day of the week, we never touch the hamburger database and don't care when it's elaborated, but then we need to ensure the Current_Liabilities package is elaborated first instead.  The required elaboration order changes based on the date, which we obviously can't know until we run the program!

Alright, perhaps you argue the compiler should take a "worst case" approach, and say that any unit referenced at all, regardless of any control transfers, should be considered as a potential elaboration issue, and we should sort things out like that.  But then what do we do with something like this:

function F return Integer is
begin
  if False then
    return P2.X;
  else
    return 0;
  end if;
end F;
 
Obviously this is dead code, and no elaboration order problem can possibly exist (in fact, this entire subprogram can be optimized away).  But the 'worst case' approach we just described performs no reachability analysis: the mere reference to P2 counts as a dependency.  Should this code, statically known to be free of any elaboration order problems, fail to compile because of a non-existent elaboration problem?

So we can see that trying to statically prove something that's inherently dynamic is a slippery slope.  The 'correctness' of a particular elaboration order can only be determined by somehow knowing the outcome of every if statement, which even in the best case means exhaustively testing all the possible combinations of states, and even then dynamically choosing how to elaborate packages as decisions are made.  Clearly this is hard, not just in the algorithmic sense but certainly in the programming sense, and it's unlikely any compiler could ever hope to implement it (and still have programs compile before the sun burns out).  We must either be overly strict and reject legal code (as is the case with accessibility), or be dynamic and account for the possibility of runtime failures (as with elaboration).

And so, at the end of the day, the compiler has no other recourse than to simply dump the responsibility of elaboration order on the programmer, but then double check at runtime that the order was sufficient (e.g. tracking the initialization state of each package, and verifying it has been set prior to access).  Various pragmas exist to let the programmer try to solve their own unsolvable problems, which perhaps we'll see in a future post.
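As a sketch of what that runtime check amounts to (the real compiler uses internal flags, but the effect, written out by hand with hypothetical names, is roughly this):

package B is
  function Some_Value return Integer;
end B;

package body B is
  Elaborated : Boolean := False;  -- the package's initialization state
  Cache      : Integer;

  function Some_Value return Integer is
  begin
    -- The moral equivalent of the compiler-inserted check:
    if not Elaborated then
       raise Program_Error with "access before elaboration";
    end if;
    return Cache;
  end Some_Value;
begin
  Cache      := 42;
  Elaborated := True;  -- set once [B]'s elaboration completes
end B;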

But wait, there's more: GNAT stirs up the already murky water by deviating from the LRM-specified behavior: it actually attempts to statically prove correctness, at the expense of being overly pessimistic.  The default behavior of the binder is to take the 'worst case' approach detailed above, analyzing the code for constructs that create elaboration-time dependencies (e.g. global variables initialized from calls).  Should any be found, it takes up the slack for the programmer and ensures the "right" order is used.  But as we said, this means GNAT is "overly strict", and some valid Ada programs fail to build under GNAT due to reported elaboration circularities.  I suppose they feel the trade-off is one worth making (and, of course, switches exist to fall back to the less-strict Ada rules).

Like most non-standard GNAT behavior, this is both a blessing and a curse.  On the plus side, it means the program is proven, statically, never to fail the elaboration check, which in turn means the check itself is unnecessary.  No chance of failure and faster startup times are a good thing.

But at the same time, it also means your program is not necessarily portable between compilers, or for that matter even between versions of GNAT (the static checker gets better each time around, so dubious situations marked 'safe' in one version might suddenly no longer compile).  More importantly, though, it makes us all dumber.  Many programmers go their entire careers without ever realizing elaboration is a thing or considering how it might fail, let alone inserting the appropriate pragmas; in some (rare) cases, programmers have even made a concerted effort to take them out (true story).
      
Moreover, apart from being good at your job just for the sake of it, worrying about elaboration dependencies makes your code better.  It's not a coincidence that most of my outlier, problem-causing examples seem convoluted; code is cleaner and better when these sorts of dependencies don't exist in the first place (Text_IO, I'm looking in your direction!).  Having to deal with them is the best incentive to restructure your code so that you don't have to deal with them, and GNAT is making us all lazy by cleaning up our mess.

So that's the problem; but what's the solution? To be continued...
