Saturday, July 13, 2013

Abstraction Extraction


Thirty years ago Fred Brooks wrote "The best way to attack the essence of building software is not to build it at all", but for decades, reuse has stubbornly refused to graft to the software lifecycle.  Programmers continually reinvent the wheel, or even worse, apply an open-source style of "copy and paste reuse" by tweaking what others have done (or sometimes just copying it verbatim), calling it 'new', and duplicating the workload for everyone.  Software cost is proportional to lines of code, yet we all wrong-headedly reward programmers for the amount of code they add!

For a long time, the concept of "object orientation" was heralded as the solution to this problem.  Programmers naively bought OOP languages, and some were shocked and appalled when their original problems remained (and were often exacerbated).  A backlash inevitably occurred, and now programmers are either brainwashed cult members of a failed technology, or aged moldy-figs who refuse to embrace change out of fear and laziness.

Here's the rub: OOP represents about a shitzillion different programming concepts, all mashed together under one catch-all buzzword.  Ten different programmers will claim they use OOP, all have completely different styles, and all be technically correct.  But "technically" using OOP, especially in this piecemeal fashion, doesn't net you the massive, oversold gains that people have been promising for decades.  In fact, most programmers sprinkle a little OOP here, a few objects there, come up with a program that is no better than the original procedural program (and often much, much worse), and deride OOP for failing.

Used properly, OOP is great, but almost nobody uses it right.  In fact, many programmers aren't even sure what it is, apart from putting your arguments before your subprograms.  Most definitions say that object oriented programming is programming oriented around objects, or some other form of self-referential nonsense.  Even "good" definitions can't boil it down any further than a list of no fewer than five different concepts, each a confusing buzzword in its own right.

And so for the most part, programmers are just confused.  They use OOP technologies to create obfuscated procedural programs that cost more but don't do more, and procedural programmers laugh at them behind their backs.  A good example of this confusion is the Perl CGI.pm module.  It's often advertised as having "both" styles of programming: you can call it as a normal procedure:

print header;

or as a fancy object:

print $q->header;

As if these were any different at all!  If anything, the second one is worse, since it accomplishes the same exact thing in a less readable, more confusing way.  Yet time after time, people read that "OOP is better", and so programmers use the second method, and assume that their programs are now 'better' because they used OOP.  If all the great men of history had beards, does growing a beard make you a great man?  Then why would simply putting your variable before the function call, instead of passing it as an argument, make your program great?

So I will try to simplify things:

OOP is programming by abstraction.

It has nothing to do with syntax, inheritance, dynamic dispatch, private data types, or anything else random books and tutorials have told you.  You can meet all the criteria listed on the Wikipedia page, but have a crappy, non-OOP program.  Or, on the other hand, you can have a strictly procedural program without any of the OOP features at all, and still have it be highly abstract and 'object oriented'.

And so the first conclusion we can draw from this, however counterintuitive, is that using OOP doesn't require an OOP language.  Abstraction is a style of programming, not a syntax, and you can achieve that in anything from Java to C to Ada83 to Ada2012 to assembly language.  The only difference is that some languages have different degrees of syntax support to help automate some of the more common techniques.  But somewhere along the line, this concept of abstraction got lost in the shuffle.

Historically, the problem has been pundits trying to carefully cut narrow lines around what features are on the imaginary checklist a language needs to meet to be classified as OOP.  Such a list doesn't exist, yet people still insist that 'their' language doesn't need feature 'x' to be object-oriented, while claiming some 'other' language requires feature 'y'.  For instance, many Ada proponents say Ada83 was object oriented, dynamic dispatch be damned, since it had the private data types that C lacked.  Others derided Ada95's lack of 'postfix' notation as being non-OOP.

But these arguments miss the point: all OOP features, from private data types to postfix notation to dynamic dispatch, are just syntax.  It's the abstraction that matters, and there are no points for style.  A god-awful mess in the purest, most elegant OOP language around is still a god-awful mess, whereas a robust, abstract program written in assembly language is still a robust, abstract program.  The key is that using a so-called "OOP" language usually makes writing good, abstract programs easier (but certainly in no way prevents you from writing garbage).

So then, how do we achieve this wonderful abstraction?  Perhaps we should start by seeing what "abstract" really means:

abstract
adj [ˈæbstrækt]
     1. having no reference to material objects or specific examples; not concrete

Or, in other words, something that is 'abstract' is a general idea or concept, and something that is 'concrete' is a specific mechanism or item that achieves an abstract concept.  For instance, 'having fun' is an abstract idea.  'Banging a girl' is a concrete action that achieves that abstract idea (well, sometimes).

But right off the bat, we can see that one person's concrete action is another person's abstract idea, and vice-versa.  Before we said 'having fun' was an abstract idea, but that itself could be considered a concrete action of the abstract idea of "how to spend a Friday night" (along with "drinking alone", "watching TV" or "reading a book").  Or, on the other hand, "banging a girl" could be considered an abstract idea that encompasses the concrete actions of "rough sex", "kinky sex", and so on.

So then what does this have to do with computer programming?   Why should we make our programs "abstract" at all?  What does it buy us?  Why even bother?

The key here is change.  Writing some code is just the start of the (professional) programming lifecycle, which normally involves fixing, updating, tweaking, enhancing, and (hopefully) reusing that code for the next twenty or even thirty years.  There is no such thing as "throwaway" code; there is only shitty code that gets written with the hope it will be thrown away, but ends up getting hacked up and reused over and over, much to the chagrin of the original programmer.

And changing code is the ultimate programming sin.  Every time you change code, even if it's just one character in one file, you open yourself up to all sorts of programming woe:
  • Is it still going to work?
  • Will all the other units still work?
  • Will we introduce new bugs that will have to get fixed later?
  • How long will it take us to even be able to understand the code to change it?
  • Will there be other 'problem areas' we will have to fix once we open it up?
  • Is the new change backwards compatible, or will we need separate versioning? 
  • Will we have to retest the entire program?
  • How much overhead (reviews, TPS reports, paperwork) will it take?
Of course, there are plenty of programming situations in which you are not beholden to any of this.  That cute iPhone game that will be obsolete in a week probably won't see much reuse, or justify too many bug fixes.  But if you are writing the avionics code for a Boeing 777, even the smallest change will set off a ripple effect of weeks' worth of effort.

As a more hands-on example, let's look at a trivial program that prints random numbers:

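#include <stdio.h>
#include <stdlib.h>

/* Print N random numbers, uniform on [0, 1]. */
void F (int N)
{
  for (int i=0; i<N; i++) printf("%f\n", rand() / (double)RAND_MAX);
}

Easy enough, no?  But if we examine it closely, we can see we've played fast and loose with our requirements.  "Print random numbers" is an abstract concept for sure, but that's not what this program is doing.  In actuality, we are printing random numbers uniformly distributed between 0 and 1, because that's what rand() scaled by RAND_MAX gives us.  (On its own, rand() returns an int between 0 and RAND_MAX, so it can't be handed straight to "%f".)  That's a concrete implementation of our abstract concept.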

This means if our requirement changes, we have to change this code, and changing code is a no-no.  What if we want to print numbers between 1 and 100?  Or exponentially distributed?  Or normally distributed between 0 and 500?  We don't want to have to keep changing this unit over and over, so we need to make it abstract, by 'removing all references to specifics'.  We can do this quite easily using a function pointer:

void F (int N, float (*rng)(void))
{
   for (int i=0; i<N; i++) printf("%f\n", (*rng)());
}

So now, instead of 'hardcoding' the function that generates our random number, we simply pass it to our subprogram as an argument.  Now the program is abstract, because it "prints random numbers".  If we want uniformly distributed numbers, we can pass in a thin wrapper around rand() (rand() itself returns an int, so it doesn't match the pointer type), or we can pass in other (custom) functions that generate different ranges with different distributions.  Note that no matter how many times the requirements change, this code never has to.  And code that doesn't change is code that never breaks.
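
To make the caller's side concrete, here is a minimal sketch of two generators being handed to F.  The names uniform01, one_to_100 and demo are made up for illustration, and the null check inside F is an addition that the original does not have.  Since rand() returns an int, it gets wrapped rather than passed in directly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* The abstract unit from the text (plus a null check): it never names
   a generator, it only calls whatever it was given. */
void F (int N, float (*rng)(void))
{
   assert(rng != NULL);
   for (int i=0; i<N; i++) printf("%f\n", (*rng)());
}

/* Hypothetical generator: uniform on [0, 1], built on rand(). */
float uniform01 (void)
{
   return (float)rand() / (float)RAND_MAX;
}

/* Hypothetical generator: whole numbers from 1 to 100. */
float one_to_100 (void)
{
   return (float)(1 + rand() % 100);
}

/* Hypothetical caller: the same F, two different concrete behaviors. */
void demo (void)
{
   F(3, uniform01);
   F(3, one_to_100);
}
```

The requirement "print numbers between 1 and 100" was met by writing a new function, not by touching F.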

Of course, using parameters to provide general behavior instead of hardcoding fixed functionality is nothing shocking, and certainly doesn't qualify as OOP.  In fact, the above code is unlikely to be of any practical use because of one tricky problem: functions normally need data (i.e. arguments), and our abstract subprogram doesn't have them to give (by design).

Note that above, our function pointer is to a subprogram taking zero arguments.  It follows from this that any random generator we want to use must also take zero arguments (or, more generally, they all must take the same arguments), which is unlikely to be the case.  An exponential random generator will need at least some 'lambda' value, more exotic things (power tail, normal, etc) will need several more, and if we want to generate them between a set range we will certainly have to supply that as well.  But we can't pass them in as arguments, since the point is to specifically not know which generator we are using!

The solution to this problem is, of course, to abstract the arguments.  We want to create a record structure that is a 'black box' of arguments that come pre-filled in, and pass that as a whole to the function, which will know how to interpret them.  Now our subprogram looks like this:

void F (int N, float (*rng)(rngParms*), rngParms* p)
{
   for (int i=0; i<N; i++) printf("%f\n", (*rng)(p));
}

We pass in not only a pointer to the function we want to call, but the parameters we want to send to it.  Note that our program is still completely abstract, since we are not looking inside that black box of arguments.

The declaration of the rngParms type becomes tricky.  It needs to be variant, so that it can hold all the necessary parameters for any of the random generators.  We can do this by declaring different structures for each generator, and making a composite union.

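/* Parameters for an exponential generator. */
typedef struct
{
   float lambda;
} Exponential_Parms;

/* Parameters for a normal generator limited to [start, end]. */
typedef struct
{
   float mean;
   float variance;
   float start;
   float end;
} Normal_Parms;

/* The 'black box': big enough to hold any one of the above. */
typedef union
{
   Exponential_Parms e;
   Normal_Parms n;
} rngParms;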

Each function will be written to use the arguments it needs, while ignoring the others.  The calling function picks the arguments to use and the function pointer to pass them to (which obviously must match), and the program remains indifferent towards all.
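
Putting the pieces together might look something like the sketch below.  Treat it as one possible arrangement under a few assumptions, not the definitive version: Uniform_Parms, uniform01, newton_sqrt, uniform_range, normal_clamped and demo are all hypothetical names, the normal generator is a crude sum-of-twelve-uniforms approximation that clamps to its range instead of resampling, and the exponential generator is left out because it would need log() from the math library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parameter records, following the text. */
typedef struct { float lambda; } Exponential_Parms;
typedef struct { float mean, variance, start, end; } Normal_Parms;
/* Hypothetical extra variant (not in the text), to keep the demo short. */
typedef struct { float start, end; } Uniform_Parms;

typedef union {
   Exponential_Parms e;   /* declared for completeness, unused below */
   Normal_Parms n;
   Uniform_Parms u;
} rngParms;

/* The abstract unit: it forwards p without ever looking inside it. */
void F (int N, float (*rng)(rngParms*), rngParms* p)
{
   assert(rng != NULL && p != NULL);
   for (int i=0; i<N; i++) printf("%f\n", (*rng)(p));
}

/* Helper: uniform on [0, 1]. */
float uniform01 (void)
{
   return (float)rand() / (float)RAND_MAX;
}

/* Helper: square root by Newton's method, so the sketch needs no libm. */
float newton_sqrt (float x)
{
   float r = (x > 1.0f) ? x : 1.0f;
   for (int i=0; i<30; i++) r = 0.5f * (r + x / r);
   return r;
}

/* Reads only the u variant. */
float uniform_range (rngParms* p)
{
   return p->u.start + (p->u.end - p->u.start) * uniform01();
}

/* Reads only the n variant.  Approximates a normal draw by summing
   twelve uniforms, then clamps the result into [start, end]. */
float normal_clamped (rngParms* p)
{
   assert(p->n.variance >= 0.0f);
   float z = -6.0f;
   for (int i=0; i<12; i++) z += uniform01();
   float x = p->n.mean + z * newton_sqrt(p->n.variance);
   if (x < p->n.start) x = p->n.start;
   if (x > p->n.end) x = p->n.end;
   return x;
}

/* Hypothetical caller: the only place where a generator and its
   parameters are paired up. */
void demo (void)
{
   rngParms range = { .u = { 1.0f, 100.0f } };
   rngParms bell  = { .n = { 250.0f, 2500.0f, 0.0f, 500.0f } };
   F(3, uniform_range, &range);
   F(3, normal_clamped, &bell);
}
```

Nothing stops a careless caller from pairing normal_clamped with a union filled in for uniform_range.  That matching is entirely the caller's job, which is exactly the kind of bookkeeping that language support can take off your hands.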

You've probably noticed that our trivial little program is starting to get much less trivial.  Pointers, complex unions, and indirect function calls, and all we are doing is "printing random numbers"!  But this all goes back to reusability.  It is a good deal more work now, but think about how much more powerful this function is.  It's essentially "future-proofed".  If we need another set of random numbers in the future, this function never changes.  From now till the end of days, this file stays checked into the source tree, and can be reused across any project, verbatim.  In fact, we never even have to recompile the damn thing!

So the real question becomes, is this OOP?  Most would say no, it's just good engineering design using a procedural language.  But that question misses the point.  OOP is just a syntax, a way to help automate good engineering design.  Lots of languages have special features so that passing around these pointers and variant records is much easier on you, but that's just icing on the cake.  Arguing about whether code is "object oriented" or "procedural" is like arguing about whether code is "Java" or "PHP".  There is no implication of quality, it simply is or is not.  But like we just saw, even derelict C code can be more abstract than most of the crummiest .NET code.  Code is either good or bad, and the way to tell if it's bad is whether you have to change it.
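
For a rough idea of what that icing automates, here is one way to hand-roll it in C: bundle the function pointer and its parameters into a single record, so the caller can no longer mismatch them.  All the names here (Rng, UniformRng, DiceRng, uniform_next, dice_next, demo) are hypothetical, and this is a sketch of the general idea rather than how any particular language implements it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical "base class": just the one operation F cares about. */
typedef struct Rng Rng;
struct Rng {
   float (*next)(Rng* self);
};

/* Hypothetical "subclasses": the base comes first, private data after. */
typedef struct { Rng base; float start, end; } UniformRng;
typedef struct { Rng base; int sides; } DiceRng;

float uniform_next (Rng* self)
{
   /* Valid cast: base is the first member of UniformRng. */
   UniformRng* u = (UniformRng*)self;
   return u->start + (u->end - u->start) * ((float)rand() / (float)RAND_MAX);
}

float dice_next (Rng* self)
{
   /* Valid cast: base is the first member of DiceRng. */
   DiceRng* d = (DiceRng*)self;
   return (float)(1 + rand() % d->sides);
}

/* The abstract unit: one argument now carries both code and data. */
void F (int N, Rng* r)
{
   assert(r != NULL && r->next != NULL);
   for (int i=0; i<N; i++) printf("%f\n", r->next(r));
}

/* Hypothetical caller. */
void demo (void)
{
   UniformRng u = { { uniform_next }, 0.0f, 500.0f };
   DiceRng d = { { dice_next }, 6 };
   F(3, &u.base);
   F(3, &d.base);
}
```

An OOP language would let you write r.next() instead of r->next(r) and would fill in the function pointer for you, but F is no more abstract for it.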

Because at the end of the day, the language you use should just be personal preference.  After all, good code never has to be changed, so the only person that suffers by using a crappy language is you.  With abstract code, nobody ever has to go in and look at it (let alone change it), so nobody cares whether your brackets are in line or on their own line.  If you use a shitty untyped language, that's just more time you had to spend debugging and getting it right the first time, instead of drinking on your porch and judging the relative puffiness of clouds.

Of course, if you are writing bad code, then people will have to open it up endlessly to keep changing it.  Then it becomes an issue of where your brackets are, or how you indent, or whether the new guy out of college knows the language you used, or how you decipher the sparse comments that dyslexic coworker wrote, and things get much more complex.  The natural response is that "everything should be rewritten in that language I like", which makes about as much sense as saying you should buy new Snap-On wrenches to help rebuild your car's engine faster, since it seizes so often because you don't put oil in it!  You don't want to be able to fix it faster, you want it not to break in the first place!

So whichever side of the OOP fence you sit on and whatever language you use, take a long hard look at your code.  Is it abstract?  If I decide to change some arbitrary requirement, will I have to change the code at all?  If the answer is yes, then it doesn't matter what language you used:  It's abstraction that matters, not language.