Thursday, August 23, 2012

Everything you learned about programming is wrong

Pop quiz, hotshot: How big is an integer?

If you came of age anytime in the last thirty years and did even the least bit of programming, this is likely ingrained into your skull alongside Star Wars quotes, your high school locker combination, and the invulnerability code for DOOM (iddqd).  The sky is blue, water is wet, women have secrets, and an integer tops out around +/- 2.1 billion (give or take).

Okay, now for the harder question: why?

Ever since I can remember, this was a nonsense question.  The first thing you did when you learned a programming language was find the page of the manual that described the primitive data types: the building blocks of the language.  You then represented your data within these types, created larger, composite types out of them, fashioned algorithms around them, and you didn't ask silly questions about why the ranges were what they were.  There is no why!

Pretty much every single language I ever used was like this.  C/C++ is like this.  Java is like this.  Assembly language is like this.  SQL is like this.  Scheme is like this.  Hell, even GW-freaking-BASIC is like this.  This is just how programming is done.  You get three basic integral types (byte, word, dword) and two floats (single and double precision), and you force your program to fit into what the compiler gives you.

But consider that while variables are normally compiled into blocks of memory, it is quite likely that your program is not dealing with blocks of memory.  You are dealing with things like engine speeds, the contents of a text document, pictures, the number of bullets a video game character has left, whether a bridge is raised or lowered, a video stream, the current color of a traffic light, and an infinite number of other "real life" things.  Our whole business is using numbers to represent things that are not numbers: why does the compiler get to dictate what those things are?

Suppose you are writing a program that monitors the speed of an engine.  If you grew up learning the languages listed above, then you would likely do something like this (assuming we use whole numbers):

int Engine_Speed;

Of course, if you are a C programmer, you are required to name the variable 'q' so that no future programmer can ever decipher or maintain your original work; but I digress.  The point is that we have integral data, so we use whatever data type is most compatible.

But this is wrong.  Not just in syntax but in concept.  At a high level, we are representing engine speed.  Engine speed does not have a range of +/- 2.1 billion, like a C-style int does.  The speed of our engine is not an integer, and it never will be.  An integer is a meaningless, unitless number that represents nothing.  It's just that the compiler happens to be running on a machine with a 32-bit data bus, which makes it convenient for the compiler writer, since each source-level statement corresponds to a similar machine opcode.

But this "one-size-fits-all" data typing is fundamentally flawed.  You are the programmer, you should get to pick the data types!  Our engine can't spin at 2.2 billion RPMs, so in what universe does it sense to pick a data type with that range?  Because it's easier on the compiler?  Seriously?

Okay, now let's suppose we also need to represent the state of the engine.  Like any good C programmer, we also pick an int because, um, well, that's just what we've always done:

int Engine_Speed;
int Engine_Status;

(Also, again, in a real C program the status would be named 'x7' to spite those with the gall to want to read your code later.)  Our status now has the same range of +/- 2.1 billion, just like speed, which clearly doesn't correctly represent our engine status of "on" or "off".  So we just sort of come to a gentleman's agreement that 'zero' will be off, because the 'o' in off looks sort of like a 0, and that anything else will be on (which also starts with an 'o', but we can't be concerned with that!  There is code to write!)

But come on!  Our engine is not 'zero' or 'non-zero', or even 'true' or 'false': it's on or off.  Those are the only values our variable can have, on and off, and nothing else.

The fact that the variables we are using to represent 'things' don't correctly represent those 'things' is bad enough, but it just gets worse from there.  We are using the same type, so the compiler will tacitly assume they represent the same data.  We could, for instance, call Set_Engine_Status(Engine_Speed), which makes no sense at all, or Increase_Engine_Speed(Engine_Status), which will either stop the engine or overdrive it, depending on what arbitrary value we decided to use for the 'on' state (remember, it's anything non-zero, so it could be 2.1 billion).  Both errors (which are easily copied and pasted) would show up at best as subtle logic bugs, and at worst as a wrecked engine.

And this is the epiphany I had when I started using Ada: everything I had learned about data types, from my days hacking GW-BASIC, to my college courses, to the early part of my professional career, was just wrong.  Bet-on-the-wrong-horse wrong.  Back-to-the-drawing-board wrong.  Fundamentally flawed.  The idea that you should have to force the hundreds (if not thousands) of different types of data your program has into the five built-in ones that the compiler shoves down your throat is a miserable way to write a program.  It's like the scene in 'My Cousin Vinny': Breakfast?  You think?

And this is C's (and C++'s, and Java's, and Basic's, and .NET's) real Achilles' heel.  We can argue about brackets, pointers, and the preprocessor all day long, but at the end of the day C will never be able to escape its fubar type system.  Historically, C was designed to do systems programming, i.e. Unix, along with the requisite debuggers, compilers, and what-have-you.  But that's the key: when you are doing systems programming, you are working in the problem domain of registers, blocks of memory, and CPU data buses.  If you are writing a debugger, then you do want your data types to correspond to the physical hardware.  And that's why C is still perhaps the best systems programming language.

But it's just not suitable for everyday programming, and neither is any other "general-purpose" language that sets you up with a list of built-in types to use, because my programs have more than five types of data and I'm sure yours do too.  A data type is not something the compiler gives you, it's something you give the compiler.

Which brings us to Ada:

Ada doesn't have any built-in types.

Not a single one.  Every type is a user-defined type.  When you want a variable, you first declare a type that specifies what the data is.  For instance, in Ada, we would implement the previous example like so:

type Speed is range 0 .. 10_000;
Engine_Speed : Speed;

type Status is (On, Off);
Engine_Status : Status;


Syntax aside, the difference is that we didn't just pick an arbitrary data type that was 'close enough', or that we could rationalize by squinting our eyes hard enough.  We defined our own types that are not just easier to read, but also properly range checked and type checked (unlike the ultimately useless typedef and enum that C provides).

The primary benefit is that this saves us the trouble of range checking the values ourselves.  We can safely assume that the engine speed will never be 2 billion, because it can't be.  If we ever try to push the speed above 10,000 or below 0, an exception is raised and the program ends (or, less often, handles it).  If we try to do something silly at compile time, such as assigning it 999,999, we get an error because the value is out of range.  And if we mix up status and speed somewhere, we get an error, because a Speed is not compatible with a Status, and vice versa.
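
To make this concrete, here is a small, self-contained sketch of how those checks play out (the procedure name and the literal values are made up for illustration):

with Ada.Text_IO; use Ada.Text_IO;

procedure Engine_Demo is
   type Speed  is range 0 .. 10_000;
   type Status is (On, Off);

   Engine_Speed  : Speed  := 0;
   Engine_Status : Status := Off;
begin
   Engine_Speed  := 9_500;   -- fine: within 0 .. 10_000
   Engine_Status := On;      -- fine: On is a legal Status value

   --  Engine_Speed  := 999_999;       -- flagged by the compiler: not in 0 .. 10_000
   --  Engine_Status := Engine_Speed;  -- rejected by the compiler: a Speed is not a Status

   Engine_Speed := Engine_Speed * 2;   -- 19_000 is out of range: raises Constraint_Error
exception
   when Constraint_Error =>
      Put_Line ("Engine speed went out of range.");
end Engine_Demo;

Run it and the doubled speed trips the range check and lands in the exception handler; the equivalent C program would have happily kept going with a nonsense value.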

So really, why are you still fooling yourself?  Why do you still believe that the age of an employee is an 'integer', or that the cost of a widget is a 'float', or that an image is nothing but an array of chars?  These rationalizations are wrong, wrong, wrong!  It's an outdated, antiquated way of thinking from the days of assembly language.  Just because two types have values that are implemented the same in assembly language doesn't mean they are the same type up at the level that programmers need to think at.  Apples are not oranges, and shame on you for letting your compiler browbeat you into thinking they are.
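
To sketch what that kind of thinking looks like (the names and ranges below are invented for illustration, not pulled from any real system), each of those could be its own Ada type:

type Employee_Age is range 16 .. 120;                -- an age, not an arbitrary integer
type Money        is delta 0.01 digits 12;           -- exact cents, not a binary float
type Pixel        is mod 2**8;                       -- one 8-bit sample
type Row          is range 1 .. 1_080;
type Column       is range 1 .. 1_920;
type Image        is array (Row, Column) of Pixel;   -- a picture, not "an array of chars"

None of these will ever be confused with one another by the compiler, even though several of them end up as plain machine integers underneath.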

This is why Ada works, and everything else doesn't.  That's why on a cross-country flight, it never occurs to you that the only thing keeping your ass in the air is software.  That's why you never have to reboot your car's engine.  And that's why if your language of choice presents you with a list of data types and says "choose", you need to turn around and find a new language.
