Hot Wheels and Legos: The Two Types of Abstractions in Scientific Programming

I love science (who doesn't?).

I love programming.

I hate reading code from scientific papers.

The code in scientific papers is often the butt of some joke: usually an unspeakable number of if/else statements sprawling across my Emacs buffer, slapped together with single-letter variable names and abuse of object-oriented programming paradigms. The worst part is, you can read the paper and follow along, then come to the code and have every notion of understanding fall apart. The kicker, though? I completely sympathize with the scientists and mathematicians who wrote this awful code.

This isn't anything new; HN even had a debate today on the virtues of scientific programming vs. traditional programming best practices. I already threw my hat into that ring, but I figured I'd expand on it here for posterity's sake and explain why scientists write such awful code. So indulge me in a story:

When I was a kid, I loved Hot Wheels tracks. You could put two tracks together and bam, you've got a self-made race course! Of course, once you built the entire track, you couldn't really swap out parts or add new loops in the middle; the track had very little modularity once built. Plus, each piece of track was pre-built and immutable: either a left turn, a right turn, or a straight line.

Then a couple of years later I got off my Hot Wheels kick and my parents bought me a Lego set from Wal-Mart. I followed the instructions that came with it and built an adorable little penguin. However, right on the heels of my great penguin accomplishment, my dino-obsessed 6-year-old mind thought: "Could I make this a T-Rex?" And after meticulous rearrangement of the 20-odd parts, I had myself a stack of black and white bricks in the vague shape of a T-Rex. Didn't matter though, because I was hooked. Legos could be anything! A penguin! A T-Rex! They could even be a boat!

So how does this relate to scientific programming and proper programming paradigms? Well, it's important to realize the difference between mathematical abstractions and programming abstractions.

Mathematical abstractions are like Hot Wheels: each function, each integral, is a piece of track. What each function does is relatively immutable and gives one output, just like each part of a Hot Wheels track.

Programming abstractions are like Lego blocks: each function is modular in nature and adaptable in shape. You can combine, coalesce, and split programming functions with tools like inheritance, interfaces, and wrappers.
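To make the contrast a bit more concrete, here's a rough sketch in Python. The function names are made up for illustration and the midpoint rule is just a convenient stand-in for "some integration"; none of it comes from a real paper.

```python
import math
from typing import Callable

# Hot Wheels style: one fixed piece of track. It computes exactly one thing,
# the integral of sin(x) from 0 to pi, using 1000 midpoint slices.
def integral_of_sin_0_to_pi() -> float:
    n = 1000
    h = math.pi / n
    return sum(math.sin((i + 0.5) * h) for i in range(n)) * h

# Lego style: the same midpoint rule, but the integrand, the bounds, and the
# resolution are all bricks you can swap out.
def midpoint_integral(
    f: Callable[[float], float], a: float, b: float, n: int = 1000
) -> float:
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# A wrapper is just another brick: it takes a function and returns a new one.
def scaled(f: Callable[[float], float], factor: float) -> Callable[[float], float]:
    return lambda x: factor * f(x)

if __name__ == "__main__":
    print(integral_of_sin_0_to_pi())
    print(midpoint_integral(math.sin, 0.0, math.pi))
    print(midpoint_integral(scaled(math.cos, 2.0), 0.0, math.pi / 2))
```

All three calls approximate an integral whose exact value is 2. The first function can only ever answer that one question, while the second can be snapped together with any integrand or wrapper you hand it.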

Now why do scientists write terrible code? It's not for lack of training or intelligence; it's that scientists often operate on the level of mathematical abstraction. Usually their mathematical derivations have already been done on paper or in some other manner, and all they're using programming for is a fancy calculator. That means they're only interested in how the program maps one-to-one onto their mathematical abstraction, which leads to one-use spaghetti functions and egregious use of temporary files and variables, since all the scientist cares about is the end result. This completely flies in the face of programming abstractions, since suddenly the code is horribly rigid and strapped with a ton of technical debt. The scientist doesn't care, since they'll probably never revisit this code, and even if they do, they'll probably just rewrite it.
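Here's a small, hypothetical illustration of that "fancy calculator" style next to a more reusable version. The dataset and every name in it are invented for the example; the math is just z = (x - mean) / standard deviation.

```python
import math

# "Fancy calculator" style: a line-by-line transcription of the formula for
# one hard-coded dataset, with temporaries that only mean something on paper.
d = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m = sum(d) / len(d)
t = 0.0
for i in range(len(d)):
    t = t + (d[i] - m) ** 2
s = math.sqrt(t / len(d))
z = []
for i in range(len(d)):
    z.append((d[i] - m) / s)

# The same math split into named, reusable pieces.
def mean(values: list[float]) -> float:
    return sum(values) / len(values)

def population_std(values: list[float]) -> float:
    center = mean(values)
    return math.sqrt(sum((v - center) ** 2 for v in values) / len(values))

def z_scores(values: list[float]) -> list[float]:
    center = mean(values)
    spread = population_std(values)
    return [(v - center) / spread for v in values]

if __name__ == "__main__":
    print(z)
    print(z_scores(d))
```

Both halves produce the same numbers. The difference only shows up when a second dataset arrives: the top half has to be copied and edited, while the bottom half just gets called again.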

How do we solve this? Besides educating scientists about the merits of future-proofing code and good programming practices, it's important to realize that programmers can learn just as much from scientists. We don't need abstraction towers reaching to the heavens, or a Hadoop cluster spun up for a 200 MB dataset. In the traditional Unix fashion, less is more.