Alecco Locco: December 2010

A couple of years ago, Donald Knuth said in an interview^[1]:

Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading.

Most algorithms were designed with a classic approach of doing one instruction at a time. This is called scalar processing. Algorithms are then implemented on a programming language and finally run compiled or interpreted. There are important issues on these transformations of the code, but that's out of the scope of this post.

The number of transistors inside processors has been roughly doubling each year^[2]. Due to the nature of the transistors electricity gets transformed to heat. A lot of heat in a very small space packed with millions of transistors. So CPU manufacturers have reached a hard limit of CPU power heat dissipation^[3]. The only way out of this dead end was to increase the number of CPU cores, stopping the Ghz race.

Another big issue with current computers is latency. While processors can run multiple instructions per nanosecond (ns), accessing memory takes about 100 ns. To work around this problem manufacturers place memory caches running at speeds much closer to CPU execution times. But this caches are very expensive and only processors aimed at the high-end server market have big configurations.

If programs were executed one instruction at a time most of the CPU would be idle waiting for a particular unit to finish its current task. Very often instructions in a given block of code can be run in parallel because they are not dependent (e.g. initializing unrelated variables.) The CPU performs this optimizations by having a deep pipeline of instructions currently executing (resembling assembly lines in factories.) To manage this pipeline the circuitry has to determine the instruction dependencies and sometimes reorder the instructions to improve throughput. Good compilers and VMs optimize compiled programs by ordering the instructions to match the characteristics of different CPU models.

When the flow of control of a program reaches a conditional branch (if-then-else) the processor evaluates the condition (e.g. variable a is zero) and make a jump (or not) to the following instruction. This evaluation usually takes long time and disrupts superscalar pipelining. To overcome this the processor has a dedicated unit called branch predictor^[4] to evaluate conditional branches by remembering what branch was taken before. If the predictor is successful the flow is keeps running fast uninterrupted. But when there's a branch misprediction the wrongly picked instructions in the pipeline must be undone. This causes a pipeline flush often costing 5ns. For algorithms with many conditional branches with a probability close to .5 this can multiply into major bottleneck (e.g. walking binary trees or classic compression.)

The premises for making algorithms in the literature are completely out of touch with all these issues. One of the old maxims was to avoid full scans at all costs, but a sequential memory scan nowadays goes at 6 to 14 GiB per second and the limits are mostly the data bus bandwidth. Random access traversal in memory often is several orders of magnitude slower due to the latency issues compounded with branch mispredictions. The data structures often used don't scale due to fragmentation. In many cases are sequential in nature so parallel execution is either impossible or requires very slow locking mechanisms for critical areas making the code error-prone and filled with hard to predict bottlenecks.

To use larger amounts of main memory the processors switched to big data addresses. Every memory reference in a 64bit address space requires 8 bytes. In very common scenarios where linked structures are widely used for data storage in memory these huge pointers become a significant overhead. For example, each node of a binary tree will require at least 16 bytes and often 8 more to link to parent (as in popular red-black tree implementations.) If the node only stores 32bit integers the overhead is 400% to 600%!

In many cases the same code is applied to many elements of a data structure iteratively. With current processors it is possible to run the same instructions to packs of data and it's called SIMD (Single Instruction on Multiple Data.) This is widely used for media processing but it is becoming common for general purpose data processing. Most SIMD implementations are evolving to give better support for non-multimedia uses (e.g. Intel's SIMD string instructions.) Some of the most interesting new algorithms of the last decade were redisigns to exploit SIMD processing. Programming directly to one of this SIMD implementations can be quite tricky but it is possible to simplify this with compiler intrinsics or re-targetting compiler tools (like MIT's Cilk.)

There are many interesting instructions available on processors performing very useful tasks (e.g. bit scan) completely neglected by programming textbooks. The savings in time and complexity for algorithms can be very significant.

I differ with proffessor D. Knuth and think this is a great opportunity and all this limitations actually make these times very interesting for computer scientists. It's time for a revolution! Almost everything has to be revised to match the new hardware paradigm. For this there's a need for an army of Knuths to find the way. We need to shake the CS establishment out of their comfortable leather chairs.

[Note: I do not mean Prof. D. Knuth, in fact he is one of the very few CS masters who ties high level theory with low level realistic implementations.]

The New Algorithm Design Maxims

Find alternatives to sequential processing
Minimize memory use and avoid bloat (Latency)
Store related data as close as possible (Caching)
Minimize branch mispredictions or remove branches altogether
Favor vector/matrix data structures over linked nodes (Pointers, Caching)
Exploit vector processing if possible (SIMD)
Embrace specialized instructions widely available

A very common approach to speed up processing is to perform space-time trade-offs. For example changing programmatic code to big table lookups. While this works in the small micro-benchmarks it is a short-sighted trick that usually makes scaling very hard. CPU cores can perform billions of instructions per second if used wisely and the new trend for big data processing is to do the opposite, compute-space trade-off. In particular with fast compression algorithms.

Good times.

[1] InformIT: Interview with Donald Knuth (2008)
[2] Wikipedia: Moore's Law
[3] Wikipedia: CPU power dissipation
[4] Wikipedia: Branch Predictor