Why Pipeline a Microprocessor?
Modern CPUs all use pipelining to improve performance, but how does it work? We will use a warehouse robot analogy to explain it.
Pipelining is in many ways such a universal way to speed up a task that you can find countless analogies for it in manufacturing and in everyday life, such as doing laundry or cooking.
At heart, we are dealing with the challenge of getting as much work done per unit of time as possible. There are many ways of achieving this. I will go through different approaches to help you understand how microprocessor designers arrived at pipelining as a smart solution. We will look at:
- Clock frequency. How early CPUs got speed bumps by increasing clock frequency.
- Parallel execution. If you cannot perform an instruction faster, how about performing more in parallel?
- Pipelining. What to do when you cannot run more tasks in parallel and cannot miniaturize to increase the clock frequency.
Speeding Up CPUs By Increasing Clock Frequency
A computer does everything in discrete time steps, called clock cycles. In one clock cycle, a CPU performs its smallest unit of work. It is a bit like how a mechanical watch only does something when the cogs in its gears move one step. Nothing happens in-between.
The most straightforward way of increasing the performance of a microprocessor is by shortening the length of one clock cycle. However there is a limit to how short a clock cycle can be. You cannot make electrons move faster. They have to be able to move through all the necessary transistors in your CPU to complete the operation it is supposed to do in one clock cycle.
This is easier to clarify with a warehouse robot analogy. We have a robot in a warehouse, picking up packages at one end and delivering them at the other end.
One could say this robot also has a clock cycle: the time it takes for the robot to deliver a package at the end and return to the beginning. When the cycle ends, the task of moving a package is finished. In CPU terms, this is the time it takes to complete an instruction.
Just like with electrons we have to imagine that it is not possible to speed up the robot. If we make the clock cycle too short, the robot will not get time to deliver the package and return in the allocated time.
Imagine the robot has only gotten halfway back, and then you instruct it to pick up a package and deliver it. Except it cannot pick up a package, because it is not at the start location.
Benefits of Smaller CPUs
If we cannot make the robot move faster, how are we going to get the job done in less time? In CPU terms the question is: If we cannot make electrons go faster, how do we get them to their destination faster?
The simple answer is by making the microprocessor smaller. In the warehouse analogy, that means making the distance the robot has to travel shorter. Essentially we are reorganizing the warehouse to be more compact. We miniaturize it. Now the robot can move back and forth in less time, letting us shorten each clock cycle.
Shorter clock cycles are the same as a higher clock frequency. Thus miniaturizing the electronics is one way of increasing the clock frequency and getting more done.
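To make this relationship concrete: clock frequency is just the reciprocal of the clock-cycle length. Here is a minimal sketch; the numbers are illustrative, not taken from any particular CPU.

```python
def frequency_ghz(clock_period_ns: float) -> float:
    """Clock frequency is the reciprocal of the clock-cycle length."""
    return 1.0 / clock_period_ns

# A 1 ns clock cycle corresponds to a 1 GHz clock.
print(frequency_ghz(1.0))  # → 1.0

# Halving the distance signals must travel lets us halve the cycle
# length, which doubles the clock frequency.
print(frequency_ghz(0.5))  # → 2.0
```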
But there are limits to how much we can miniaturize a microprocessor. Thus the next step for CPU designers was to do more in parallel. However, there are many ways to achieve parallelization. Let us look at a couple of strategies.
Vector Processing

Early supercomputers, such as the Cray-1, used vector processors. In this case we attempt to do more by processing data in parallel. Only one instruction is performed at a time, but it is performed on multiple elements of data at once. The warehouse analogy would be to have the robot pick up multiple packages on each run.
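The idea can be sketched in plain Python. Note that this only mimics the effect; a real vector processor performs the element-wise operations in hardware with a single instruction.

```python
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Scalar approach: one add instruction issued per pair of elements.
scalar_result = []
for x, y in zip(a, b):
    scalar_result.append(x + y)

# Vector approach: conceptually a single instruction applied to
# every element at once (the robot carrying several packages per run).
vector_result = [x + y for x, y in zip(a, b)]

print(vector_result)  # → [11, 22, 33, 44]
```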
For a more in-depth look at vector processing, see: RISC-V Vector Instructions vs ARM and x86 SIMD.
Multiple Microprocessor Cores
Another alternative would be to actually perform multiple instructions in parallel. That is what we call multi-core CPUs today. The warehouse analogy would be to have two or more robots working in parallel.
Pipelining a Microprocessor

Now we finally get to pipelining. What if you cannot make your microprocessor smaller? In warehouse terms, this means you cannot shorten the distance to the destination where the package (data) needs to be delivered. We think of moving a package as performing an instruction.
The length we move is related to the distance electrons have to travel through transistors. In this regard, miniaturization helps. But it is also related to how complex the instruction is. The more an instruction does, the more transistors the electrons have to move through.
Thus we can think of instruction complexity as something that lengthens the path electrons have to travel. What if we don’t make the CPU itself smaller, but instead reduce the complexity of each instruction? If we make each instruction simpler, then the current has to go through fewer transistors. One way to do this is by splitting up an instruction into multiple smaller steps.
Clock cycle — 1
In our warehouse analogy, this would correspond to adding more robots along the same line, letting each robot move the package only part of the way. Notice how this is different from parallelization, where we add entirely separate lines on which robots can move independently of each other.
In this case, the first robot does not move all the way to the end. Rather, it moves a shorter distance and drops off the package before returning. Remember, we think of a clock cycle as the time it takes for every robot to move a package and return for the next one. With this setup we can reduce the length of one clock cycle, which means we can increase the clock frequency.
Clock cycle — 2
On clock cycle number two, the second robot is able to begin moving the orange package, while the first robot begins to pick up and move the yellow package. Also notice the warehouse guys are bringing in a new package, the green one.
When the second clock cycle is finished, the orange package is available to be picked up by the third robot. The second robot meanwhile can pick up the yellow package. And the first robot can pick up the new green package.
Clock cycle — 3
On the third clock cycle, all the robots move in lockstep, moving their packages one step further. It is worth reflecting a bit on what is going on here. The time it takes for a package to move from the beginning to the very end hasn't really changed. Imagine it took 30 seconds with one robot to move a package all the way to the end. Now moving a package one step takes 10 seconds, but we need three steps, hence the total time is the same.
So what exactly have we gained here? The benefit is that every 10 seconds a new package is delivered at the end, because we are constantly keeping packages in flight. New packages at the start sit and wait a shorter time before a robot picks them up.
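The three clock cycles above can be sketched as a small simulation. This is a minimal sketch of the lockstep behavior only, not of any real CPU; the `pipeline_sim` function and the package names are made up for illustration.

```python
from collections import deque

def pipeline_sim(packages, num_stages=3):
    """Move packages through a lockstep pipeline, one stage per cycle.
    Returns the cycle on which each package is delivered."""
    stages = [None] * num_stages   # one robot per stage
    waiting = deque(packages)      # packages not yet picked up
    delivered = {}
    cycle = 0
    while waiting or any(s is not None for s in stages):
        cycle += 1
        # Lockstep: every robot hands its package one stage forward.
        for i in range(num_stages - 1, 0, -1):
            stages[i] = stages[i - 1]
        # The first robot picks up a new package, if one is waiting.
        stages[0] = waiting.popleft() if waiting else None
        # The package now in the last stage is delivered this cycle.
        if stages[-1] is not None:
            delivered[stages[-1]] = cycle
            stages[-1] = None
    return delivered

# Three packages through a 3-stage pipeline: the first still takes
# 3 cycles end to end (latency unchanged), but after that one
# package finishes every single cycle.
print(pipeline_sim(["orange", "yellow", "green"]))
# → {'orange': 3, 'yellow': 4, 'green': 5}
```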
Benefits and Challenges of Pipelining
We can describe this with two important concepts in computer science:
- Latency is the time from when an action is initiated until it completes. This could be, e.g., the time it takes for a network packet to arrive after you first asked for it. In our case, it is the delay from when a single CPU instruction begins execution until it finishes.
- Throughput is how much data, how many instructions, or other work we are able to crank out per unit of time. In the warehouse, it would be, e.g., the number of packages delivered per minute. For our CPU, it is how many instructions we finish on average per clock cycle.
The throughput has improved because we are pushing out a package every 10 seconds, while before it took 30 seconds. We could also have increased throughput by parallelizing operations. However, that is not always possible. But this is basically how graphics processors (GPUs) get their massive performance. They don't do each individual operation particularly fast; latency, in other words, is not necessarily great. However, they perform a massive number of tasks in parallel, leading to massive throughput.
But for a general-purpose CPU, massive parallelization is usually not possible. Thus pipelining is important. But as we can see with our warehouse example, for this to work well, each robot needs to move about the same distance. Otherwise they cannot move in lockstep. Thus it is important to split a CPU instruction into multiple steps of equal length.
RISC vs CISC Microprocessors
This consideration is why early RISC processors, such as the MIPS and SPARC processors used in Unix workstations in the 1990s, were so much faster than their x86 counterparts. x86 processors are CISC processors, and CISC processors were never designed with pipelining in mind. RISC designers, however, had it in mind from the start. RISC instructions were deliberately designed so that they could all be split into 4 logical steps. Thus one could build a 4-step-deep pipeline. A RISC instruction would take 4 clock cycles to perform, but every clock cycle one instruction would finish execution.
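As a back-of-the-envelope check, a pipeline with D stages finishes N instructions in D + (N − 1) cycles: the first instruction takes D cycles, and each one after it completes one cycle later. A small sketch, using the article's 4-step RISC pipeline:

```python
def pipelined_cycles(n_instructions: int, depth: int = 4) -> int:
    """Cycles to finish n instructions in a depth-stage pipeline:
    depth cycles for the first, then one more per extra instruction."""
    if n_instructions == 0:
        return 0
    return depth + (n_instructions - 1)

def unpipelined_cycles(n_instructions: int, depth: int = 4) -> int:
    """Without pipelining, each instruction occupies the CPU alone."""
    return n_instructions * depth

# 1000 instructions through a 4-step pipeline:
print(pipelined_cycles(1000))    # → 1003
print(unpipelined_cycles(1000))  # → 4000
```

Once the pipeline is full, throughput approaches one instruction per clock cycle, even though each individual instruction still takes 4 cycles.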
Read more: What Does RISC and CISC Mean in 2020?
Intel realized they had to find a way to pipeline their variable-length, complex instructions. Thus they came up with micro-operations. They would chop up complex instructions into multiple micro-operations, which were then simple enough to be fed into a pipeline.
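As a conceptual sketch of that idea: a single instruction that reads memory, computes, and writes memory back can be broken into three simpler steps. The micro-operation names and the instruction chosen here are hypothetical; they do not reflect Intel's actual encoding.

```python
def decode_to_micro_ops(instruction: str) -> list:
    """Split a complex memory-operand instruction into simpler
    steps that can each flow through one pipeline stage."""
    if instruction == "ADD [mem], reg":
        return [
            "LOAD  tmp <- [mem]",     # fetch the memory operand
            "ADD   tmp <- tmp + reg", # do the arithmetic
            "STORE [mem] <- tmp",     # write the result back
        ]
    # A simple register-to-register instruction is already one step.
    return [instruction]

print(len(decode_to_micro_ops("ADD [mem], reg")))  # → 3
```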
Read more: What the Heck is a Micro-Operation?
Taking Pipelining Too Far
Thus Intel began splitting up their instructions into smaller and smaller parts, which allowed them to dramatically increase the clock frequency. The Pentium 4, e.g., was a clock-frequency beast. This made the pipelines on Intel processors really long.
And here we hit on a thing I have not yet covered: branching. You see, instructions don't come as one endless, predictable, linear stream. Real programs make decisions and repeat themselves. Some instructions are repeated over and over again. Different types of data need to be handled by different instructions. Hence, real programs contain conditional jumps. A condition is examined, and based on that condition the CPU makes a jump in the code.
How is this a problem for pipelining? Imagine you have a really deep pipeline with 100 instructions in flight. You can visualize this as 100 of our warehouse robots lined up, moving 100 packages in lockstep. The last instruction being finished is a conditional jump. Depending on the outcome of this instruction, the CPU may realize that those 100 instructions currently in the pipeline are not the correct ones to execute. We are supposed to pick up entirely different instructions somewhere else in memory.
Thus the whole pipeline has to be flushed, and we have to wait a whole 100 clock cycles before another instruction ripples through the pipeline to the end. That is a big performance hit.
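We can put rough numbers on this with a simple cost model. This is an illustrative sketch, not a measurement: assume an ideal pipelined CPU finishes one instruction per cycle, and that every mispredicted conditional jump costs a flush of roughly the whole pipeline.

```python
def average_cpi(pipeline_depth: int, branch_fraction: float,
                mispredict_rate: float) -> float:
    """Average clock cycles per instruction: an ideal CPI of 1 plus
    a full-pipeline flush penalty per mispredicted branch."""
    flush_penalty = pipeline_depth
    return 1.0 + branch_fraction * mispredict_rate * flush_penalty

# Illustrative numbers: 20% of instructions are branches, and 10%
# of those branches are mispredicted.
print(average_cpi(5, 0.20, 0.10))    # short pipeline: ~1.1 CPI
print(average_cpi(100, 0.20, 0.10))  # deep pipeline: ~3 CPI
```

Under these (made-up) assumptions, the deep pipeline averages almost three times as many cycles per instruction, which is why a high clock frequency alone does not guarantee high performance.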
This is why Apple famously had slides comparing the RISC-based PowerPC chips, which they used in their Macs until 2006, with Intel's Pentium processors. The Pentium processors had a far higher clock frequency, but also a much longer pipeline. Back then, people thought of clock frequency as equivalent to performance. As we have seen, the two are clearly related, but they are not quite the same.
The Pentiums would have their very long pipelines frequently flushed, causing a major performance hit. Meanwhile, the PowerPC chips had a lower clock frequency but much shorter pipelines, leading to a much smaller performance hit from conditional jumps.
It is worth noting that since then we have reduced this problem by using branch predictors. A branch predictor guesses where a branch will go before the instruction testing the condition has actually executed.
This will likely be the topic of my next technical article.