Very Long Instruction Word Microprocessors
RISC and CISC type microprocessors are not the only game in town. VLIW microprocessors were once believed to be the future, but not anymore. What happened?
In the early 2000s Very Long Instruction Word (VLIW) microprocessors were all the hype, to the point that several chip makers simply dropped development of their own chips and sat back waiting for microprocessor Nirvana to arrive.
Transmeta with their Crusoe and Intel with their Itanium processors promised to bring about this VLIW revolution. Except if you look around, we are not in fact living in a VLIW microprocessor Utopia.
So let us look at what these VLIW processors are like and try to explain why they failed.
Doing Work in Parallel
Because we cannot increase the clock frequency of microprocessors much anymore without overheating them, everything is really centered around ways of doing as much work as possible in parallel.
We have developed a number of ways of doing this. The most straightforward approach is to have multiple microprocessor cores. Each core basically runs a separate program. These could be actual separate programs, or programmers can split up their own programs in such a way that parts of a program run as if the program was made up of smaller programs. We call these tasks, and each task can run on a separate microprocessor core.
Another popular approach is not to perform multiple instructions in parallel but instead to have instructions which manipulate multiple numbers at the same time. We typically call this vector processing. Computer graphics often lends itself to that sort of processing.
Scalar vs Superscalar Processors
A CPU core which performs one instruction at a time is called a scalar processor. So a CPU with multiple cores is still a scalar CPU if the individual cores are scalar. However, if one core is able to execute multiple instructions each clock cycle we call it superscalar. In this broad sense VLIW microprocessors are a type of superscalar processor. But there are many ways of achieving this. Before getting into that, we need to better understand how a regular scalar CPU works.
How Scalar Processors Work
But how can a single CPU perform multiple instructions in parallel? You see, a CPU is not a monolith. It contains several highly specialized parts:
- ALU — Arithmetic Logic Unit. This one carries out addition and subtraction of integer numbers such as 4, 5, 133, 80.
- Integer Multiplier/Divider Unit — Performs multiplication and division of integer numbers.
- FPU — Floating Point Unit. Performs multiplication, division, addition and subtraction of floating point numbers such as 3.4, 2.25, 80.31 etc.
- LSU — Load-Store Unit. Takes care of fetching data from memory and putting it in a microprocessor register, and of storing register contents back to memory.
Inside the CPU there is an instruction decoder. It decodes an instruction and figures out what job needs to be done. Let us look at a simple example of a mathematical expression and how the CPU would deal with it:
y = a*b + c*d
This is what you would write in code, and a compiler will translate this to assembly code (or machine code to be specific) which the CPU understands. This mathematical expression gets broken down into multiple simple instructions understood by the CPU:
load r1, a ; r1 ← a Load contents of a into register r1
load r2, b ; r2 ← b
load r3, c ; r3 ← c
load r4, d ; r4 ← d
multiply r1, r2 ; r1 ← r1 * r2 Multiply r1 and r2. Store in r1
multiply r3, r4 ; r3 ← r3 * r4
add r1, r3 ; r1 ← r1 + r3
store r1, y ; r1 → y Store register r1 at location y
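To see what the listing above actually computes, here is a minimal Python sketch of a scalar machine that executes it one instruction at a time. The tuple encoding of the instructions and the `run` function are made up for illustration; only the instruction names and operands come from the listing.

```python
# Toy scalar machine: executes one instruction at a time, in program order.
# The (op, arg1, arg2) encoding is a hypothetical stand-in for real machine code.

def run(program, memory):
    regs = {}
    for op, arg1, arg2 in program:
        if op == "load":          # register arg1 <- memory[arg2]
            regs[arg1] = memory[arg2]
        elif op == "multiply":    # register arg1 <- arg1 * arg2
            regs[arg1] = regs[arg1] * regs[arg2]
        elif op == "add":         # register arg1 <- arg1 + arg2
            regs[arg1] = regs[arg1] + regs[arg2]
        elif op == "store":       # memory[arg2] <- register arg1
            memory[arg2] = regs[arg1]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory

PROGRAM = [
    ("load", "r1", "a"),
    ("load", "r2", "b"),
    ("load", "r3", "c"),
    ("load", "r4", "d"),
    ("multiply", "r1", "r2"),
    ("multiply", "r3", "r4"),
    ("add", "r1", "r3"),
    ("store", "r1", "y"),
]

memory = run(PROGRAM, {"a": 2, "b": 3, "c": 4, "d": 5})
print(memory["y"])  # 2*3 + 4*5 = 26
```

Note that the loop handles exactly one instruction per iteration, which is what makes this machine scalar.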
You can think of the internals of a CPU as an elaborate system of pipes connecting registers to different function units. In this chemical factory analogy you can think of registers as tanks with chemical compounds you want to mix in different ways. In reality of course they are just containers for numbers you want to do something with.
The decoder will look at the instruction and open the valves to the right register “tank” to let the chemicals pour out of it. In principle a number contained in a register can move to any functional unit. They are all connected.
Thus the decoder also opens up the valves to the specific unit we want to use, while keeping the valves to the other units shut.
Hence when we read the instruction load r1, a the decoder will activate the LSU, so it can load data from memory location a. It also opens the valves between register r1 and the LSU, so that the number loaded from memory flows into register r1.
In contrast when reading the instruction multiply r1, r2, the decoder will open the valves to registers r1 and r2. The multiplier has two separate input ports. The decoder opens the correct valve on r1 and one of the ports on the multiplier so that the number in r1 flows into the first port of the multiplier. Similar actions are performed on the other input port and register r2.
A valve for the output is set up to send the result to register r1 afterwards. Something similar happens for the add r1, r3 instruction, except in this case the valves are opened to the ALU (Arithmetic Logic Unit).
Please note, I am simplifying the language here. Semiconductors don’t contain little pipes and valves for electrons. Rather they use something called multiplexers.
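As a rough illustration of how multiplexers play the role of the valves, here is a small Python sketch. The `mux` and `execute` functions and the register contents are hypothetical; real hardware does this with wires and select signals rather than function calls.

```python
def mux(inputs, select):
    """Forward the one input chosen by the select signal."""
    return inputs[select]

def execute(unit, dst, src_a, src_b, regs):
    # Operand multiplexers: the decoder's select signals decide which
    # registers feed the two input ports.
    port_a = mux(regs, src_a)
    port_b = mux(regs, src_b)
    # In this sketch both units compute a result; the result multiplexer
    # forwards only the output of the unit the decoder selected.
    results = {"alu": port_a + port_b, "multiplier": port_a * port_b}
    out = list(regs)
    out[dst] = mux(results, unit)
    return out

registers = [0, 6, 7, 0]  # r0..r3, made-up contents
print(execute("multiplier", 1, 1, 2, registers))  # [0, 42, 7, 0]
print(execute("alu", 1, 1, 2, registers))         # [0, 13, 7, 0]
```

The decoder's job then boils down to producing the right select values (`unit`, `dst`, `src_a`, `src_b`) for each instruction.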
How Superscalar Processors Work
An important observation to make about the previous example is that while e.g. the ALU is adding two numbers, the LSU, multiplier and FPU are just sitting idle. That is a waste of resources. This got microprocessor designers thinking: Can’t we use these at the same time?
For instance while the ALU is busy adding registers couldn’t the multiplier be multiplying two registers? In our specific case we cannot do that because the addition depends on the results from the multiplications.
But not all instructions depend on each other. For instance load r4, d and multiply r1, r2 are two entirely independent instructions. They could in principle be run in parallel. The multiplier could begin multiplying r1 and r2 while the LSU loads d into r4.
To achieve this we need to have multiple decoders so we can decode more than one instruction at a time. However if we let each decoder start enabling and disabling multipliers and ALUs at the same time we would get chaos.
Thus there are two ways of doing this. We will look at the VLIW solution first, because conceptually this is easier.
But first a remark on something all superscalar CPUs have in common: for them to be practical we need multiple ALUs, multipliers and FPUs. That way we can perform, say, 3 integer additions and 2 multiplications at the same time. If we don’t have multiple functional units of each kind, then we can only execute instructions in parallel if they use entirely different kinds of functional units. That would limit the amount of parallel execution we could achieve.
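The independence check described in this section can be sketched in a few lines of Python. This is a simplified model, assuming the same made-up `(op, arg1, arg2)` encoding of the earlier listing: two instructions are treated as independent when neither writes something the other reads or writes.

```python
def reads_writes(instr):
    """Return (names read, names written) for one instruction."""
    op, arg1, arg2 = instr
    if op == "load":      # load reg, addr: reads memory, writes the register
        return {arg2}, {arg1}
    if op == "store":     # store reg, addr: reads the register, writes memory
        return {arg1}, {arg2}
    # multiply/add reg1, reg2: reads both registers, writes the first
    return {arg1, arg2}, {arg1}

def independent(first, second):
    reads1, writes1 = reads_writes(first)
    reads2, writes2 = reads_writes(second)
    # No overlap between what one writes and what the other reads or writes.
    return not (writes1 & (reads2 | writes2)) and not (reads1 & writes2)

print(independent(("load", "r4", "d"), ("multiply", "r1", "r2")))    # True
print(independent(("multiply", "r3", "r4"), ("add", "r1", "r3")))    # False
```

The second pair is dependent because the addition reads r3, which the multiplication writes.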
Parallel Instruction Execution with VLIW
The idea with VLIW is that you bundle multiple instructions together. So e.g. the microprocessor would read 4 instructions at a time. These instructions are not like normal CPU instructions, but contain information about which functional unit each instruction should be scheduled on. Say we have these four instructions bundled together:
add r1, r2 : ALU1
add r3, r4 : ALU2
multiply r5, r2 : Multiplier1
multiply r6, r4 : Multiplier2
Then they would contain some extra information about which ALU or multiplier to use, to avoid collisions.
For all of this to work we need very advanced compilers which can take an expression like the one we saw before:
y = a*b + c*d
and turn it into assembly code where instructions are bundled in an optimal way and where each operation carries information about which unit it should be scheduled on.
At this point it may start to become a little clearer why VLIW CPUs never worked out well. Your compiler will have to have intimate knowledge about the micro-architecture of the CPU you are compiling for.
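To illustrate how much the compiler has to know, here is a minimal Python sketch of a greedy bundler. It is not how any real VLIW compiler works; it assumes every operation takes exactly one cycle, uses a made-up `(op, arg1, arg2)` instruction encoding, and takes the number of functional units as a hypothetical `machine` description.

```python
UNIT_FOR_OP = {"load": "LSU", "store": "LSU", "multiply": "Multiplier", "add": "ALU"}

def reads_writes(instr):
    op, arg1, arg2 = instr
    if op == "load":
        return {arg2}, {arg1}
    if op == "store":
        return {arg1}, {arg2}
    return {arg1, arg2}, {arg1}

def depends_on(later, earlier):
    reads_e, writes_e = reads_writes(earlier)
    reads_l, writes_l = reads_writes(later)
    return bool(writes_e & (reads_l | writes_l)) or bool(reads_e & writes_l)

def bundle(program, machine):
    """Pack instructions into bundles for one specific micro-architecture."""
    deps = [{j for j in range(i) if depends_on(program[i], program[j])}
            for i in range(len(program))]
    done, remaining, bundles = set(), list(range(len(program))), []
    while remaining:
        used = {kind: 0 for kind in machine}
        slots = []
        for i in remaining:
            kind = UNIT_FOR_OP[program[i][0]]
            # Issue only if all dependencies sit in earlier bundles
            # and a unit of the right kind is still free in this bundle.
            if deps[i] <= done and used[kind] < machine[kind]:
                used[kind] += 1
                slots.append((i, f"{kind}{used[kind]}"))
        issued = {i for i, _ in slots}
        done |= issued
        remaining = [i for i in remaining if i not in issued]
        bundles.append([(program[i], unit) for i, unit in slots])
    return bundles

PROGRAM = [
    ("load", "r1", "a"), ("load", "r2", "b"),
    ("load", "r3", "c"), ("load", "r4", "d"),
    ("multiply", "r1", "r2"), ("multiply", "r3", "r4"),
    ("add", "r1", "r3"), ("store", "r1", "y"),
]

narrow = {"LSU": 1, "Multiplier": 1, "ALU": 1}
wide = {"LSU": 2, "Multiplier": 2, "ALU": 2}
print(len(bundle(PROGRAM, narrow)), len(bundle(PROGRAM, wide)))  # 7 5
```

The same source program produces different bundles for the two machine descriptions, so code bundled for one of them is not the right code for the other.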
Let me just clarify what the difference between CPU architecture and micro-architecture is. The CPU architecture is composed of:
- ISA — Instruction Set Architecture. What assembly code the CPU understands. This defines the operations you can perform, what registers you can use in the code etc.
- Micro-Architecture — How the CPU is actually implemented. Its functional units, branch predictor, instruction decoders etc.
There are many ways to look at this. Imagine any electrical equipment. You have an interface to it, which is things like USB ports or HDMI ports. That is kind of like an ISA. The Micro-Architecture is how the whole thing works inside. Usually we want to cleanly separate the two.
I don’t need to know how a USB mouse or a computer works. As long as both have USB ports and cables I can connect them. Knowing specifics about how the mouse or computer works is not needed. This is in a lot of ways how the modern world was built. We put all the complexity inside a black box and provide a simple clean interface to this black box which abstracts away its internal behavior.
With VLIW you break with this tradition. Should you decide to change the Micro-architecture, by say adding more ALUs or decoders, then you are screwed. You need to recompile your program to include scheduling information appropriate for the new micro-architecture.
Yet this approach has had a great allure, because you can simplify the microprocessor a lot. All the complex logic of figuring out what ALU, multiplier etc should perform a calculation gets done in software by the compiler. There is no shortage of clever suggestions for how to get around the recompilation problem. One could e.g. compile to an intermediate format and use Just in Time compilation (JIT) to fit the architecture.
Anyway this is a great starting point to look at the alternative to this approach.
Out of Order Execution (OoOE)
With Out of Order Execution (OoOE) the CPU will read a fairly normal CISC or RISC stream of instructions. The instructions don’t contain any information about what ALU, multiplier or FPU the instruction should be scheduled to run on. The benefit of that is that we can hide the internal implementation of the CPU. The compiler and the programmer don’t need to know that the CPU is able to run instructions in parallel. That is an implementation detail.
That sounds great! What are the downsides? Obviously there are some major downsides to this approach as well. Doing this requires a far more complicated microprocessor.
First of all we cannot simply decode instructions in parallel and start shipping them to functional units. All of this needs to be coordinated. So decoders for an OoO CPU cannot activate functional units directly. Instead what the decoders do is produce a new set of instructions called micro-ops. Unlike normal RISC and CISC instructions these contain a bunch of meta information which is specific to the current micro-architecture: the kind of information the VLIW instructions would contain.
These micro-operations are put in an instruction queue. But instructions don’t necessarily have to wait in line. If an instruction has no dependencies on results from previous instructions then it will be immediately scheduled to an LSU, FPU, multiplier or ALU for execution.
However the results of this execution are not written to registers or memory immediately. The reason is that changes to memory and registers have to happen in the same sequence as dictated by the order of the instructions. Otherwise you create complete chaos.
Thus every instruction is kept in what we call a Re-Order Buffer (ROB) which keeps track of the state of the instruction. The OoO system will mark off instructions which have finished execution. Only finished instructions at the front of the ROB may be committed at any given time.
Until an instruction is committed its results are really stored in kind of fake versions of the actual registers. Modern CPUs actually have tons of these shadow registers so that many instructions can actually use the same register in parallel as long as there is no dependency among them.
But from the point of view of the assembly programmer these shadow registers don’t exist. There are no assembly instructions to read from them. Hence results stored in these registers have to be eventually committed to the official known registers.
Keeping track of which instructions are good to go, what functional unit each can be scheduled to, performing the commits etc requires a lot of complex logic in the CPU, which adds a lot of transistors. A VLIW CPU will typically be much smaller.
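The in-order commit rule of the Re-Order Buffer can be sketched as follows. This is a toy model with hypothetical names (`ReorderBuffer`, `issue`, `finish`, `commit`); it ignores scheduling and dependencies entirely and only shows that results become visible in program order, even when instructions finish out of order.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # instructions in program order
        self.registers = {}      # the official, programmer-visible registers

    def issue(self, name, dest):
        entry = {"name": name, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):
        # The result is parked in the entry (a stand-in for a shadow register).
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Commit finished instructions, but only from the front of the buffer."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["name"])
        return committed

rob = ReorderBuffer()
slow_load = rob.issue("load r1, a", "r1")
fast_add = rob.issue("add r2, r3", "r2")

rob.finish(fast_add, 7)      # the younger instruction finishes first
first = rob.commit()         # nothing commits: the older load is still pending
rob.finish(slow_load, 2)
second = rob.commit()        # now both commit, in program order
print(first, second)         # [] ['load r1, a', 'add r2, r3']
```

Until the load at the front finishes, the add's result stays invisible to the official registers, which is exactly the behavior described above.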
Why OoOE Beats VLIW
Yet in practice, given that you have enough silicon available, OoOE offers the superior solution. A key reason for this is that VLIW relies on a static analysis of how instructions should be scheduled.
And that is simply not very optimal. How long an instruction takes to execute can vary considerably. A load or store instruction can be significantly slowed down if it tries to access memory that is not in the CPU cache.
Jumps around the code at runtime cause an unpredictable stream of instructions, in which an OoOE unit can still discover instructions which can run in parallel. A VLIW CPU cannot do this, because the compiler doesn’t know what will happen at runtime when it does the scheduling.
In many ways the comparison here is similar to the pros and cons of Ahead of Time compilation (AOT) vs Just in Time compilation (JIT). AOT compilation allows you to spend a lot of time up front to optimize but you cannot take advantage of information only known at runtime. A JIT in contrast cannot spend as much time optimizing but is able to exploit knowledge about current execution.