ARM, x86 and RISC-V Microprocessors Compared

A comparison of different design choices in the assembly language of three important microprocessor instruction-sets.

In the PC world, x86 microprocessors from AMD and Intel dominate. On tablets and smartphones, ARM chips from Qualcomm and Apple dominate. RISC-V is a new microprocessor instruction-set which various companies are starting to use.

For those with some interest in assembly programming, I thought I would compare these chips: how they deal with common operations, and the rationale for their differences.

Find this story hard to follow? Read: How Does a Modern Microprocessor Work?

Instruction Length

For RISC microprocessors such as ARM and RISC-V this is simple. Every instruction is 32 bits long (4 bytes). This is very common for RISC microprocessors: ARM, MIPS, RISC-V and PowerPC all use fixed-length 32-bit instructions.

However, don’t confuse this with whether it is a 64-bit or 32-bit microprocessor. A 64-bit microprocessor will typically have 64-bit registers which it can work on, but the instructions themselves will still typically be 32-bit. The reason is simple: you generally don’t need as much space as 64 bits for an instruction, and if instructions were that long you would end up doubling the memory requirement for your binary code.

However there are some exceptions to this. The AVR microprocessor is an 8-bit RISC processor used mostly in microcontrollers such as the ones on Arduino boards, which are popular with hobbyists. It has 16-bit instructions. In fact you can pull off a limited RISC instruction-set in 16 bits, which is why many CPU architectures, including ARM, MIPS and RISC-V, have support for 16-bit instructions. We call these compressed instructions. The CPU still reads in 32 bits at a time, so a single read can hold two compressed instructions. Once they are received, the CPU can determine that they are compressed and inflate each one to a normal 32-bit instruction.
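As a rough illustration of how a decoder can tell the two kinds apart, here is a minimal Python sketch based on the RISC-V convention that a 32-bit instruction has its two lowest bits set to 11, while anything else is a 16-bit compressed instruction. The function name instruction_length is my own invention, and the sketch ignores the longer encodings which the RISC-V specification also reserves.

```python
def instruction_length(first_halfword: int) -> int:
    """Return the instruction length in bytes, judged from its first 16 bits.

    Assumption: only 16-bit and 32-bit encodings exist. The RISC-V
    specification also reserves longer formats, which this sketch ignores.
    """
    if first_halfword & 0b11 == 0b11:
        return 4  # normal 32-bit instruction
    return 2      # 16-bit compressed instruction


# ADD x1, x4, x6 encodes as 0x006200B3, so its low halfword is 0x00B3.
print(instruction_length(0x00B3))  # 4
# C.NOP, the compressed no-operation, encodes as 0x0001.
print(instruction_length(0x0001))  # 2
```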

CISC CPUs such as x86 are very different here. Their instructions don’t have a fixed length. An x86 instruction can be from 1 to 15 bytes long (8 to 120 bits). Actually, in theory an x86 instruction could be of infinite length, but dealing with infinitely long instructions is impractical. Thus both Intel and AMD set a practical limit and refuse to process instructions which are encoded as longer than 15 bytes. People who write compilers know this and will of course avoid outputting instructions longer than this.

Operands and Registers

Operands are the inputs to an assembly instruction. We can look at some pretty common instructions such as adding, subtracting and multiplying. Please note everything after the ; is typically a comment in Assembly code. It is not part of the instruction. I am adding little comments behind each instruction to explain what it does.

ADD x1, x4, x6  ; x1 ← x4 + x6
SUB x1, x4, x6  ; x1 ← x4 - x6
MUL x2, x4, x6  ; x2 ← x4 × x6

This is what addition, subtraction and multiplication look like on ARM and RISC-V. Yes, a lot of the more common instructions look very similar on ARM and RISC-V. Even the registers have similar names: x0 to x31 on RISC-V and x0 to x30 on 64-bit ARM, where register number 31 is special, as we will see later.

ADD  eax, ebx   ; eax  ← eax + ebx
SUB eax, ebx ; eax ← eax - ebx
IMUL eax, ebx ; eax ← eax × ebx

The x86 instructions look quite different in style. Registers have different names. This needs some explaining.

The x86 has a long heritage going all the way back to the Intel 8086 microprocessor, which was 16-bit. The first instructions were thus made to deal with 16-bit and 8-bit values. Various suffixes and prefixes have been added over the years to deal with larger registers. The original 16-bit registers were called ax, bx, cx, dx, si, di, bp and sp.

Thus to deal with 32-bit values we got registers named eax, ebx, ecx, etc., which are the 32-bit versions.

Another difference is that the x86 instructions only take two operands. This may be easier to explain by explaining why RISC instructions tend to take 3 operands.

Since RISC is all about fixed-width instructions, it makes no sense not to use all the space available. You need 32 bits to be able to fit a reasonably sized address, such as a 16-bit address. Thus when operating only on registers rather than memory addresses, we have at least 16 bits available to encode registers.

We could have two operands, which would give us 8 bits for each. But 8 bits is enough to specify 256 different registers (2⁸ = 256). Do we really need that many? Probably not. Thus most RISC CPU designers have decided it is better to use 5 bits to encode each register, which gives us 32 different registers (2⁵ = 32). That way we can encode three different registers with 15 bits.

This kills two birds with one stone. We utilize the bits available to us, and with 3 operands we can much more easily juggle data around in registers. This reduces the number of operations we need to perform and the number of reads and writes we need to do to memory.
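To make the bit budget concrete, here is a small Python sketch which packs three 5-bit register numbers into a 32-bit word using the RISC-V register-to-register (R-type) layout. The helper name encode_r_type is just something I made up for illustration. The default field values are the ones RISC-V uses for ADD.

```python
def encode_r_type(rd: int, rs1: int, rs2: int,
                  funct3: int = 0, funct7: int = 0,
                  opcode: int = 0b0110011) -> int:
    """Pack three 5-bit register numbers into a 32-bit RISC-V R-type word."""
    for reg in (rd, rs1, rs2):
        if not 0 <= reg < 32:
            raise ValueError("register numbers must fit in 5 bits (0 to 31)")
    # Layout from high to low bits: funct7, rs2, rs1, funct3, rd, opcode.
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


word = encode_r_type(rd=1, rs1=4, rs2=6)  # ADD x1, x4, x6
print(f"{word:#010x}")                     # 0x006200b3
```

The three register fields take up 15 bits in total, which leaves 17 bits to say which operation to perform.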

Special Registers

Intel x86 processors are full of special registers. What we mean by that is that there are certain registers which are used for particular instructions. E.g. rsi and rdi are used for indexing related operations. rbp is used for the stack frame (the area allocated for local variables when calling a function).

RISC processors are quite different in this regard. Typically most registers are general purpose. Sometimes they have an extra use, but they can be used as operands for almost any instruction which takes general purpose registers as operands.

Clearing a Register

For RISC processors to simplify operations there is often a register which is designated as a zero register. What this means is that that register always contains the number zero. You cannot change it. This may sound like an odd hack, but I will make it clear why this is a very elegant solution.

For Intel x86 if I want to clear a register (set it to zero) I can write instructions like this:

MOV eax, 0       ; eax ← 0
XOR eax, eax ; eax ← eax ⊻ eax

On x86 the exclusive OR (XOR) version is usually preferred as it requires fewer bytes to encode. With RISC processors we instead utilize the fact that there is a register always containing zero.
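The size difference on x86 can be seen by looking at the machine code bytes. This is a small Python sketch which just holds the two byte sequences in constants (the constant names are my own) and compares their lengths.

```python
# Machine code bytes for the two ways of clearing eax.
MOV_EAX_0 = bytes([0xB8, 0x00, 0x00, 0x00, 0x00])  # MOV eax, 0: opcode + 32-bit constant
XOR_EAX_EAX = bytes([0x31, 0xC0])                   # XOR eax, eax: opcode + operand byte

print(len(MOV_EAX_0))    # 5
print(len(XOR_EAX_EAX))  # 2
```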

On 64-bit ARM the zero register is register number 31, which is written xzr in assembly code. We can clear register x8 like this:

MOV x8, xzr   ; x8 ← 0

For RISC-V the zero register is x0. However RISC-V does not have a normal move instruction. Let me explain with an example:

MV x1, x4    ; shorthand for ADDI x1, x4, 0

The move instruction MV on RISC-V is what we call a pseudo instruction. It translates to an ADD immediate ADDI instruction. Thus if you did a disassembly of a program where you wrote a bunch of MV instructions, they could come back as ADDI, because the disassembler would not necessarily know that you were trying to express a move.

To clear a register on RISC-V we can thus copy from the zero register with MV x8, x0, which becomes ADDI x8, x0, 0. Another option is the AND immediate ANDI instruction:

ANDI x8, x8, 0    ; x8 ← x8 & 0

Sometimes the zero register in a RISC processor can have other purposes. E.g. on ARM in special contexts it means the stack pointer (memory address where locally allocated variables are placed).

How is this possible? Well, the zero register doesn’t physically exist on the silicon die anywhere. It is just a register number that the CPU instruction decoder chooses to interpret in a particular way.
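Here is a toy Python model of that idea, following the RISC-V convention where register number 0 is the zero register. The class RegisterFile and its methods are hypothetical names of mine. Real hardware does this in the decoder rather than in software.

```python
class RegisterFile:
    """Toy model of 32 registers where register 0 is hardwired to zero."""

    def __init__(self) -> None:
        self._regs = [0] * 32

    def read(self, number: int) -> int:
        # Register 0 needs no storage: reading it just gives the constant 0.
        return 0 if number == 0 else self._regs[number]

    def write(self, number: int, value: int) -> None:
        # Writes to register 0 are silently discarded.
        if number != 0:
            self._regs[number] = value


regs = RegisterFile()
regs.write(8, 123)
regs.write(0, 999)            # has no effect
regs.write(8, regs.read(0))   # like MV x8, x0: clears x8
print(regs.read(0), regs.read(8))  # 0 0
```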

Elegance of the Zero Register

With the zero register we can easily create a whole host of useful pseudo instructions, without adding actual extra instructions to the microprocessor. Remember a pseudo instruction is just a shorthand for another instruction.

For instance many processors have a special instruction for making a value negative called NEG:

NEG x2, x4       ; x2 ← -x4

However on RISC-V and 64-bit ARM, NEG is just a shorthand for a SUB instruction. On RISC-V it looks like this:

SUB x2, x0, x4   ; x2 ← x0 - x4 equals  0 - x4

Similarly, RISC-V does not need to set aside special bits to encode a do-nothing NOP instruction. NOP can simply be a shorthand for an instruction which sends its result to x0:

ADDI x0, x0, 0   ; x0 ← x0 + 0, result is discarded

The zero register is also used a lot in RISC-V branch (jump) instructions, which we will look at later. For instance Branch if EQual to Zero BEQZ checks if x2 is zero, and jumps to the label loop if that is the case.

LI x2, 4 ; Load 4 into x2. Pseudo instruction
BEQZ x2, loop ; IF x2 = 0 GOTO loop

It is just a shorthand for Branch if EQual BEQ:

BEQ x2, x0, loop  ; if x2 == x0 GOTO loop
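The pseudo instructions covered so far can be summed up in a small Python sketch which rewrites each one into the real instruction it stands for. The function expand_pseudo is a made-up helper for illustration. A real assembler handles many more cases.

```python
def expand_pseudo(mnemonic: str, *operands: str) -> tuple[str, ...]:
    """Rewrite a few RISC-V pseudo instructions as the real instruction."""
    if mnemonic == "MV":        # MV rd, rs
        rd, rs = operands
        return ("ADDI", rd, rs, "0")
    if mnemonic == "NEG":       # NEG rd, rs
        rd, rs = operands
        return ("SUB", rd, "x0", rs)
    if mnemonic == "NOP":
        return ("ADDI", "x0", "x0", "0")
    if mnemonic == "BEQZ":      # BEQZ rs, label
        rs, label = operands
        return ("BEQ", rs, "x0", label)
    return (mnemonic, *operands)  # anything else is passed through unchanged


print(expand_pseudo("NEG", "x2", "x4"))     # ('SUB', 'x2', 'x0', 'x4')
print(expand_pseudo("BEQZ", "x2", "loop"))  # ('BEQ', 'x2', 'x0', 'loop')
```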

This is a cue to look more at how ARM, x86 and RISC-V deal with conditional branching.

Conditional Branching

Branching is what allows us to jump around in our code, so that some instructions can be repeated many times. Conditional branching means a jump is only made in case a certain condition is met. This is how we implement for-loops, while-loops and if-statements in assembly code.

In this case x86, ARM and RISC-V are quite interesting cases because they all happen to have quite different approaches to conditional branching.

It is actually useful to look at RISC-V first, because it does branching in a way more similar to how regular programming languages work. This simple program basically counts up from 1 to 12 in the x4 register.

    LI   x4, 1         ; x4 ← 1
    LI   x5, 12        ; x5 ← 12
loop:
    ADDI x4, x4, 1     ; x4 ← x4 + 1
    BLT  x4, x5, loop  ; IF x4 < x5 GOTO loop
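If you want to trace the loop above without an assembler, here is a toy Python model of just the three instructions it uses. The tuple format and the run function are my own invention, and the label is given as an index into the program list.

```python
def run(program, labels, max_steps=1000):
    """Run a list of (op, a, b, c) tuples on a toy register machine."""
    regs = {}
    pc = 0  # index of the next instruction to execute
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, a, b, c = program[pc]
        pc += 1
        if op == "LI":        # a ← constant b
            regs[a] = b
        elif op == "ADDI":    # a ← b + constant c
            regs[a] = regs[b] + c
        elif op == "BLT":     # compare and branch in a single instruction
            if regs[a] < regs[b]:
                pc = labels[c]
    return regs


program = [
    ("LI", "x4", 1, None),
    ("LI", "x5", 12, None),
    ("ADDI", "x4", "x4", 1),        # index 2, where the loop label points
    ("BLT", "x4", "x5", "loop"),
]
print(run(program, {"loop": 2})["x4"])  # 12
```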

Learn more: RISC-V Assembly Interpreter

The Branch Less Than BLT instruction makes a jump to the instruction at label loop if register x4 < x5. While this is quite natural to work with, it is not how assembly code normally works.

Instead registers are compared with a separate instruction which causes bits in the status register of the CPU to be set. The individual bits are referred to as flags and usually have single letter names such as C, N, Z and V:

  • Carry C - Set when you have added or done another operation producing a number that is too large to fit into the destination register. If that is the case this flag will be set to 1.
  • Negative N - The result of a computation or comparison was a negative number.
  • Zero Z - Both numbers were equal, or the comparison or test otherwise produced zero as the result.
  • Overflow V - The result does not fit when the numbers are treated as signed. A register has no separate place to store a negative sign, so one bit of the number is used for it. E.g. an unsigned 8-bit number goes from 0 to 255, while a signed one goes from -128 to 127. Thus if you added two signed 8-bit numbers and the result went above 127, you would have an overflow.
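To see how the four flags come out of a single addition, here is a Python sketch for 8-bit numbers. The function add8_flags is a hypothetical helper of mine. It takes the raw bit patterns (0 to 255) and works out the flags the way described in the list above.

```python
def add8_flags(a: int, b: int) -> dict:
    """Add two 8-bit values (given as 0..255 bit patterns) and report the flags."""
    total = a + b
    result = total & 0xFF  # keep only the 8 bits that fit in the register
    sign_a, sign_b, sign_r = a >> 7, b >> 7, result >> 7
    return {
        "result": result,
        "C": int(total > 0xFF),    # unsigned result did not fit
        "N": sign_r,               # top bit set: negative when read as signed
        "Z": int(result == 0),
        # Signed overflow: both inputs have the same sign but the result differs.
        "V": int(sign_a == sign_b and sign_r != sign_a),
    }


print(add8_flags(100, 100))  # {'result': 200, 'C': 0, 'N': 1, 'Z': 0, 'V': 1}
print(add8_flags(200, 100))  # {'result': 44, 'C': 1, 'N': 0, 'Z': 0, 'V': 0}
```

In the first case 100 + 100 fits as an unsigned number, but not as a signed one, so only V (and N) is set. In the second case it is the other way around.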

Okay, let us look at a more traditional instruction-set like x86. This example uses 32-bit registers, which have an e prefix, while 64-bit registers would have had an r prefix:

    MOV ecx, 12     ; ecx ← 12
    MOV eax, 1      ; eax ← 1
loop:
    ADD eax, 1      ; eax ← eax + 1
    CMP eax, ecx    ; set flags from eax - ecx
    JL  loop        ; IF eax < ecx GOTO loop

Learn more: x86 Assembly Interpreter

In this case we do the comparison between registers eax and ecx using a separate CoMPare CMP instruction. It works similarly to subtract SUB, but only the flags in the status register get set. No result is stored in any of the operands.

The Jump Less JL instruction looks at the status register to determine if eax < ecx, before making the jump to the loop location.

The code for ARM processors is quite similar, except that here we are showing 32-bit ARM instructions. Previously we showed 64-bit ARM instructions, which are slightly different:

    MOV r4, #1       ; r4 ← 1
    MOV r5, #12      ; r5 ← 12
loop:
    ADD r4, r4, #1   ; r4 ← r4 + 1
    CMP r4, r5       ; set flags from r4 - r5
    BLT loop         ; IF r4 < r5 GOTO loop

Learn more: ARM Assembly Interpreter

Now the ARM example looks very similar to x86, but wasn’t ARM supposed to be quite different? Yes, because in addition to branching like this, ARM has conditional instructions.

ARM Can Conditionally Execute Single Instructions

Consider this nonsense x86 assembly code example with some branching instructions:

    CMP ax, 42
    JE  equal
    MOV bx, 12
    JMP done
equal:
    MOV bx, 33
done:

We are checking if the AX register is equal to 42 using the CoMPare instruction CMP. This sets the status flags. Afterwards we use the Jump if Equal JE to jump to the label equal if AX contained the number 42.

If that was the case we put 33 into register BX with the MOVe MOV instruction. Notice how we need to use a bunch of jumps to make sure MOV BX, 12 is only run when they are not equal and nothing else.

With ARM this kind of code becomes a lot easier, due to conditional instructions.

CMP   r6, #42
LDREQ r3, =33   ; r3 ← 33 if r6 = 42
LDRNE r3, =12   ; r3 ← 12 if r6 ≠ 42

What happens here is that we compare register r6 with the number 42. If they are equal the Zero flag Z is set to 1. Every comparison updates the flags in the status register, and the conditional instructions that follow look at those flags.

Almost every instruction in the 32-bit ARM instruction set can be conditionally executed by adding a two-letter suffix such as EQ, NE, GT or LT, which correspond to the comparison operators =, ≠, > and <.

So a normal load instruction that should always be executed is written LDR, while one that should only be executed if last comparison result was equal would be LDREQ. A load for not equal would be LDRNE.

This applies to any instruction. So a normal unconditional add is written ADD while an add which should only be run if last result was not equal would be written as ADDNE.
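Each suffix is really just a test on the flags. Here is a Python sketch of the four suffixes mentioned above, using the flag tests ARM defines for them. The compare function is a simplified stand-in for CMP which I made up for this example. It assumes small numbers, so the overflow flag is always 0 and the carry flag is left out.

```python
# Each suffix is a test on the flags left behind by the last comparison.
CONDITIONS = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "LT": lambda f: f["N"] != f["V"],                   # signed less than
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],   # signed greater than
}


def compare(a: int, b: int) -> dict:
    """Simplified CMP: the flags describe the result of a - b."""
    diff = a - b
    # Assumption: values are small enough that the subtraction never overflows.
    return {"N": int(diff < 0), "Z": int(diff == 0), "V": 0}


def should_execute(suffix: str, flags: dict) -> bool:
    """Decide whether an instruction with this suffix would run."""
    return CONDITIONS[suffix](flags)


flags = compare(42, 42)               # like CMP r6, #42 with r6 = 42
print(should_execute("EQ", flags))    # True
print(should_execute("NE", flags))    # False
```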

In addition we can add an S to various operations to cause them to update the status flags.

SUBS  r3, r6, #42   ; r3 ← r6 - 42, and update the flags
ADDEQ r3, #33       ; r3 ← r3 + 33 if r6 - 42 = 0
ADDNE r3, #12       ; r3 ← r3 + 12 if r6 - 42 ≠ 0

Above I am adding 33 to register r3 if r6 was equal to 42. If they were not equal I am adding 12. SUB is the normal subtraction, while SUBS also updates the status flags. In effect it does the same as CMP, except the result is stored in register r3.

x86 Conditional Moves

While x86 is not as well known for conditional instructions, i686 actually added conditional move CMOV as an optional extension. We can thus replicate our ARM example. CMOV cannot take a constant as its source, so we first place the constants in registers:

MOV    cx, 33
MOV    dx, 12
CMP    ax, 42
CMOVE  bx, cx    ; bx ← 33 if ax = 42
CMOVNE bx, dx    ; bx ← 12 if ax ≠ 42

The way you read this is that CMOVE stands for Conditional MOVe Equal, while CMOVNE stands for Conditional MOVe Not Equal.

Why Does ARM have Conditional Execution?

Branching is quite bad for a pipelined microprocessor. With pipelining, instructions are queued up and the execution of one instruction begins before the execution of an earlier one is done. With a jump, the instructions already queued up behind it in the pipeline were just a waste. We need to flush the pipeline. With conditional execution we avoid branching.

Sometimes we have to jump of course, but this cuts down the need for branching a lot.

Several other RISC architectures actually have conditional execution.

Here is an interesting discussion of conditional execution on ARM. Among other things how if-statements can be implemented on ARM.

RISC-V Does Not Have Conditional Execution

Now it may seem conditional execution is a really neat and smart idea, but it turns out the designers of the newer RISC-V instruction set did not like the idea.

Their argument is that by using conditional flags you create a dependency between instructions in the pipeline. E.g. if one instruction sets the conditional execution flag, the following instruction should perhaps not be executed.

They also argue that with branch prediction or Out-of-Order Execution you don’t need conditional execution.

Linus Torvalds Agrees with RISC-V Guys

One of my readers, Bhupati Shukla, gave me an interesting follow-up to this discussion, involving Linus Torvalds, the creator of the Linux operating system, or more specifically the Linux kernel. It turns out there is an email discussion where Linus lets his displeasure with conditional instructions be known. You can read the discussion here. But I am giving the summary. There was a bug with the CMOV instruction, and one of the participants, Alan, remarks:

In early PIV days it made sense but on modern processors CMOV is so pointless the bug should be fixed.

This leads to some discussion about why this is bad, and Linus jumps in to give a more detailed explanation:

CMOV (and, more generically, any “predicated instruction”) tends to
generally a bad idea on an aggressively out-of-order CPU.

You can read what Linus says about it, but I will try to break down what he says using previous code examples. Take our previous example using branching. I am adding some nonsense instructions to it to help explain it better:

    CMP ax, 42   ; CoMPare ax, 42
    JE  equal    ; Jump if Equal to equal
    MOV bx, 12   ; bx ← 12
    JMP done     ; JuMP to done
equal:
    MOV bx, 33   ; bx ← 33
    ADD cx, bx   ; cx ← cx + bx, creates dependency
done:
    SUB ax, cx   ; Nonsense instructions for demo purposes
    ADD cx, 5

If we run this on a CPU with a good branch predictor, it may have figured out that most of the time ax is not equal to 42. The CPU then speculatively executes this sequence of instructions:

MOV bx, 12 
SUB ax, cx
ADD cx, 5

In a superscalar processor with Out-of-Order Execution this is actually very fast to do. None of these instructions depend on the result of another. The calculation of one instruction does not affect the next. Hence all these instructions can be run in parallel, which means more stuff gets done per clock cycle, which equals higher performance.

Now if it turns out we predicted wrong, then all this work is wasted, because we have to run this sequence of instructions instead:

MOV bx, 33
ADD cx, bx ; creates dependency
SUB ax, cx
ADD cx, 5

You can see from this that it would not have been possible to reuse the results from previous calculations. The ADD cx, bx causes the result of SUB ax, cx for instance to change. Thus a branch misprediction has a cost. But when we are right, there is no overhead. We don’t bear the cost of comparisons and jumps at all. Compare this with a conditional move version:

MOV    si, 33    ; CMOV cannot take a constant as its source,
MOV    di, 12    ; so the constants are placed in registers first
MOV    dx, 0     ; dx ← 0
CMP    ax, 42
CMOVE  bx, si    ; bx ← 33 if ax = 42
CMOVNE bx, di    ; bx ← 12 if ax ≠ 42
; Since we can't jump to do multiple instructions conditionally
; we do this instead
CMOVE  dx, bx    ; dx ← bx if ax = 42, otherwise dx stays 0
ADD    cx, dx
SUB    ax, cx
ADD    cx, 5

This becomes more convoluted because we have to find a way to emulate this behavior from the earlier version:

MOV bx, 33   ; bx ← 33
ADD cx, bx   ; cx ← cx + bx, creates dependency

We do this by conditionally setting the value of dx. In the equal case it will be set to bx so that ADD cx, dx becomes the same as ADD cx, bx. Otherwise it is left at 0, which means ADD will have no effect.

Here is the problem with the conditional move version. These two instructions cannot be run in parallel, since they both write to bx and would cancel each other out.

CMOVE  bx, si    ; bx ← 33 if ax = 42
CMOVNE bx, di    ; bx ← 12 if ax ≠ 42

In fact none of these instructions can be run in parallel apart from the last two:

CMOVE  dx, bx
ADD    cx, dx
SUB    ax, cx
ADD    cx, 5

The last two instructions depend on the outcome of ADD cx, dx, which again depends on the outcome of the previous CMOV instructions. Hence every time you run through this code you get more code which has to be run in sequence, slowing everything down. In the branch version, you can make a prediction, and if it is right you are able to run more instructions in parallel because you have fewer dependencies. You don’t pay the price of dealing with a condition every time, only in the occasional case when you are wrong.

As Linus points out, conditional instructions do have value in cases where it is hard to predict the branch. The problem is that a compiler cannot know ahead of time where branches are predictable and unpredictable. If it did, it could put in conditional jumps where branches are predictable and conditional instructions where they are unpredictable and get the best of both worlds.
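The difference in dependencies can be counted with a toy Python model. This is only a sketch under strong assumptions: every instruction takes one step, the CPU renames registers so only true read-after-write dependencies matter, and the conditional setting of dx is modeled as a single instruction. The function critical_path and the instruction tuples are my own, so the numbers are not measurements of any real CPU.

```python
def critical_path(instructions):
    """Length of the longest chain of read-after-write dependencies.

    Each instruction is (name, destinations, sources).
    """
    ready_at = {}  # register or flag name -> step at which its value is ready
    longest = 0
    for _name, dests, sources in instructions:
        # An instruction can run one step after its latest input is ready.
        step = 1 + max((ready_at.get(s, 0) for s in sources), default=0)
        for d in dests:
            ready_at[d] = step
        longest = max(longest, step)
    return longest


predicted_branch = [
    ("MOV bx, 12", ["bx"], []),
    ("SUB ax, cx", ["ax"], ["ax", "cx"]),
    ("ADD cx, 5", ["cx"], ["cx"]),
]
conditional_moves = [
    ("CMP ax, 42", ["flags"], ["ax"]),
    ("CMOVE bx", ["bx"], ["flags", "bx"]),
    ("CMOVNE bx", ["bx"], ["flags", "bx"]),
    ("conditional move to dx", ["dx"], ["flags", "dx", "bx"]),
    ("ADD cx, dx", ["cx"], ["cx", "dx"]),
    ("SUB ax, cx", ["ax"], ["ax", "cx"]),
    ("ADD cx, 5", ["cx"], ["cx"]),
]
print(critical_path(predicted_branch))   # 1
print(critical_path(conditional_moves))  # 6
```

In this model the correctly predicted branch version can do everything in one step, while the conditional move version needs six steps one after another, every single time.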

Since that is not possible we are better off with putting conditional jumps everywhere.

Final Remarks

I actually planned on writing more on this, but I don’t know what other areas are good to focus on. If you are interested in comparison of some more areas, drop me a line. Please note I am not an expert on any of this so I cannot write about too deep technical aspects.

Written by

Geek dad, living in Oslo, Norway with passion for UX, Julia programming, science, teaching, reading and writing.
