ARM vs RISC-V Vector Extensions

A comparison of the RISC-V vector extension (RVV) and ARM scalable vector extension (SVE/SVE2).

Performance increases have been stalling, creating a need to use more transistors for parallel processing in different ways, whether multi-core, vector processing or out-of-order execution.
With Single-Instruction-Multiple-Data (SIMD) instructions, unlike normal Single-Instruction-Single-Data (SISD) instructions, each instruction processes multiple independent streams of data.
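
To make the distinction concrete, here is a minimal C sketch. The names LANES, add_sisd and add_simd_model are made up for illustration, and the lane count of 4 is an assumption. The inner loop only models what a single SIMD instruction would do in hardware; it is not how a CPU or compiler actually implements it.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* hypothetical number of elements per vector register */

/* SISD: each loop iteration stands for one instruction adding one pair. */
void add_sisd(const int *a, const int *b, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD model: each OUTER iteration stands for one vector instruction
   that adds LANES independent pairs at once.
   Returns the number of modeled vector instructions. */
size_t add_simd_model(const int *a, const int *b, int *out, size_t n) {
    assert(n % LANES == 0);  /* keep the sketch simple: no leftover tail */
    size_t instructions = 0;
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
        ++instructions;
    }
    return instructions;
}
```

For 8 elements the SISD version performs 8 additions one by one, while the SIMD model needs only 2 modeled instructions to produce the same result.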

The Problem with ARM Scalable Vector Instructions (SVE)

While researching SVE, it was not obvious to me why I struggled to grasp it, but when picking up my RISC-V book and re-reading the vector-extensions chapter it became clear.

LD1D z1.D, p0/Z, [x1, x3, LSL #3]

The Beauty of RISC-V Vector Instructions

An overview of all the RISC-V Vector extension (RVV) instructions fits on one book page. There are not many of them, and unlike ARM SVE, they have a pretty simple syntax. Here is a vector load instruction for RISC-V:

VLD v0, x10

Compare that with the ARM loads for Neon, scalar floating point and SVE:

LD1 v0.16b, [x10]  # Load 16 byte values at address in x10
LDR d0, [x10]    # Load 64-bit value from address in x10
LD1B z0.b, p0/z, [x10] # Load ? number of byte elements
LD1D z0.d, p0/z, [x10] # Load double word (64-bit) elements

Registers with the Same Name?

The d, v and z registers share a number because they occupy the same spot. Let me clarify. Every CPU contains a block of memory called a register file, or to be more specific, possibly several register files. The register file is the memory holding the registers. You don't address memory cells in the register file like regular main memory; instead you refer to sections of it using register names.

ARM floating point registers overlap in the same register file (the memory in the CPU holding registers). Taking register 3 as an example:
  • v3 - the lowest 128 bits of z3. A Neon register.
  • d3 - the lowest 64 bits of v3.
  • s3 - the lowest 32 bits of d3.

RISC-V, by contrast, keeps its scalar floating point and vector registers apart:
  • f0 to f31 - scalar floating point registers.
  • v0 to v31 - vector registers. Their length is not fixed by the ISA.
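
The ARM overlap can be sketched in C as a byte array with several views onto its low end. This is only a model under stated assumptions: reg_slot, write_view and read_view are hypothetical names, and Z_BYTES assumes a 256-bit SVE implementation, which real hardware is free to differ from. The sketch also models the AArch64 rule that writing a narrower view clears the remaining upper bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define Z_BYTES 32  /* assumed 256-bit SVE vector; real hardware may differ */

/* Widths of the narrower views onto the same slot: s3, d3, v3. */
enum { S_BYTES = 4, D_BYTES = 8, V_BYTES = 16 };

/* One slot of the register file. By convention in this model,
   bytes[0] holds the lowest 8 bits of the register. */
typedef struct {
    uint8_t bytes[Z_BYTES];
} reg_slot;

/* Write through a view of `width` bytes: the low bytes take the new
   value and the rest of the slot is zeroed. */
void write_view(reg_slot *r, const void *src, size_t width) {
    assert(width <= Z_BYTES);
    memset(r->bytes, 0, Z_BYTES);
    memcpy(r->bytes, src, width);
}

/* Read through a view: copies out the low `width` bytes of the slot. */
void read_view(const reg_slot *r, void *dst, size_t width) {
    assert(width <= Z_BYTES);
    memcpy(dst, r->bytes, width);
}
```

Reading V_BYTES from a slot gives the "v3" view and reading D_BYTES gives the "d3" view of the very same storage, which is why the names share a number.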

ARM vector instruction complexity

I can only scratch the surface of the ARM vector instructions because there are a ton of them. Just locating a typical load instruction for Neon and SVE2 was actually quite time-consuming. I looked through a lot of ARM documentation and blog entries. Doing the same for RISC-V was trivial. Almost all RISC-V instructions fit on a double-sided sheet of paper. There are only three vector load instructions: VLD, VLDS and VLDX.

How ARM and RISC-V Deal with Variable Length Vectors

This is quite an interesting part, because ARM and RISC-V use very different approaches, and I think the simplicity and flexibility of the RISC-V solution really shines.

RISC-V Variable Length

To begin vector processing you do two things:

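  • VSETDCFG - set the data configuration. Say which element type the registers hold and how many registers you want enabled.
  • SETVL - SET Vector Length. Say how many elements you want. There is a max number of elements, MVL (max vector length), which you cannot exceed.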
The RISC-V register file can be configured to have fewer than the maximum of 32 registers. It can have, for example, 8 registers or 2 registers, which are then simply larger. The enabled registers can consume all the space of the register file. With a 512-byte register file and 4-byte elements:
Two registers: 512 bytes / 2 = 256 bytes per register
256 bytes / 4 bytes per element = 64 elements
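
The same arithmetic can be written as a tiny C helper. The function name max_vector_length and its parameters are my own, chosen for illustration; on real hardware the CPU works this out for you.

```c
#include <assert.h>
#include <stddef.h>

/* Elements per vector register when `enabled_regs` registers share
   a register file of `file_bytes` bytes. */
size_t max_vector_length(size_t file_bytes, size_t enabled_regs,
                         size_t elem_bytes) {
    assert(enabled_regs > 0 && elem_bytes > 0);
    size_t bytes_per_reg = file_bytes / enabled_regs;
    return bytes_per_reg / elem_bytes;
}
```

With a 512-byte register file, max_vector_length(512, 2, 4) gives 64 elements, while enabling all 32 registers leaves only max_vector_length(512, 32, 4) = 4 elements per register.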

Calculating Max Vector Length (MVL)

Let us look a bit at how this works in practice. The CPU knows of course how large its register file is. The programmer doesn’t know this and is not supposed to either.

LI       x5, 2<<25  # Load register x5 with 2<<25
VSETDCFG x5         # Set data configuration to x5

This configuration does two things:
  • Enable two vector registers
  • Set element type to 64-bit floating point values
SETVL rd, rs  ; rd ← min(MVL, rs), VL ← rd
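
Here is a small C model of what SETVL computes and how a loop uses it to walk through an array in vector-sized strips. The functions setvl and count_passes are hypothetical helpers written for this sketch, and MVL is passed in as a parameter because a real program never hard-codes it.

```c
#include <assert.h>
#include <stddef.h>

/* Model of SETVL: the result, min(mvl, requested), becomes the
   active vector length. */
size_t setvl(size_t mvl, size_t requested) {
    return requested < mvl ? requested : mvl;
}

/* Counts how many passes a strip-mined loop needs to cover n elements. */
size_t count_passes(size_t mvl, size_t n) {
    assert(mvl > 0);
    size_t passes = 0;
    while (n != 0) {
        size_t vl = setvl(mvl, n);  /* a full vector, or the shorter tail */
        n -= vl;
        ++passes;
    }
    return passes;
}
```

With an MVL of 32 and 100 elements, the loop runs four times with vector lengths 32, 32, 32 and 4. The final short pass needs no special tail-handling code.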

ARM Variable Length

With ARM you don’t specifically set vector length. Instead you sort of indirectly set the vector length by using predicate registers. These are bit-masks, which you use to enable and disable elements in a vector register. Predicate registers also exist on RISC-V but don’t have the same central role as on ARM.

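WHILELT p3.d, x1, x4

In pseudocode, with M being the number of elements in the predicate register:

i = 0
while i < M
    if x1 < x4
        p3[i] = 1
    else
        p3[i] = 0
    end
    i += 1
    x1 += 1
end

With three elements left to process and M = 7, for instance, p3 ends up as:

1110000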
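ADD z4.D, p0/M, z0.D, z1.D ; z4[p0] ← z0[p0] + z1[p0]
; (schematic: the real predicated SVE ADD requires the destination to be the same register as the first source)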
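
The two ideas above, building a mask and then using it to gate an operation, can be modeled in C. This is a sketch under assumptions: LANES fixes the vector at four 64-bit lanes, the predicate is a plain bool array, and whilelt and add_merging are illustrative names rather than real intrinsics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed: four 64-bit lanes, i.e. a 256-bit vector */

/* Model of WHILELT: lane i is active while (start + i) < limit. */
void whilelt(bool pred[LANES], int64_t start, int64_t limit) {
    assert(start >= 0 && limit >= 0);  /* sketch ignores overflow cases */
    for (size_t i = 0; i < LANES; ++i) {
        pred[i] = (start + (int64_t)i) < limit;
    }
}

/* Model of a merging (/M) predicated add: active lanes receive a + b,
   inactive lanes keep whatever dst already held. */
void add_merging(int64_t dst[LANES], const bool pred[LANES],
                 const int64_t a[LANES], const int64_t b[LANES]) {
    for (size_t i = 0; i < LANES; ++i) {
        if (pred[i]) {
            dst[i] = a[i] + b[i];
        }
    }
}
```

Calling whilelt(p, 5, 7) activates only the first two lanes, so a following add_merging touches two elements and leaves the other two untouched. That is how ARM handles the tail of an array without a separate vector length register.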

DAXPY Code Example

We are going to look at how this C function would end up with different vector instructions:

#include <stddef.h>

void daxpy(size_t n, double a, double x[], double y[]) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] * a + y[i];
    }
}
The function computes aX + Y. For RISC-V, the arguments of daxpy(size_t n, double a, double x[], double y[]) arrive in these registers:
  • n - a0, int register (alias for x10)
  • a - fa0, float register (alias for f10)
  • x - a1 (alias for x11)
  • y - a2 (alias for x12)

    LI       t0, 2<<25
    VSETDCFG t0               # enable two 64-bit float regs
loop:
    SETVL    t0, a0           # t0 ← min(mvl, a0), vl ← t0
    VLD      v0, a1           # load vector x
    SLLI     t1, t0, 3        # t1 ← vl * 2³ (in bytes)
    VLD      v1, a2           # load vector y
    ADD      a1, a1, t1       # increment pointer to x by vl*8
    VFMADD   v1, v0, fa0, v1  # v1 += v0 * fa0 (y = a * x + y)
    SUB      a0, a0, t0       # n -= vl (t0)
    VST      v1, a2           # store Y
    ADD      a2, a2, t1       # increment pointer to y by vl*8
    BNEZ     a0, loop         # repeat if n != 0
    RET                       # return
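
If the assembly is hard to follow, this C sketch mirrors the structure of the RISC-V loop step by step. It is a model, not compiler output: MVL is an assumed constant of 4 elements (real hardware decides it), the name daxpy_stripmined is my own, and the inner loop stands in for what the vector instructions do across all active lanes at once.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 4  /* assumed max vector length in elements */

/* Strip-mined daxpy: each pass of the outer loop stands for one trip
   through the assembly loop, processing vl elements. */
void daxpy_stripmined(size_t n, double a, const double *x, double *y) {
    assert(n == 0 || (x != NULL && y != NULL));
    while (n != 0) {                      /* BNEZ a0, loop */
        size_t vl = n < MVL ? n : MVL;    /* SETVL: vl = min(MVL, n) */
        for (size_t i = 0; i < vl; ++i) {
            y[i] = a * x[i] + y[i];       /* VLD, VLD, VFMADD, VST */
        }
        x += vl;                          /* ADD a1, a1, t1 */
        y += vl;                          /* ADD a2, a2, t1 */
        n -= vl;                          /* SUB a0, a0, t0 */
    }
}
```

For n = 6 the loop makes two passes, one with 4 elements and one with 2. The pointer bumps in C advance by vl elements, which is the vl*8 bytes computed by SLLI in the assembly.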
For ARM, the arguments of daxpy(size_t n, double a, double x[], double y[]) are mapped like this:
  • n - x0 register
  • a - d0 float register
  • x - x1 register
  • y - x2 register
  • i - x3 register for the loop counter

daxpy:
    MOV     z2.d, d0                      // a
    MOV     x3, #0                        // i
    WHILELT p0.d, x3, x0                  // i, n
loop:
    LD1D    z1.d, p0/z, [x1, x3, LSL #3]  // load x
    LD1D    z0.d, p0/z, [x2, x3, LSL #3]  // load y
    FMLA    z0.d, p0/m, z1.d, z2.d        // y += x * a
    ST1D    z0.d, p0, [x2, x3, LSL #3]    // store y
    INCD    x3                            // i += number of 64-bit lanes
    WHILELT p0.d, x3, x0                  // i, n
    B.ANY   loop                          // repeat while any lane is active
    RET
LD1D z1.d, p0/z, [x1, x3, LSL #3]
[x1, x3, LSL #3] = x1 + x3*2³ = x + i*8 bytes = address of x[i]
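
The scaled-index addressing mode is easy to check in C. In this sketch scaled_address and matches_element are hypothetical helper names, and the shift by 3 assumes 8-byte doubles, which the code asserts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the [base, index, LSL #3] addressing mode:
   byte address = base + (index << 3). */
uintptr_t scaled_address(uintptr_t base, uint64_t index) {
    return base + (uintptr_t)(index << 3);
}

/* Returns 1 if the scaled address for index i is the address of x[i]. */
int matches_element(const double *x, uint64_t i) {
    assert(sizeof(double) == 8);  /* LSL #3 assumes 8-byte elements */
    return scaled_address((uintptr_t)x, i) == (uintptr_t)&x[i];
}
```

So shifting the index left by 3 is just the multiplication by sizeof(double) that C does implicitly whenever you write x[i].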

Conclusion

As a beginner to vector coding, I must say that ARM is just way too complicated. That is not because ARM is bad: I have also looked at Intel AVX instructions, and they look 10x worse. I am most definitely not going to spend time understanding AVX, given the effort required to grasp SVE and Neon.

Geek dad, living in Oslo, Norway with passion for UX, Julia programming, science, teaching, reading and writing.
