Part II: Ampere Altra ARM Microprocessor Notes
This is my second installment of notes, factoids and reflections on the ARM Ampere Altra 80-core CPU. As with the previous article, this is not much of a story. I am not building up to any particular narrative or punch line. It is just for whoever might be interested in various facts I observed. I hope to write a more comprehensive article on Ampere Altra later.
What interests me in particular is comparing it to server offerings from Intel and AMD, as well as to chips from other companies making ARM designs, such as Apple and Amazon. It might also be interesting to compare with what the brand new startup Nuvia is building. Their ARM based Phoenix CPU is promising to revolutionize the server market.
As usual I am basing a lot of my notes on reading through details provided by AnandTech.
Altra “Quicksilver” Design
Altra is really just the name of a series of chips. The current design is called Quicksilver and features 80 Neoverse-N1 cores. Remember from last time that Neoverse cores are specifically designed for servers. The design was the outcome of years of cooperation between ARM and Ampere Computing.
The Quicksilver design features up to 80 Neoverse-N1 cores, integrated within an Arm CMN-600 mesh interconnect that features 32MB of distributed system level cache.
What exactly is a mesh interconnect, you wonder? Well, whenever you have lots of different units inside a chip you need an interconnect. With just one core, you would not need it. But whenever you have many, you need ways for them to communicate and access shared resources such as cache and memory.
Interestingly, if you start digging into this, you will find that the inside of a modern microprocessor is surprisingly similar to a computer network. Just like a network can have different topologies, packet formats and ways of connecting, so can the inside of a CPU. A “mesh interconnect” is thus really just a particular form of network topology tying everything together. My understanding is that this type is optimized for high throughput, at the cost of a fair amount of complexity. The Apple M1 also has something similar called Fabric, which connects all the different parts inside the SoC.
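To make the network analogy a bit more concrete, here is a minimal sketch of a 2D mesh as a grid of nodes, where the cost of a message is the number of hops between two grid positions. The grid sizes and the simple "travel along X, then along Y" routing are assumptions for illustration only; they are not details of the CMN-600 or of the Altra layout.

```python
from itertools import product

def mesh_hops(a, b):
    """Hops between two (x, y) nodes, assuming a packet moves along X, then Y."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def average_hops(width, height):
    """Average hop count over all ordered pairs of distinct nodes in the grid."""
    nodes = list(product(range(width), range(height)))
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(mesh_hops(a, b) for a, b in pairs) / len(pairs)

# Hypothetical grid sizes, chosen only to show how distance grows with the mesh.
for size in (2, 4, 8):
    print(size, "x", size, "mesh, average hops:", round(average_hops(size, size), 2))
```

The point of the sketch is only that a bigger mesh means more hops on average between a core and whatever cache slice or memory controller it needs to reach, which is the kind of trade-off an interconnect design has to manage.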
Different companies that design intellectual property for chip makers will sell different designs for chips. E.g. in theory you could buy another interconnect from another company to connect your cores, cache, memory controller and whatever else needs to be connected.
One interesting thing about Ampere is that they seem to be able to get higher clock frequency than the competition. AnandTech compares with Graviton2 ARM based solutions from Amazon:
Beyond the higher core-count, what also stands out for the Altra system in comparison to the Graviton2 are the significantly higher clock frequencies up to 3.3GHz for the top SKU, compared to the 2.5GHz of the Amazon chip — a 32% difference that should lead in a corresponding per-core performance advantage for the Ampere system.
However, 3.3 GHz would be a lot higher than AMD's offerings as well. Their top of the line offering for servers, EPYC-7002, runs at a base frequency of 2.6 GHz, although it can boost up to 3.4 GHz. But boost is not a good comparison, as it cannot be sustained over a long time, which is what matters on servers, seen from the provider's side at least. For customers a boost may be beneficial.
The memory offerings seem to be comparable to the AMD EPYC-7002. The Altra also has 8-channel DDR4 memory running at 3200MHz.
On a system side, the Altra Quicksilver chip features 8 DDR4–3200 memory controllers for a theoretical peak 204GB/s per socket bandwidth.
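The 204GB/s figure in the quote can be reproduced with simple arithmetic. A minimal sketch, assuming the standard 64-bit (8 byte) data width of a DDR4 channel; the function name is my own:

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes every channel moves bus_width_bits per transfer, which is
    64 bits for a standard DDR4 channel.
    """
    bytes_per_transfer = bus_width_bits // 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

# 8 channels of DDR4-3200, as quoted for the Altra Quicksilver.
altra_peak = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=3200)
print(f"{altra_peak:.1f} GB/s per socket")  # prints "204.8 GB/s per socket"
```

This is a theoretical ceiling. Measured bandwidth will be lower.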
You can put two 80-core chips on a board. While it offers more cores than the AMD EPYC, it does not have as good communication between the chips:
Ampere achieves dual-socket connectivity through two dedicated PCIe Gen4 x16 links at 25GT/s featuring CCIX protocol compatibility. The bandwidth here is half of a comparable AMD Rome system which features up to 4x x16 Gen4 links between sockets, and it’s also the first time we’ll be seeing CCIX’s cache coherency capabilities used in this way, so that’s definitely a unique design on the part of the Altra system.
The IO of the Ampere is very similar to AMD EPYC offerings. It also has 128 PCIe lanes, and it has the same max memory of 4TB.
Across the board, all SKUs feature full 128x lanes of PCIe I/O connectivity, and the full 8-channel DDR4–3200 memory capabilities, capable of hosting up to 4TB of DRAM on all models without any artificial feature limitations.
AMD seems to be the price leader so let us compare what AnandTech says about AMD first:
An AMD EPYC 7742 with 64 cores and 225W TDP comes in at $6950,
Here Ampere seems to offer a very good value proposition, with more cores at a lower price:
Ampere’s Q80–33 with 80 cores at a 250W TDP comes a price tag of “only” $4050 seems a steal
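To put the two price tags on the same footing, here is a small sketch that computes price per core from the numbers quoted above. It says nothing about per-core performance, which differs between the chips; the dictionary layout and function name are just my own illustration.

```python
# Figures taken from the AnandTech quotes above.
chips = {
    "AMD EPYC 7742": {"price_usd": 6950, "cores": 64, "tdp_w": 225},
    "Ampere Q80-33": {"price_usd": 4050, "cores": 80, "tdp_w": 250},
}

def price_per_core(chip):
    """List price divided by core count, in USD."""
    return chip["price_usd"] / chip["cores"]

for name, chip in chips.items():
    print(f"{name}: ${price_per_core(chip):.2f} per core")
```

By this crude measure the EPYC costs a bit more than twice as much per core as the Q80-33.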
Clock Frequency Notes
AnandTech has some interesting notes on clock frequency, because this goes up and down on all servers depending on workload:
For example, a low-IPC high-memory workload on an EPYC 7742 will result in low power consumption on the part of the cores, so the chip will clock them up to 3200MHz on all 64 cores to fill the 225W TDP. A high-IPC workload that stresses the cores and result in higher power might end up with an average runtime frequency of 2600MHz across all cores — but in both cases the average power consumption will always settle around the 225W TDP figure.
Let us unpack this. When the cores are doing heavy compute work, each one draws a lot of power, so an AMD EPYC runs at a lower clock frequency. When the cores mostly wait on memory, each one draws less power, so the chip raises the clock frequency to use up the headroom. Sure, that increases the heat generated at a given workload, but either way the total heat produced by the whole chip still settles around 225W.
Still, from what I can interpret from AnandTech, Ampere has the clear advantage in this area. In practical terms, the power they actually draw is often well below their rated Thermal Design Power (TDP), which is not the case for x86 chips:
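The behavior described in the AnandTech quote can be sketched with a toy power model. This is a rough illustration under loud assumptions, not how AMD's power management actually works: I assume package power is cores × k × activity × clock³ (the cubic term stands in for voltage rising with frequency), I ignore everything outside the cores, and I calibrate the constant k so that a fully active workload lands at 2.6 GHz within 225W, as in the quote. The 3.4 GHz cap is the boost figure mentioned earlier. All function names are hypothetical.

```python
def calibrate_k(tdp_w, cores, clock_ghz, activity=1.0):
    """Solve tdp = cores * k * activity * clock^3 for the constant k."""
    return tdp_w / (cores * activity * clock_ghz ** 3)

def package_power_w(cores, k, activity, clock_ghz):
    """Toy model: total core power at a given all-core clock."""
    return cores * k * activity * clock_ghz ** 3

def sustained_clock_ghz(tdp_w, cores, k, activity, max_clock_ghz):
    """Highest all-core clock that fits the power budget, capped at max_clock_ghz."""
    uncapped = (tdp_w / (cores * k * activity)) ** (1.0 / 3.0)
    return min(uncapped, max_clock_ghz)

# Calibrate on the high-IPC case from the quote: 64 cores, 225W, 2.6 GHz.
k = calibrate_k(tdp_w=225, cores=64, clock_ghz=2.6)

# Under this model, the activity level at which the budget allows 3.2 GHz.
light_activity = (2.6 / 3.2) ** 3

heavy = sustained_clock_ghz(225, 64, k, activity=1.0, max_clock_ghz=3.4)
light = sustained_clock_ghz(225, 64, k, activity=light_activity, max_clock_ghz=3.4)
print(f"heavy workload: {heavy:.2f} GHz, light workload: {light:.2f} GHz")

# With very low activity the clock cap binds, and power drops below the budget.
idle_ish = sustained_clock_ghz(225, 64, k, activity=0.2, max_clock_ghz=3.4)
print(f"capped at {idle_ish:.2f} GHz, drawing {package_power_w(64, k, 0.2, idle_ish):.0f}W")
```

The last case is the interesting one: once the frequency cap is what limits the chip rather than the power budget, power consumption falls below the rated TDP.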
Fundamentally, the Altra’s handling of frequency and power in such a manner is simply a by-product of the Neoverse-N1 cores not being able to clock in higher than 3.3GHz, and the cores being so efficient, that they have power leeway in many workloads, while the x86 player’s implementations simply clock in higher when given the opportunity, because they can — and when in power hungry situations, clocking lower, because they have to.
Ok, to reinterpret this: AMD will keep pushing up to 225W of heat generation all the time to keep clock frequencies up, while Ampere simply cannot clock higher than 3.3 GHz and thus falls way below 250W of heat generation under most workloads.
The comments on the AnandTech article have some interesting observations:
Each Neoverse N1 core with 1MB L2 is just 1.4mm2, so 80 of them add up to 112mm2. The die size is estimated at about 350mm2, so tiny compared to the total ~1100mm2 in EPYC 7742. So performance/area is >3x that of EPYC. Now that is efficiency!
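The commenter's arithmetic is easy to check. A small sketch using only the figures from the comment; note that the die sizes are the commenter's estimates, and that the ">3x performance/area" claim only follows from the area ratio if the two chips perform roughly the same, which is an assumption here.

```python
# Figures from the quoted comment (die sizes are estimates).
n1_core_area_mm2 = 1.4      # one Neoverse N1 core with 1MB L2
altra_cores = 80
altra_die_mm2 = 350
epyc_7742_total_mm2 = 1100

core_area_mm2 = altra_cores * n1_core_area_mm2
core_share = core_area_mm2 / altra_die_mm2
area_ratio = epyc_7742_total_mm2 / altra_die_mm2

print(f"Core area: {core_area_mm2:.0f} mm2 ({core_share:.0%} of the estimated die)")
print(f"EPYC 7742 uses about {area_ratio:.1f}x the silicon area")
```

So the cores themselves would take up roughly a third of the estimated die, with the rest going to cache, interconnect, memory controllers and IO.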
Although this may simply be down to different cache sizes:
Cache is a big part of the die size for the AMD chip and the N1 has much less of it which makes the die size smaller. AMD’s Desktop IGP parts with way less cache perform very similarly in many workloads to those with the extra cache and the same has been true for intel parts over the years. Some workloads don’t benefit much at all from the extra cache and some do which makes choosing the benchmarks more important.
This debate on size goes back and forth a lot in the comment field, though:
Using Zen 2 is not correct since it uses much larger transistors. Using Kirin 990 5G density gives an estimate of 330mm2 for Graviton 2. The size of N1 cores has been published for 7nm, so we know it is 1.4mm2. You’re right that PCIe lanes would add to it as well — assuming the PHYs have the same size as DDR PHYs at the same speed, 64 lanes would be about 12–15mm2. That would increase it to about 365mm2.
Another commenter, Wilco1, makes the case that the Ampere chip is in fact not that large in relative terms. Why does this matter? Because very large chips will have poor yield (due to defects), causing prices to rise.
How exactly is it big? It’s tiny for a server chip — 80 cores at about half the die size of a typical 28-core Xeon (~700mm2). And TSMC 7nm yield is extremely good even for much larger chips like GPUs.
My takeaway thus far, before looking closer at the performance, is that Ampere has a clear edge. Things like memory and throughput seem pretty standard. But the price, heat production and number of cores seem like a clear win relative to the competition, especially relative to Intel.