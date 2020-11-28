Why is Apple’s M1 Chip So Fast?

Real-world experience with the new M1 Macs has started ticking in. They are fast. Real fast. But why? What is the magic?

On YouTube I watched a Mac user who had bought an iMac last year. It was maxed out with 40 GB of RAM and cost him about $4000. He watched in disbelief as his hyper-expensive iMac was demolished by his new M1 Mac Mini, which he had paid a measly $700 for.

In real-world test after test, the M1 Macs are not merely inching past top-of-the-line Intel Macs, they are destroying them. In disbelief, people have started asking how on earth this is possible.

If you are one of those people, you have come to the right place. Here I plan to break down, into digestible pieces, exactly what Apple has done with the M1. Specifically, the questions I think a lot of people have are:

What are the technical reasons this M1 chip is so fast?

Has Apple made some really exotic technical choices to make this possible?

How easy will it be for competitors such as Intel and AMD to pull the same technical tricks?

Sure, you could try to Google this, but if you try to learn what Apple has done beyond the superficial explanations, you will quickly get buried in highly technical jargon, such as the M1 using very wide instruction decoders, an enormous re-order buffer (ROB) etc. Unless you are a CPU hardware geek, a lot of this will simply be gobbledegook.

To get the most out of this story I advise reading my earlier piece: What Does RISC and CISC Mean in 2020? There I explain what a microprocessor (CPU) is as well as various important concepts such as:

Instruction Set Architecture (ISA)

Pipelining

Load / Store Architecture

Microcode vs Micro-operations

But if you are impatient, I will do a quick version of the material you need to understand to grasp my explanation of the M1 chip.

What is a Microprocessor (CPU)?

Normally when speaking of chips from Intel and AMD we talk about central processing units (CPUs) or microprocessors. As you can read more about in my RISC vs CISC story, these pull in instructions from memory. Then each instruction is typically carried out in sequence.

A very basic RISC CPU, not the M1. Instructions are moved from memory along blue arrows into instruction register. There a decoder figures out what the instruction is and enables different parts of the CPU through the red control lines. The ALU adds and subtracts numbers placed in the registers.

A CPU at its most basic level is a device with a number of named memory cells called registers and a number of computational units called arithmetic logic units (ALUs). The ALUs perform things like addition, subtraction and other basic math operations. However, they are only connected to the CPU registers. If you want to add up two numbers, you have to get those two numbers from memory and into two registers in the CPU.

Here are some examples of typical instructions that a RISC CPU, such as the one found in the M1, carries out.

load r1, 150

load r2, 200

add r1, r2

store r1, 310

Here r1 and r2 are the registers I talked about. Modern RISC CPUs cannot do operations on numbers which are not in a register. E.g. they cannot add two numbers residing in two different locations in RAM. Instead they have to pull these two numbers into separate registers. That is what we do in this simple example. We pull in the number at memory location 150 in the RAM and put it into register r1 in the CPU. Next we put the contents of address 200 into register r2. Only then can the numbers be added with the add r1, r2 instruction. Finally, store r1, 310 writes the result from register r1 back to memory location 310.
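
To make the four instructions above concrete, here is a minimal sketch of a toy load/store machine in Python. It is not a model of the M1 or of any real instruction set: the `run` function, the tuple encoding of instructions and the values placed at addresses 150 and 200 are all assumptions made up for illustration.

```python
# Toy load/store machine: an illustrative sketch, not a real CPU model.
def run(program, memory):
    """Execute a list of (op, a, b) tuples against a dict-based memory."""
    registers = {}
    for op, a, b in program:
        if op == "load":      # load reg, addr: copy memory[addr] into reg
            registers[a] = memory[b]
        elif op == "add":     # add reg1, reg2: reg1 = reg1 + reg2
            registers[a] = registers[a] + registers[b]
        elif op == "store":   # store reg, addr: copy reg into memory[addr]
            memory[b] = registers[a]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return registers

memory = {150: 7, 200: 35}    # arbitrary example values (assumption)
program = [
    ("load", "r1", 150),
    ("load", "r2", 200),
    ("add", "r1", "r2"),
    ("store", "r1", 310),
]
registers = run(program, memory)
print(memory[310])  # 42
```

Notice that the add step never touches `memory` directly. It only works on values that an earlier load has placed in a register, which is the whole point of a load/store design.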

An old mechanical calculator with two registers: the accumulator and input register. Modern CPUs typically have more than a dozen registers, and they are electronic rather than mechanical.

The concept of registers is old. E.g. on this old mechanical calculator, the register is what holds the numbers you are adding. This is likely the origin of the term cash register. The register is where you registered input numbers.

The M1 is not a CPU!

But here is a very important thing to understand about the M1:

The M1 is not a CPU, it is a whole system of multiple chips put into one large silicon package. The CPU is just one of these chips.

Basically the M1 is one whole computer on a chip. The M1 contains a CPU, a Graphics Processing Unit (GPU), memory, input and output controllers and many more things making up a whole computer. This is what we call a System on a Chip (SoC).

M1 is a System on a Chip, meaning all the parts making up a computer are placed on one silicon chip.

Today if you buy a chip whether from Intel or AMD, you actually get what amounts to multiple microprocessors in one package. In the past computers would have multiple physically separate chips on the motherboard of the computer.

Example of a computer motherboard. Memory, CPU, graphics cards, IO controllers, network card and many other components can be attached to the motherboard to communicate with each other.

However because we are able to put so many transistors on a silicon die today, companies such as Intel and AMD began putting multiple microprocessors onto one chip. Today we refer to these chips as CPU cores. One core is basically a full independent chip which can read instructions from memory and perform calculations.

A microchip with multiple CPU cores.

This has for a long time been the name of the game in terms of increasing performance: Just add more general purpose CPU cores. But there is a disturbance in the force. There is one player in the CPU market which is deviating from this trend.

Apple’s Not So Secret Heterogeneous Computing Strategy

Instead of adding ever more general purpose CPU cores, Apple has followed another strategy: They have started adding ever more specialized chips doing a few specialized tasks. The benefit of this is that specialized chips tend to be able to perform their tasks significantly faster using much less electric current than a general purpose CPU core.

This is not entirely new knowledge. For many years already specialized chips such as the graphical processing units (GPUs) have been sitting in Nvidia and AMD graphics cards performing operations related to graphics much faster than general purpose CPUs.

What Apple has done is simply to make a more radical shift in this direction. Rather than just having general purpose cores and memory, the M1 contains a wide variety of specialized chips:

Central Processing Unit (CPU) — The “brains” of the SoC. Runs most of the code of the operating system and your apps.

Graphics Processing Unit (GPU) — Handles graphics-related tasks, such as visualizing an app’s user interface and 2D/3D gaming.

Image Signal Processor (ISP) — Can be used to speed up common tasks done by image processing applications.

Digital Signal Processor (DSP) — Handles more mathematically intensive functions than a CPU. Includes decompressing music files.

Neural Processing Unit (NPU) — Used in high-end smartphones to accelerate machine learning (AI) tasks. These include voice recognition and camera processing.

Video encoder/decoder — Handles the power-efficient conversion of video files and formats.

Secure Enclave — Encryption, authentication and security.

Unified memory — Allows the CPU, GPU and other cores to quickly exchange information.

This is part of the reason why a lot of people working on images and video editing with the M1 Macs are seeing such speed improvements. A lot of the tasks they do can run directly on specialized hardware. That is what allows a cheap M1 Mac Mini to encode a large video file without breaking a sweat, while an expensive iMac has all its fans going full blast and still cannot keep up.

In blue you see multiple CPU cores accessing memory, and in green you see large numbers of GPU cores accessing memory.

Unified memory may confuse you. How is it different from shared memory? And wasn’t sharing video memory with main memory a terrible idea in the past, giving low performance? Yes, shared memory was indeed bad. The reason was that the CPU and GPU had to take turns accessing the memory. Sharing it meant contention for the data bus. Basically the GPUs and CPUs had to take turns using a narrow pipe to push or pull data through.

That is not the case with Unified memory. In Unified memory the GPU cores and CPU cores can access memory at the same time. Thus in this case there is no overhead in sharing memory. In addition the CPU and GPU can tell each other about where some memory is located. Previously the CPU would have to copy data from its area of the main memory to the area used by the GPU. With unified memory, it is more like saying “Hey Mr. GPU, I got 30 MB of polygon data starting at memory location 2430.” The GPU can then start using that memory without doing any copying.
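
The difference between copying data and handing over a location can be sketched in a few lines of Python. This is only an analogy and not how the M1 or any graphics API actually works: a `bytearray` stands in for the memory pool, and the names `pool`, `gpu_private` and `gpu_view` are made up for illustration.

```python
# Analogy only: one bytearray stands in for the shared memory pool.
pool = bytearray(64)
pool[16:20] = b"\x01\x02\x03\x04"    # the "CPU" writes some data at offset 16

# Old style: copy the data into a separate region owned by the "GPU".
gpu_private = bytes(pool[16:20])     # a real copy; costs time and memory

# Unified style: pass only (offset, length); the view shares the storage.
offset, length = 16, 4
gpu_view = memoryview(pool)[offset:offset + length]

pool[16] = 0xFF                      # the "CPU" updates the data afterwards
print(gpu_private[0])  # 1: the private copy is now stale
print(gpu_view[0])     # 255: the view reads the very same bytes
```

The copy has to be redone every time the data changes, while the view costs nothing beyond passing an offset and a length.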

That means you can get significant performance gains from the fact that all the various special co-processors on the M1 can rapidly exchange information with each other by using the same memory pool.

How Macs used GPUs before unified memory. There was even an option of having graphics cards outside the computer using a Thunderbolt 3 cable. There is some speculation that this may still be possible in the future.

Why Don’t Intel and AMD Copy This Strategy?

If what Apple is doing is so smart, why isn’t everybody doing it? To some extent they are. Other ARM chip makers are increasingly putting in specialized hardware.

AMD has also started putting stronger GPUs on some of their chips and is moving gradually towards some form of SoC with the accelerated processing units (APUs), which are basically CPU cores and GPU cores placed on the same silicon die.

AMD Ryzen Accelerated Processing Unit (APU), which combines CPU and GPU (Radeon Vega) on one silicon chip. It does not, however, contain other co-processors, IO controllers or unified memory.

Yet there are important reasons why Intel and AMD cannot simply do the same. An SoC is essentially a whole computer on a chip. That makes it a more natural fit for an actual computer maker, such as HP or Dell. Let me clarify with a silly car analogy: If your business model is to build and sell car engines, it would be an unusual leap to begin manufacturing and selling whole cars.

For ARM Ltd., in contrast, this isn’t an issue. Computer makers such as Dell or HP could simply license ARM intellectual property and buy IP for other chips, to add whatever specialized hardware they think their SoC should have. Next they ship the finished design over to a semiconductor foundry such as GlobalFoundries or TSMC, which manufactures chips for AMD and Apple today.

TSMC semiconductor foundry in Taiwan. TSMC manufactures chips for other companies such as AMD, Apple, Nvidia and Qualcomm.

Here we get a big problem with the Intel and AMD business model. Their business models are based on selling general purpose CPUs, which people just slot onto a large PC motherboard. Thus computer makers can simply buy motherboards, memory, CPUs and graphics cards from different vendors and integrate them into one solution.

But we are quickly moving away from that world. In the new SoC world you don’t assemble physical components from different vendors. Instead you assemble IP (intellectual property) from different vendors. You buy the design for graphics cards, CPUs, modems, IO controllers and other things from different vendors and use that to design a SoC in-house. Then you get a foundry to manufacture this.

Now you have a big problem, because neither Intel, AMD nor Nvidia is going to license their intellectual property to Dell or HP for them to make an SoC for their machines.

Sure Intel and AMD may simply begin to sell whole finished SoCs. But what are these to contain? PC makers may have different ideas of what they should contain. You potentially get a conflict between Intel, AMD, Microsoft and PC makers about what sort of specialized chips should be included because these will need software support.

For Apple this is simple. They control the whole widget. They give you e.g. the Core ML library for developers to write machine learning stuff. Whether Core ML runs on Apple’s CPU or the Neural Engine is an implementation detail developers don’t have to care about.

The Fundamental Challenge of Making Any CPU Run Fast

So heterogeneous computing is part of the reason, but not the sole reason. The general purpose CPU cores on the M1, called Firestorm, are genuinely fast. This is a major deviation from ARM CPU cores in the past, which tended to be very weak compared to AMD and Intel cores.

Firestorm in contrast beats most Intel cores and almost beats the fastest AMD Ryzen cores. Conventional wisdom said that was not going to happen.

Before talking about what makes Firestorm fast it helps to understand what the core idea of making a fast CPU is really about.

In principle you accomplish this through a combination of two strategies:

Perform more instructions in a sequence faster.

Perform lots of instructions in parallel.

Back in the 80s, it was easy. Just increase the clock frequency and the instructions would finish faster. Every clock cycle is when the computer does something. But this something can be quite little. Thus an instruction may require multiple clock cycles to finish because it is made up of several smaller tasks.
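
The relationship between clock frequency and run time is simple arithmetic: run time equals the number of instructions, times the clock cycles each one needs, divided by the clock frequency. Here is a small sketch of it. The function name and all the numbers (one million instructions, four cycles each, 5 MHz versus 10 MHz) are invented for illustration and do not describe any particular chip.

```python
def run_time_seconds(instructions, cycles_per_instruction, clock_hz):
    """Time to run a purely sequential program on a simple, non-pipelined CPU."""
    return instructions * cycles_per_instruction / clock_hz

# Same program, same CPU design, only the clock frequency differs.
slow = run_time_seconds(1_000_000, 4, 5_000_000)     # 5 MHz clock
fast = run_time_seconds(1_000_000, 4, 10_000_000)    # 10 MHz clock
print(slow, fast)  # 0.8 0.4
```

Doubling the clock frequency halves the run time without touching the program, which is why it was such an attractive lever for as long as it lasted.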

However today increasing the clock frequency is next to impossible. That is the whole “End of Moore’s Law” that people have been harping on for over a decade now.

Thus it is really about executing as many instructions as possible in parallel.

Multi-Core or Out-of-Order Processors?

There are two approaches to this. One is to add more CPU cores. From the point of view of a software developer it is like adding threads. Every CPU core is like a hardware thread. If you don’t know what a thread is, then you can think of it as the process of carrying out a task. With two cores, a CPU can carry out two separate tasks concurrently: two threads. The tasks could be described as two separate programs stored in memory or it could actually be the same program performed twice. Each thread needs some book-keeping, such as where in the sequence of program instructions the thread currently is. Each thread may store temporary results which should be kept separate.

In principle a processor can have just one core and run multiple threads. In this case it simply halts one thread and stores its current progress before switching to another. Later it switches back. This doesn’t bring much of a performance enhancement and is only used when a thread may frequently halt to wait for input from the user, data from a slow network connection etc. These may be called software threads. Hardware threads mean you have actual extra physical hardware, such as extra cores, at your disposal to speed things up.
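
Switching between software threads on a single core can be sketched with Python generators. This is a simplified illustration and not how an operating system scheduler is really implemented: the names `task` and `run_on_one_core` are made up, and each generator plays the role of a thread whose progress is remembered while it is paused.

```python
from collections import deque

def task(name, steps):
    """A 'thread': pauses after each step so the core can switch away."""
    for i in range(steps):
        yield f"{name} step {i}"

def run_on_one_core(tasks):
    """Round-robin: run one step of a task, park it, switch to the next."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))   # resume where the task left off
            ready.append(current)         # its progress lives in the generator
        except StopIteration:
            pass                          # task finished, so drop it
    return trace

trace = run_on_one_core([task("A", 2), task("B", 3)])
print(trace)
# ['A step 0', 'B step 0', 'A step 1', 'B step 1', 'B step 2']
```

Only one step runs at any moment, so the total amount of work done per second does not go up. The interleaving only helps when a task would otherwise sit idle waiting for something.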

The problem with this is that the developer has to write code to take advantage of it. Some tasks, such as server software, are easy to write like this. You can imagine processing each connecting user separately. These tasks are so independent from each other that having lots of cores is an excellent choice for servers, especially cloud based services.
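
Here is a rough sketch of what "processing each connecting user separately" can look like, using the thread pool from Python's standard library. The `handle_request` function and the user names are hypothetical. Also note that in standard CPython builds the global interpreter lock keeps threads from running Python code on several cores at once, so this shows the structure of such a program rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(user):
    """Hypothetical request handler; it touches no state shared with others."""
    return f"hello, {user}"

users = ["ann", "bob", "cho", "dee"]

# Each request is independent, so they can be handed to separate workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    replies = list(pool.map(handle_request, users))   # results keep input order

print(replies)
# ['hello, ann', 'hello, bob', 'hello, cho', 'hello, dee']
```

Because no request depends on another, nothing has to be coordinated between the workers. That independence is what makes this kind of workload scale so well with core count.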

The Ampere Altra Max ARM CPU with 128 cores designed for cloud computing, where a lot of hardware threads is a benefit.

That is the reason why you see ARM CPU makers such as Ampere making CPUs such as the Altra Max, which has a crazy 128 cores. This chip is specifically made for the cloud. You don’t need crazy single core performance because in the cloud it is all about having as many threads as possible per watt to handle as many concurrent users as possible.

Apple, in contrast, is at the complete opposite end of the spectrum. Apple makes single user devices. Lots of threads is not an advantage. Their devices are used for gaming, video editing, development etc. They want desktops with beautiful responsive graphics and animations.

Desktop software is generally not made to utilize lots of cores. E.g. a computer game will likely benefit from 8 cores, but something like 128 cores would be a total waste. Instead you would want fewer but more powerful cores.
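
One common way to put numbers on this is Amdahl's law, which estimates the best possible speedup when only part of a program can run in parallel. The sketch below is my illustration rather than a measurement, and the assumption that 80% of a game's work can be spread across cores is invented for the example.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Assumption: 80% of the work can be spread across cores.
for cores in (1, 8, 128):
    print(cores, round(amdahl_speedup(0.8, cores), 2))
# 1 1.0
# 8 3.33
# 128 4.85
```

Under that assumption, going from 8 to 128 cores multiplies the hardware by sixteen but raises the best case speedup only from about 3.3 to about 4.8. The remaining 20% of sequential work is limited by how fast a single core is.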