{"id":7813,"date":"2020-12-08T09:25:37","date_gmt":"2020-12-08T08:25:37","guid":{"rendered":"https:\/\/monodes.com\/predaelli\/?p=7813"},"modified":"2020-12-08T09:25:37","modified_gmt":"2020-12-08T08:25:37","slug":"is-apple-m1-the-new-amiga","status":"publish","type":"post","link":"https:\/\/monodes.com\/predaelli\/2020\/12\/08\/is-apple-m1-the-new-amiga\/","title":{"rendered":"Is Apple M1 the new Amiga?"},"content":{"rendered":"<p><span class=\"d2edcug0 hpfvmrgz qv66sw1b c1et5uql gk29lw5a a8c37x1j keod5gw0 nxhoafnm aigsh9s9 d9wwppkn fe6kdd0r mau55g9w c8b282yb hrzyx87i jq4qci2q a3bd9o3v knj5qynh oo9gr5id\" dir=\"auto\">I loved Apple. Not the Apple of DRM and its golden prison where you can&#8217;t really control\u00a0<strong>your<\/strong> hardware; I loved the Apple that loved Free Software. Then it mutated into a company that crushes people&#8217;s freedoms while smiling. <\/span><\/p>\n<p><span class=\"d2edcug0 hpfvmrgz qv66sw1b c1et5uql gk29lw5a a8c37x1j keod5gw0 nxhoafnm aigsh9s9 d9wwppkn fe6kdd0r mau55g9w c8b282yb hrzyx87i jq4qci2q a3bd9o3v knj5qynh oo9gr5id\" dir=\"auto\">I like to have control of my hardware. I don&#8217;t want to use hardware that treats me like an enemy, as DRM-laden machines do, or hardware from a company that is actively trying to kill the idea of Free Software. <\/span><\/p>\n<p><span class=\"d2edcug0 hpfvmrgz qv66sw1b c1et5uql gk29lw5a a8c37x1j keod5gw0 nxhoafnm aigsh9s9 d9wwppkn fe6kdd0r mau55g9w c8b282yb hrzyx87i jq4qci2q a3bd9o3v knj5qynh oo9gr5id\" dir=\"auto\">So I have mixed reactions when reading about the Apple M1. I&#8217;m happy that they found a way to get such good performance, but I fear their &#8220;proprietary-ness&#8221;, their total closure toward other OSes. I would change my mind if Apple helped people port Linux, BSD and other free-as-in-freedom operating systems. 
I fear I will have to wait for a long time as they have been actively doing <strong>the opposite<\/strong> in recent years.<br \/>\n<\/span><\/p>\n<p><span class=\"d2edcug0 hpfvmrgz qv66sw1b c1et5uql gk29lw5a a8c37x1j keod5gw0 nxhoafnm aigsh9s9 d9wwppkn fe6kdd0r mau55g9w c8b282yb hrzyx87i jq4qci2q a3bd9o3v knj5qynh oo9gr5id\" dir=\"auto\">Today I read<br \/>\n<\/span><\/p>\n<blockquote>\n<h1><em><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\">Why Is Apple\u2019s M1 Chip So Fast?<\/a><\/em><\/h1>\n<p>Real-world experience with the new M1 Macs has started ticking in. They are fast. Real fast. But why? What is the magic?<\/p><\/blockquote>\n<p>There are a couple of passages that struck me:<\/p>\n<div class=\"ecm0bbzt e5nlhep0 a8c37x1j\">\n<div class=\"kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x c1et5uql\">\n<blockquote>\n<ul>\n<li dir=\"auto\">Instead of adding ever more general-purpose CPU cores, Apple has followed another strategy: They have started adding ever more specialized chips doing a few specialized tasks.<br \/>\n&#8230; and<\/li>\n<li dir=\"auto\"><span class=\"d2edcug0 hpfvmrgz qv66sw1b c1et5uql gk29lw5a a8c37x1j keod5gw0 nxhoafnm aigsh9s9 d9wwppkn fe6kdd0r mau55g9w c8b282yb hrzyx87i jq4qci2q a3bd9o3v knj5qynh oo9gr5id\" dir=\"auto\">&#8220;Unified Memory Architecture&#8221; <\/span><\/li>\n<\/ul>\n<\/blockquote>\n<\/div>\n<\/div>\n<div class=\"ecm0bbzt e5nlhep0 a8c37x1j\">\n<div class=\"kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x c1et5uql\">\n<div dir=\"auto\">These features sound A LOT LIKE the approach of the Amiga:<\/div>\n<ul>\n<li dir=\"auto\">Adding specialized chips: Paula, Agnus, Blitter&#8230; shall I go on?<\/li>\n<li dir=\"auto\">UMA, accessible to CPU, GPU and other specialized processing units. 
The very same concept as &#8220;Chip memory&#8221; on the Amiga.<\/li>\n<\/ul>\n<p>Now I <strong>do<\/strong> dearly hope that some company uses the same recipe, discovered by Amiga almost 40 years ago.<\/p>\n<p>I hope that the claims made about RISC-V in<\/p>\n<h1 class=\"articleHeader-title\"><a href=\"https:\/\/www.eetimes.com\/micro-magic-risc-v-core-claims-to-beat-apple-m1-and-arm-cortex-a9\/\">Micro Magic RISC-V Core Claims to Beat Apple M1 and Arm Cortex-A9<\/a><\/h1>\n<p>are realistic.<\/p>\n<p>We are living in interesting times.<\/p>\n<\/div>\n<\/div>\n<p><!--more--><!--nextpage--><\/p>\n<blockquote>\n<div>\n<h1 id=\"8c58\" class=\"gg gh gi gj b gk gl gm gn go gp gq gr gs gt gu gv gw gx gy gz ha hb hc hd he hf\"><em><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\">Why Is Apple\u2019s M1 Chip So Fast?<\/a><\/em><\/h1>\n<\/div>\n<h2 id=\"d808\" class=\"hg gh gi as b hh hi hj hk hl hm hn ho hp hq hr hs ht hu hv hw fm\">Real-world experience with the new M1 Macs has started ticking in. They are fast. Real fast. But why? 
What is the magic?<\/h2>\n<div>\n<div class=\"ae ia ib\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" class=\"s ii ib ia\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/01Y9ylHZ8csOxgZr7.jpg?resize=28%2C28&#038;ssl=1\" alt=\"Erik Engheim\" width=\"28\" height=\"28\" \/><\/div>\n<\/div>\n<div class=\"cx v n bv\">\n<div class=\"n\">\n<div>\n<h4 class=\"as b at au fn\">Erik Engheim<\/h4>\n<\/div>\n<\/div>\n<h4 class=\"as b at au fm\">Nov 28<span class=\"im\">\u00b7<\/span>21 min read<\/h4>\n<\/div>\n<section class=\"gb gc gd ck ge\">\n<div class=\"n p\">\n<div class=\"ag ah ai aj ak gf am v\">\n<figure class=\"jg jh ft fu paragraph-image\">\n<div class=\"ji jj ae jk v jl\" tabindex=\"0\" role=\"button\">\n<div class=\"ft fu jf\">\n<div class=\"jq s ae jr\">\n<div class=\"js jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><\/div>\n<p><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1thz7gKafCYxWoA0h8aGjcg.jpeg?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at 
au fm\" data-selectable-paragraph=\"\">Image: Apple<\/figcaption><\/figure>\n<p id=\"8eb8\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf kx\" data-selectable-paragraph=\"\"><span class=\"s ky kz la cy lb lc ld le lf ae\">On<\/span> YouTube, I watched a Mac user who had bought an iMac last year. It was maxed out with 40 GB of RAM, costing him about $4,000. He watched in disbelief as his hyperexpensive iMac was demolished by his new M1 Mac Mini, which he had paid a measly $700 for.<\/p>\n<p id=\"1408\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">In real-world test after test, the M1 Macs are not merely inching past top-of-the-line Intel Macs, they are destroying them. In disbelief, people have started asking how on earth this is possible.<\/p>\n<p id=\"71b6\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">If you are one of those people, you have come to the right place. Here I plan to break down into digestible pieces exactly what Apple has done with the M1. 
Specifically the questions I think a lot of people have are:<\/p>\n<ol class=\"\">\n<li id=\"89bf\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">What are the technical reasons this M1 chip is so fast?<\/li>\n<li id=\"c5cb\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">Has Apple made some really exotic technical choices to make this possible?<\/li>\n<li id=\"03e6\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">How easy will it be for the competition such as Intel and AMD to pull the same technical tricks?<\/li>\n<\/ol>\n<p id=\"6e93\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Sure you could try to Google this, but if you try to learn what Apple has done beyond the superficial explanations, you will quickly get buried in highly technical jargon such as M1 using very wide instruction decoders, enormous reorder buffer (ROB), etc. 
Unless you are a CPU hardware geek, a lot of this will simply be gobbledygook.<\/p>\n<p id=\"ecc7\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">To get the most out of this story I advise reading my earlier piece: \u201c<a class=\"cr lo\" href=\"https:\/\/medium.com\/swlh\/what-does-risc-and-cisc-mean-in-2020-7b4d42c9a9de\" rel=\"noopener\">What Does RISC and CISC mean in 2020<\/a>?\u201d There I explain what a microprocessor (CPU) is as well as various important concepts such as:<\/p>\n<ul class=\"\">\n<li id=\"4638\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Instruction set architecture (ISA)<\/li>\n<li id=\"e7aa\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Pipelining<\/li>\n<li id=\"69ae\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Load\/store architecture<\/li>\n<li id=\"f37f\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Microcode vs. 
micro-operations<\/li>\n<\/ul>\n<p id=\"85a3\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">But if you are impatient, I will do a quick version of the material you need to understand to grasp my explanation of the M1 chip.<\/p>\n<h1 id=\"a207\" class=\"lq lr gi as ls lt lu kf lv lw lx ki ly lz ma mb mc md me mf mg mh mi mj mk ml hf\" data-selectable-paragraph=\"\">What is a microprocessor (CPU)?<\/h1>\n<p id=\"285c\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">Normally when speaking of chips from Intel and AMD we talk about central processing units (CPUs) or microprocessors. As you can read more about in my <a class=\"cr lo\" href=\"https:\/\/medium.com\/swlh\/what-does-risc-and-cisc-mean-in-2020-7b4d42c9a9de\" rel=\"noopener\">RISC vs. CISC story<\/a>, these pull in instructions from memory. Then each instruction is typically carried out in sequence.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"n p ab\"><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/13G7uz4l1GnFacxz6InrLLw.png?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">A very basic RISC CPU, not the M1. Instructions are moved from memory along <strong class=\"as ls\">blue<\/strong> arrows into the instruction register. There a decoder figures out what the instruction is and enables different parts of the CPU through the <strong class=\"as ls\">red<\/strong> control lines. 
The ALU adds and subtracts numbers placed in the registers.<\/figcaption><\/figure>\n<p id=\"5525\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">A CPU at its most basic level is a device with a number of named memory cells called registers and a number of computational units called arithmetic logic units (ALU). The ALUs perform things like addition, subtraction, and other basic math operations. However, these are only connected to the CPU registers. If you want to add up two numbers, you have to get those two numbers from memory and into two registers in the CPU.<\/p>\n<p id=\"db6e\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Here are some examples of typical instructions that a RISC CPU as found on the M1 carries out.<\/p>\n<pre class=\"mr ms mt mu mv mx my mz\"><span id=\"9500\" class=\"hf na lr gi nb b cd nc nd s ne\" data-selectable-paragraph=\"\">load r1, 150\nload r2, 200\nadd  r1, r2\nstore r1, 310<\/span><\/pre>\n<p id=\"d5bb\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Here <code class=\"\" data-line=\"\">r1<\/code> and <code class=\"\" data-line=\"\">r2<\/code> are the registers I talked about. Modern RISC CPUs cannot do operations on numbers that are not in a register like this. For example, it cannot add two numbers residing in RAM in two different locations. Instead, it has to pull these two numbers into a separate register. That is what we do in this simple example. We pull in the number at memory location 150 in the RAM and put it into register <code class=\"\" data-line=\"\">r1<\/code> in the CPU. Next, we put the contents of address 200 into register <code class=\"\" data-line=\"\">r2<\/code>. 
Only then can the numbers be added with the <code class=\"\" data-line=\"\">add r1, r2<\/code> instruction.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu ni\">\n<div class=\"jq s ae jr\">\n<div class=\"nj jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1kDqSqtumOzNFZdpixUW0IQ.jpeg?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">An old mechanical calculator with two registers: the accumulator and input register. Modern CPUs typically have more than a dozen registers, and they are electronic rather than mechanical.<\/figcaption><\/figure>\n<p id=\"7146\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">The concept of registers is old. For example, on this old mechanical calculator, the <em class=\"nk\">register<\/em> is what holds the numbers you are adding. Likely the origin of the term <em class=\"nk\">cash register<\/em>. The register is where you registered input numbers.<\/p>\n<h1 id=\"22f2\" class=\"lq lr gi as ls lt lu kf lv lw lx ki ly lz ma mb mc md me mf mg mh mi mj mk ml hf\" data-selectable-paragraph=\"\">The M1 is not a CPU!<\/h1>\n<p id=\"d450\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">But here is a very important thing to understand about the M1:<\/p>\n<p id=\"777f\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">The M1 is not a CPU, it is a whole system of multiple chips put into one large silicon package. 
The CPU is just one of these chips.<\/p>\n<p id=\"69f0\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Basically, the M1 is one whole computer on a chip. The M1 contains a CPU, graphics processing unit (GPU), memory, input and output controllers, and many more things making up a whole computer. This is what we call a system on a chip (SoC).<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ji jj ae jk v jl\" tabindex=\"0\" role=\"button\">\n<div class=\"ft fu nl\">\n<div class=\"jq s ae jr\">\n<div class=\"nm jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1c4EYUAVj4k7n6wWLoWUVdA.png?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">The M1 is a system on a chip, meaning all the parts making up a computer are placed on one silicon chip.<\/figcaption><\/figure>\n<p id=\"fce1\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Today if you buy a chip \u2014 whether from Intel or AMD \u2014 you actually get what amounts to <em class=\"nk\">multiple<\/em> microprocessors in one package. 
In the past computers would have multiple physically separate chips on the motherboard of the computer.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu nn\">\n<div class=\"jq s ae jr\">\n<div class=\"no jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1GBrG4D2YCEVYREXAHIiJnQ.png?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">Example of a computer motherboard. Memory, CPU, graphics cards, IO controllers, network cards, and many other components can be attached to the motherboard to communicate with each other.<\/figcaption><\/figure>\n<p id=\"9cf5\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">However because we are able to put so many transistors on a silicon die today, companies such as Intel and AMD began putting multiple microprocessors onto one chip. Today we refer to these chips as CPU cores. 
One core is basically a full independent chip that can read instructions from memory and perform calculations.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu np\">\n<div class=\"jq s ae jr\">\n<div class=\"nq jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><\/div>\n<p><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1XZsBJV_v4WybnUFYjJJiPQ.gif?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">A microchip with multiple CPU cores.<\/figcaption><\/figure>\n<p id=\"18a4\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">This has for a long time been the name of the game in terms of increasing performance: Just add more general-purpose CPU cores. But there is a disturbance in the force. There is one player in the CPU market which is deviating from this trend.<\/p>\n<h2 id=\"cff4\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">Apple\u2019s not so secret heterogeneous computing strategy<\/h2>\n<p id=\"8feb\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">Instead of adding ever more general-purpose CPU cores, Apple has followed another strategy: They have started adding ever more specialized chips doing a few specialized tasks. 
The benefit of this is that specialized chips tend to be able to perform their tasks significantly faster using much less electric current than a general-purpose CPU core.<\/p>\n<p id=\"e379\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">This is not entirely new knowledge. For many years already specialized chips such as the graphical processing units (GPUs) have been sitting in Nvidia and AMD graphics cards performing operations related to graphics much faster than general-purpose CPUs.<\/p>\n<p id=\"866e\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">What Apple has done is simply to take a more radical shift toward this direction. Rather than just having general-purpose cores and memory, the M1 contains a wide variety of specialized chips:<\/p>\n<ul class=\"\">\n<li id=\"6e0e\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Central processing unit (CPU) \u2014 the \u201cbrains\u201d of the SoC. 
Runs most of the code of the operating system and your apps.<\/li>\n<li id=\"037c\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Graphics processing unit (GPU) \u2014 handles graphics-related tasks, such as visualizing an app\u2019s user interface and 2D\/3D gaming.<\/li>\n<li id=\"9a44\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Image signal processor (ISP) \u2014 can be used to speed up common tasks done by image processing applications.<\/li>\n<li id=\"c654\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Digital signal processor (DSP) \u2014 handles more mathematically intensive functions than a CPU. Includes decompressing music files.<\/li>\n<li id=\"1cac\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Neural processing unit (NPU) \u2014 used in high-end smartphones to accelerate machine learning (A.I.) tasks. 
These include voice recognition and camera processing.<\/li>\n<li id=\"ce8d\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Video encoder\/decoder \u2014 handles the power-efficient conversion of video files and formats.<\/li>\n<li id=\"d9a8\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Secure Enclave \u2014 encryption, authentication, and security.<\/li>\n<li id=\"2d2b\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Unified memory \u2014 allows the CPU, GPU, and other cores to quickly exchange information.<\/li>\n<\/ul>\n<p id=\"967a\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">This is part of the reason why a lot of people working on images and video editing with the M1 Macs are seeing such speed improvements. A lot of the tasks they do can run directly on specialized hardware. 
That is what allows a cheap M1 Mac Mini to encode a large video file without breaking a sweat while an expensive iMac has all its fans going full blast and still cannot keep up.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu nz\">\n<div class=\"jq s ae jr\">\n<div class=\"js jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><\/div>\n<p><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1SpFJ9g-taH9a-TvXdnaq2w.png?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">In <strong class=\"as ls\">blue<\/strong> you see multiple CPU cores accessing memory, and in <strong class=\"as ls\">green<\/strong> you see large numbers of GPU cores accessing memory.<\/figcaption><\/figure>\n<h2 id=\"d906\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">What is Special About Apple\u2019s Unified Memory Architecture?<\/h2>\n<p id=\"956a\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">Apple\u2019s \u201cUnified Memory Architecture\u201d (UMA) is a bit tricky to wrap your head around (I got it wrong first time I wrote it down here).<\/p>\n<p id=\"8cb3\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">To explain why, we need to take a few steps back.<\/p>\n<p id=\"b7a7\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">For a long time cheap computer systems have had the CPU and GPU integrated into the same chip (same silicon die). 
These have been famously slow. In the past saying \u201cintegrated graphics\u201d was essentially the same as saying \u201cslow graphics.\u201d<\/p>\n<p id=\"a598\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">These were slow for several reasons:<\/p>\n<p id=\"1d41\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Separate areas of the shared memory got reserved for the CPU and GPU. If the CPU had a chunk of data it wanted the GPU to use, it couldn\u2019t say \u201chere have some of my memory.\u201d No, the CPU had to explicitly copy the whole chunk of data over to the memory area controlled by the GPU.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu oa\">\n<div class=\"jq s ae jr\">\n<div class=\"ob jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><\/div>\n<p><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/17c67HhuDWVFM3Pi-mO5boA.jpeg?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">CPUs don\u2019t need a lot of data served, but they want it <strong class=\"as ls\">fast<\/strong>.<\/figcaption><\/figure>\n<p id=\"f669\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">CPUs and GPUs don\u2019t want their memory served the same way. Let us do a silly food analogy: CPUs want their plate of data served very quickly by the waiter, but they are totally cool with small portion sizes. 
Imagine a fancy French restaurant with waiters on rollerblades to serve you really quickly.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu oc\">\n<div class=\"jq s ae jr\">\n<div class=\"od jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><\/div>\n<p><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1ethYaJsPETw2zxF7Xm0Z0w.jpeg?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">This is how your GPU wants its memory: <strong class=\"as ls\">huge<\/strong> portions. The more the merrier.<\/figcaption><\/figure>\n<p id=\"9c14\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">GPUs in contrast are cool with the waiter being slow to serve the data. But the GPUs want enormous servings. They gobble massive amounts of data because they are massively parallel machines that can chew through lots of data in parallel. Imagine an American junk food place, where the food takes some time to arrive because they are pushing a whole trolley of food to your seating area.<\/p>\n<p id=\"812f\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">With such different needs, putting CPUs and GPUs on the same physical chip was not a great idea. The GPUs would sit there starving while given small French servings. The result was that there was no point in putting powerful GPUs on an SoC. 
The tiny portions of data served up could easily be chewed up by a weak little GPU.<\/p>\n<p id=\"d539\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">The second problem was that large GPUs produce a lot of heat, and thus you cannot integrate them with the CPU without running into problems getting rid of that heat. Thus discrete graphics cards tend to look like the one below: Large beasts with massive cooling fans. They have special dedicated memory designed to serve the greedy cards massive amounts of data.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu oe\">\n<div class=\"jq s ae jr\">\n<div class=\"of jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><\/div>\n<p><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1gQcbAlaUXNtjJi6OAUzMsg.jpeg?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">GeForce RTX 3080<\/figcaption><\/figure>\n<p id=\"14fa\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">That is why these cards have high performance. But they have an Achilles\u2019 heel: Whenever they have to get data from the memory used by the CPU, this happens over a set of copper traces on the computer motherboard called a PCIe bus. Try chugging water through a super thin straw. 
It may get to your mouth fast, but the throughput is totally inadequate.<\/p>\n<p id=\"5fca\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Apple\u2019s <em class=\"nk\">Unified Memory Architecture<\/em> tries to solve all these problems without having the disadvantages of old-school shared memory. They achieve this in the following ways:<\/p>\n<ol class=\"\">\n<li id=\"dade\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">There is no special area reserved just for the CPU or just the GPU. Memory is allocated to both processors. They can both use the same memory. No copying is needed.<\/li>\n<li id=\"4c61\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">Apple uses memory that serves large chunks of data and serves them fast. In computer speak that is called low latency and high throughput. Thus the need to be connected to separate types of memory is removed.<\/li>\n<li id=\"2508\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">Apple has gotten the power consumption of the GPU down, so that a relatively powerful GPU can be integrated without overheating the SoC.<\/li>\n<\/ol>\n<p id=\"ba08\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Some will say unified memory is not entirely new. It is true that different systems have had it in the past. But then the difference in memory requirements may not have been as large. Secondly, what Nvidia calls Unified Memory is not really the same thing. In the Nvidia world Unified Memory simply means that there is software and hardware which takes care of automatically copying data back and forth between the separate CPU and GPU memory. 
Thus, from a programmer\u2019s perspective, Apple and Nvidia Unified Memory may look the same, but they are not the same in a physical sense.<\/p>\n<p id=\"2fcb\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">There is of course a tradeoff in this strategy. Getting this high-bandwidth memory (big servings) requires full integration, which means you take away the opportunity for customers to upgrade their memory. But Apple seeks to minimize this problem by making the communication with the SSDs so fast that they essentially work like old-fashioned memory.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ji jj ae jk v jl\" tabindex=\"0\" role=\"button\">\n<div class=\"ft fu og\">\n<div class=\"jq s ae jr\">\n<div class=\"oh jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1IJFHpc1CrblUt09PSzaTyg.png?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">How Macs used GPUs before unified memory. There was even an option of having graphics cards outside the computer using a Thunderbolt 3 cable. 
There is some speculation that this may still be possible in the future.<\/figcaption><\/figure>\n<h2 id=\"d329\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">If SoCs Are So Smart, Why Don\u2019t Intel and AMD Copy This Strategy?<\/h2>\n<p id=\"f5e0\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">If what Apple is doing is so smart, why is not everybody doing it? To some extent they are. Other ARM chip makers are increasingly putting in specialized hardware.<\/p>\n<p id=\"2bdd\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">AMD has also started putting stronger GPUs on some of their chips and moving gradually toward some form of SoC with the accelerated processing units (APU) which are basically CPU cores and GPU cores placed on the same silicon die.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ji jj ae jk v jl\" tabindex=\"0\" role=\"button\">\n<div class=\"ft fu oi\">\n<div class=\"jq s ae jr\">\n<div class=\"oj jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1vutPH0zsrwSzGgFWomL61A.jpeg?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">AMD Ryzen Accelerated Processing Unit (APU) which combines CPU and GPU (Radeon Vega) on one silicon chip. 
Does however not contain other co-processors, IO-controllers, or unified memory.<\/figcaption><\/figure>\n<p id=\"279e\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Yet there are important reasons why they cannot do this. An SoC is essentially a whole computer on a chip. That makes it a more natural fit for an actual computer-maker, such as HP and Dell. Let me clarify with a silly car analogy: If your business model is to build and sell car engines, it would be an unusual leap to begin manufacturing and selling whole cars.<\/p>\n<p id=\"04e1\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">For ARM, in contrast, this isn\u2019t an issue. Computer makers such as Dell or HP could simply license ARM intellectual property and buy IP for other chips, to add whatever specialized hardware they think their SoC should have. Next, they ship the finished design over to a semiconductor foundry such as <a class=\"cr lo\" href=\"https:\/\/en.wikipedia.org\/wiki\/GlobalFoundries\" rel=\"noopener nofollow\">GlobalFoundries<\/a> or <a class=\"cr lo\" href=\"https:\/\/www.tsmc.com\/english\" rel=\"noopener nofollow\">TSMC<\/a>, which manufactures chips for AMD and Apple today.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ji jj ae jk v jl\" tabindex=\"0\" role=\"button\">\n<div class=\"ft fu ok\">\n<div class=\"jq s ae jr\">\n<div class=\"ol jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1d88lw8YMonKgdDWQLC1jXQ.jpeg?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at 
au fm\" data-selectable-paragraph=\"\">TSMC semiconductor foundry in Taiwan. TSMC manufactures chips for other companies such as AMD, Apple, Nvidia, and Qualcomm.<\/figcaption><\/figure>\n<p id=\"a601\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Here we get a big problem with the Intel and AMD business model. Their business models are based on selling general-purpose CPUs, which people just slot onto a large PC motherboard. Thus computer-makers can simply buy motherboards, memory, CPUs, and graphics cards from different vendors and integrate them into one solution.<\/p>\n<p id=\"7f77\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">But we are quickly moving away from that world. In the new SoC world, you don\u2019t assemble physical components from different vendors. Instead, you assemble IP (intellectual property) from different vendors. You buy the designs for graphics cards, CPUs, modems, IO controllers, and other things from different vendors and use them to design an SoC in-house. Then you get a foundry to manufacture this.<\/p>\n<p id=\"2fba\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Now you have a big problem, because neither Intel, AMD, nor Nvidia is going to license their intellectual property to Dell or HP for them to make an SoC for their machines.<\/p>\n<p id=\"b24c\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Sure, Intel and AMD may simply begin to sell whole finished SoCs. But what are these to contain? PC-makers may have different ideas of what they should contain. 
You potentially get a conflict between Intel, AMD, Microsoft, and PC-makers about what sort of specialized chips should be included because these will need software support.<\/p>\n<p id=\"558d\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">For Apple this is simple. They control the whole widget. They give you, for example, the Core ML library for developers to write <a class=\"cr lo\" href=\"https:\/\/developer.apple.com\/machine-learning\/\" rel=\"noopener nofollow\">machine learning<\/a> stuff. Whether Core ML runs on Apple\u2019s CPU or the Neural Engine is an implementation detail developers don\u2019t have to care about.<\/p>\n<h2 id=\"711f\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">The fundamental challenge of making any CPU run fast<\/h2>\n<p id=\"f11c\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">So heterogeneous computing is part of the reason but not the sole reason. The fast general-purpose CPU cores on the M1, called Firestorm, are genuinely fast. This is a major deviation from ARM CPU cores in the past which tended to be very weak compared to AMD and Intel cores.<\/p>\n<p id=\"b74c\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Firestorm, in contrast, beats most Intel cores and almost beats the fastest AMD Ryzen cores. 
Conventional wisdom said that was not going to happen.<\/p>\n<p id=\"ab49\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Before talking about what makes Firestorm fast, it helps to understand what making a fast CPU is really about.<\/p>\n<p id=\"92fc\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">In principle, you accomplish it through a combination of two strategies:<\/p>\n<ol class=\"\">\n<li id=\"68be\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">Perform the instructions in a sequence faster.<\/li>\n<li id=\"5f98\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">Perform lots of instructions in parallel.<\/li>\n<\/ol>\n<p id=\"bd92\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Back in the \u201980s, it was easy. Just increase the clock frequency and the instructions would finish faster. On every clock cycle the computer does something. But this <em class=\"nk\">something<\/em> can be quite small. Thus an instruction may require multiple clock cycles to finish because it is made up of several smaller tasks.<\/p>\n<p id=\"905f\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">However, today increasing the clock frequency is next to impossible. 
That is the whole \u201cEnd of Moore\u2019s Law\u201d that people have been harping on for over a decade now.<\/p>\n<p id=\"2133\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Thus it is really about executing as many instructions as possible in parallel.<\/p>\n<h2 id=\"1f1b\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">Multi-core or Out-of-Order processors?<\/h2>\n<p id=\"9966\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">There are two approaches to this. One is to add more CPU cores. From the point of view of a software developer, it is like adding <em class=\"nk\">threads<\/em>. Every CPU core is like a hardware thread. If you don\u2019t know what a thread is, then you can think of it as the process of carrying out a task. With two cores, a CPU can carry out two separate tasks concurrently: two threads. The tasks could be described as two separate programs stored in memory, or they could actually be the same program run twice. Each thread needs some bookkeeping, such as <em class=\"nk\">where<\/em> in a sequence of program instructions the thread currently is. Each thread may store temporary results which should be kept separate.<\/p>\n<p id=\"8059\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">In principle, a processor can have just one core and run multiple threads. In this case, it simply halts one thread and stores its current progress before switching to another. Later it switches back. This doesn\u2019t bring much of a performance enhancement and is only used when a thread may frequently halt to wait for input from the user, data from a slow network connection, etc. These may be called software threads. 
Hardware threads mean you have actual extra physical hardware, such as extra cores, at your disposal to speed things up.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu nn\">\n<div class=\"jq s ae jr\">\n<div class=\"om jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/12mDUgCX9a49EldiCTbL6fA.png?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/figure>\n<p id=\"3d7f\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">The problem is that the developer has to write code to take advantage of this. Some tasks, such as server software, are easy to write like this. You can imagine processing each connecting user separately. 
These tasks are so independent from each other that having lots of cores is an excellent choice for servers especially cloud-based services.<\/p>\n<figure class=\"mr ms mt mu mv jh ft fu paragraph-image\">\n<div class=\"ft fu on\">\n<div class=\"jq s ae jr\">\n<div class=\"oo jt s\">\n<div class=\"ep jm fw fj fg jn v el jo jp\"><\/div>\n<p><a href=\"https:\/\/debugger.medium.com\/why-is-apples-m1-chip-so-fast-3262b158cba2\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/1X5ZTLTHUYBdw0t25TemHcQ.png?w=910&#038;ssl=1\" alt=\"\" \/><\/a><\/div>\n<\/div>\n<\/div><figcaption class=\"jx jy fv ft fu jz ka as b at au fm\" data-selectable-paragraph=\"\">The Ampere Altra Max ARM CPU with 128 cores designed for cloud computing, where a lot of hardware threads is a benefit.<\/figcaption><\/figure>\n<p id=\"f30f\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">That is the reason why you see ARM CPU-makers such as Ampere making CPUs such as the <a class=\"cr lo\" href=\"https:\/\/www.networkworld.com\/article\/3564514\/ampere-announces-128-core-arm-server-processor.html\" rel=\"noopener nofollow\">Altra Max<\/a> which has a crazy 128 cores. This chip is specifically made for the cloud. You don\u2019t need crazy single-core performance because in the cloud it is all about having as many threads as possible per watt to handle as many concurrent users as possible.<\/p>\n<p id=\"6ead\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Apple, in contrast, is on the complete opposite end of the spectrum. Apple makes single-user devices. Lots of threads is not an advantage. Their devices are used for gaming, video editing, development, etc. 
They want desktops with beautiful responsive graphics and animations.<\/p>\n<p id=\"e3ca\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Desktop software is generally not made to utilize lots of cores. For example, computer games will likely benefit from eight cores, but something like 128 cores would be a total waste. Instead, you would want fewer but more powerful cores.<\/p>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p by op oq or\" role=\"separator\"><\/div>\n<section class=\"gb gc gd ck ge\">\n<div class=\"n p\">\n<div class=\"ag ah ai aj ak gf am v\">\n<p id=\"156c\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">So here is the interesting thing, <a class=\"cr lo\" href=\"https:\/\/en.wikipedia.org\/wiki\/Out-of-order_execution\" rel=\"noopener nofollow\">Out-of-Order execution<\/a> is a way to execute more instructions in parallel but without exposing that capability as multiple threads. Developers don\u2019t have to code their software specifically to take advantage of it. Seen from the developer\u2019s perspective it just looks like each core runs faster.<\/p>\n<p id=\"2402\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">To understand how this works, you need to understand some things about memory. Asking for data in one particular memory location is slow. But there is no difference in the delay getting 1 byte compared to getting say 128 bytes. Data is sent across what we call a databus. You can think of it as a road or pipe between memory and different parts of the CPU where data gets pushed through. In reality, it is of course just some copper tracks conducting electricity. 
If the databus is wide enough you can just get multiple bytes at the same time.<\/p>\n<p id=\"2053\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Thus CPUs get a whole chunk of instructions at a time to execute. But they are written to be executed one after the other. Modern microprocessors do what we call Out-of-Order (OoO) execution.<\/p>\n<p id=\"de65\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">That means they are able to analyze a buffer of instructions quickly and see which ones depend on which. Look at the simple example below:<\/p>\n<pre class=\"mr ms mt mu mv mx my mz\"><span id=\"4de2\" class=\"hf na lr gi nb b cd nc nd s ne\" data-selectable-paragraph=\"\">01: mul r1, r2, r3    \/\/ r1 \u2190 r2 \u00d7 r3\n02: add r4, r1, 5     \/\/ r4 \u2190 r1 + 5\n03: add r6, r2, 1     \/\/ r6 \u2190 r2 + 1<\/span><\/pre>\n<p id=\"d5a8\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Multiplication tends to be a slow process. So say it takes multiple clock cycles to perform. The second instruction will simply have to wait because its calculation depends on knowing the result that gets put into the <code class=\"\" data-line=\"\">r1<\/code> register.<\/p>\n<p id=\"d3b8\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">However, the third instruction at line <code class=\"\" data-line=\"\">03<\/code> doesn\u2019t depend on calculations from previous instructions. Hence an Out-of-Order processor can begin calculating this instruction in parallel.<\/p>\n<p id=\"deba\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">However more realistically we are talking about hundreds of instructions. 
The CPU is able to figure out all the dependencies between these instructions.<\/p>\n<p id=\"b30b\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">It analyses the instructions by looking at the inputs to each instruction. Do the inputs depend on output from one or more other instructions? By input and output, we mean registers containing results from previous calculations.<\/p>\n<p id=\"d5b1\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">For example, the <code class=\"\" data-line=\"\">add r4, r1, 5<\/code> instruction depends on input from <code class=\"\" data-line=\"\">r1<\/code> which is produced by <code class=\"\" data-line=\"\">mul r1, r2, r3<\/code> . We can chain together these relationships into long elaborate graphs that the CPU can work through. The nodes are the instructions and the edges are the registers connecting them.<\/p>\n<p id=\"621d\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">The CPU can analyze such a graph of nodes and determine which instructions it can perform in parallel and where it needs to wait for the results from multiple dependent calculations before carrying on.<\/p>\n<p id=\"46ee\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Many instructions will finish early but we cannot make their results official. We cannot commit them; otherwise, we supply the result in the wrong order. 
To the rest of the world, it has to look as if the instructions were carried out in the same sequence as they were issued.<\/p>\n<p id=\"8815\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Like a stack, the CPU will keep popping done instructions from the top, until hitting an instruction that is not done.<\/p>\n<p id=\"0770\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">We are not quite done with this explanation, but this gives you a bit of a clue. Basically, you can have parallelism that the programmer must know or the kind which the CPU fakes to look as if everything is a single thread. However, behind the scenes, it is doing Out-of-Order black magic.<\/p>\n<p id=\"01ee\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">It is the superior Out-of-Order execution that is making the Firestorm cores on the M1 kick ass and take names. It is in fact much stronger than anything from Intel or AMD. Likely stronger than anybody else in the mainstream market.<\/p>\n<h2 id=\"52fe\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">Why is AMD and Intel Out-of-Order execution inferior to M1?<\/h2>\n<p id=\"c057\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">In my explanation of Out-of-Order execution (OoO) I skipped some important details, which need to be covered. 
Otherwise, it is not possible to understand why Apple is ahead of the game and Intel and AMD may not be able to catch up.<\/p>\n<p id=\"700a\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">The big \u201cscratchpad\u201d I talked about is actually called the <em class=\"nk\">Reorder Buffer<\/em> (ROB), and it doesn\u2019t contain normal machine code instructions. Not the ones that the CPU fetches from memory to execute. These are the instructions in the CPU Instruction Set Architecture (ISA). That is the kind of instructions that we call x86, ARM, PowerPC, etc.<\/p>\n<p id=\"6255\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">However internally the CPU works on an entirely different instruction set invisible to the programmer. We call these micro-operations (micro-ops or \u03bcops). The ROB is full of these micro-ops.<\/p>\n<p id=\"c1bc\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">These are much more practical to work with for all the magic a CPU does to make stuff run in parallel. The reason is that micro-ops are very wide (contain a lot of bits) and can contain all sorts of meta-information. 
You cannot add that kind of information to an ARM or x86 instruction as it would:<\/p>\n<ol class=\"\">\n<li id=\"91c4\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">Totally bloat the program binaries.<\/li>\n<li id=\"255c\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">Expose details about <em class=\"nk\">how<\/em> the CPU works, whether it has an OoO unit, has register renaming, and many other details.<\/li>\n<li id=\"1ab5\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lg lh li hf\" data-selectable-paragraph=\"\">A lot of the meta-information only makes sense in the context of our current execution.<\/li>\n<\/ol>\n<p id=\"e439\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">You can think of this as when writing a program. You have a public API that needs to be stable and everybody uses. That is the ARM, x86, PowerPC, MIPS, etc. instruction sets. The micro-ops are basically the private APIs that are used to implement the public ones.<\/p>\n<p id=\"3180\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Also, micro-ops are usually easier to work with for the CPU. Why? Because they each do one simple limited task. 
Regular ISA instructions can be more complex, causing a bunch of stuff to happen, and thus actually translate to multiple micro-ops.<\/p>\n<p id=\"ff1f\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">For CISC CPUs there is usually no alternative but to use micro-ops; otherwise the large, complex CISC instructions would make pipelines and OoO next to impossible to achieve.<\/p>\n<p id=\"9da7\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">RISC CPUs have a choice. So, for example, smaller ARM CPUs don\u2019t use micro-ops at all. But that also means they cannot do things such as OoO.<\/p>\n<\/div>\n<\/div>\n<\/section>\n<div class=\"n p by op oq or\" role=\"separator\"><\/div>\n<section class=\"gb gc gd ck ge\">\n<div class=\"n p\">\n<div class=\"ag ah ai aj ak gf am v\">\n<p id=\"cd20\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">But you may wonder: why does any of this matter? Why is this detail important for understanding why Apple has the upper hand over AMD and Intel?<\/p>\n<p id=\"8286\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">It is because the ability to run fast depends on how quickly you can fill up the ROB with micro-ops and with how many. The more quickly you fill it up and the larger it is, the more opportunities you are given to pick instructions you can execute in parallel and thus improve performance.<\/p>\n<p id=\"d0d8\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Machine code instructions are chopped into micro-ops by what we call an instruction decoder. 
If we have more decoders we can chop up more instructions in parallel and thus fill up the ROB faster.<\/p>\n<p id=\"8622\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">And this is where we see the huge differences. The biggest, baddest Intel and AMD microprocessor cores have four decoders, which means they can decode four instructions in parallel spitting out micro-ops.<\/p>\n<p id=\"2515\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">But Apple has a crazy eight decoders. Not only that but the ROB is something like three times larger. You can basically hold three times as many instructions. No other mainstream chipmaker has that many decoders in their CPUs.<\/p>\n<h2 id=\"59ff\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">Why can\u2019t Intel and AMD add more instruction decoders?<\/h2>\n<p id=\"4735\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">This is where we finally see the revenge of RISC, and where the fact that the M1 Firestorm core has an ARM RISC architecture begins to matter.<\/p>\n<p id=\"fca4\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">You see, for x86 an instruction can be anywhere from 1\u201315 bytes long. On a RISC chip instructions are fixed size. 
Why is that relevant in this case?<\/p>\n<p id=\"d102\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Because splitting up a stream of bytes into instructions to feed into eight different decoders in parallel becomes trivial if every instruction has the same length.<\/p>\n<p id=\"b4e4\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">However, on an x86 CPU, the decoders have no clue <em class=\"nk\">where<\/em> the next instruction starts. They have to actually analyze each instruction in order to see how long it is.<\/p>\n<p id=\"e842\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">The brute-force way Intel and AMD deal with this is by simply attempting to decode instructions at every possible starting point. That means we have to deal with lots of wrong guesses and mistakes, which have to be discarded. This creates such a convoluted and complicated decoder stage that it is really hard to add more decoders. 
But for Apple, it is trivial in comparison to keep adding more.<\/p>\n<p id=\"499d\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">In fact, adding more causes so many other problems that four decoders according to AMD itself is basically an upper limit for how far they can go.<\/p>\n<p id=\"bb2e\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">This is what allows the M1 Firestorm cores to essentially process twice as many instructions as AMD and Intel CPUs at the same clock frequency.<\/p>\n<p id=\"327e\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">One could argue as a counterpoint that CISC instructions turn into more micro-ops, that they are denser so that, for example, decoding one x86 instruction is more similar to decoding say two ARM instructions.<\/p>\n<p id=\"b2f1\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">Except this is not the case in the real world. Highly optimized x86 code rarely uses complex CISC instructions. In some regards, it has a RISC flavor.<\/p>\n<p id=\"8792\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">But that doesn\u2019t help Intel or AMD, because even if those 15 byte long instructions are rare, the decoders have to be made to handle them. 
This incurs complexity that blocks AMD and Intel from adding more decoders.<\/p>\n<h2 id=\"26e4\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">But AMD\u2019s Zen3 cores are still faster, right?<\/h2>\n<p id=\"7106\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">As far as I remember from performance benchmarks, the newest AMD CPU cores, the ones called Zen3, are slightly faster than Firestorm cores. But here is the kicker: that only happens because the Zen3 cores are clocked at 5 GHz. Firestorm cores are clocked at 3.2 GHz. The Zen3 is just barely squeezing past Firestorm despite having an almost 60% higher clock frequency.<\/p>\n<p id=\"da9f\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">So why doesn\u2019t Apple increase the clock frequency too? Because higher clock frequency makes the chips hotter. Running cool is one of Apple\u2019s key selling points. Their computers, unlike Intel and AMD offerings, barely need cooling.<\/p>\n<p id=\"f640\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">In essence, one could say Firestorm cores really are superior to Zen3 cores. Zen3 only manages to stay in the game by drawing a lot more current and getting a lot hotter, something Apple simply chooses not to do.<\/p>\n<p id=\"6302\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">If Apple wants higher performance they are simply going to add more cores. 
That lets them keep watt usage down while offering more performance.<\/p>\n<h2 id=\"e95d\" class=\"na lr gi as ls nr ns hj lv nt nu hm ly hn nv hp mc hq nw hs mg ht nx hv mk ny hf\" data-selectable-paragraph=\"\">The future<\/h2>\n<p id=\"9450\" class=\"kb kc gi kd b hh mm kf kg hk mn ki kj kk mo km kn ko mp kq kr ks mq ku kv kw gb hf\" data-selectable-paragraph=\"\">It seems AMD and Intel have painted themselves into a corner on two fronts:<\/p>\n<ul class=\"\">\n<li id=\"9af9\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">They don\u2019t have a business model that makes it easy to pursue heterogeneous computing and SoC designs.<\/li>\n<li id=\"7c75\" class=\"kb kc gi kd b hh lj kf kg hk lk ki kj kk ll km kn ko lm kq kr ks ln ku kv kw lp lh li hf\" data-selectable-paragraph=\"\">Their legacy x86 CISC instruction set is coming back to haunt them, making it hard to improve OoO performance.<\/li>\n<\/ul>\n<p id=\"0ff6\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">It doesn\u2019t mean game over. They can of course simply clock up more, use more cooling, throw in more cores, beef up the CPU caches, etc. But they are both at a disadvantage. Intel is in the worst situation, as their cores are already soundly beaten by Firestorm, and they have weak GPUs to integrate with an SoC solution.<\/p>\n<p id=\"cd04\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">The problem with throwing in more cores is that for typical desktop workloads you reach diminishing returns with too many cores. 
Sure, lots of cores are great for servers.<\/p>\n<p id=\"2a88\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">However, here companies such as Amazon and Ampere are attacking with monster CPUs with 128 cores. This is like fighting the western and eastern front at the same time.<\/p>\n<p id=\"7fce\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">But fortunately for AMD and Intel, Apple doesn\u2019t sell their chips on the market. So PC users will simply have to put up with whatever they are offering. PC users may jump ship, but that is a slow process. You don\u2019t immediately leave a platform you are heavily invested in.<\/p>\n<p id=\"731a\" class=\"kb kc gi kd b hh ke kf kg hk kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw gb hf\" data-selectable-paragraph=\"\">But young professionals, with money to burn without too deep investments in any platform, may increasingly turn to Apple in the future, beefing up their hold on the premium market and consequently their share of the total profit in the PC market.<\/p>\n<\/div>\n<\/div>\n<\/section>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p class=\"excerpt\">I loved Apple. Not the Apple of DRMs and its golden prison where you can&#8217;t really control\u00a0your hardware; I loved the Apple that loved Software Libero. Then it mutated into a company that crushes people freedoms while smiling. I like to have control of my hardware. 
I don&#8217;t want to use hardware that treats me&hellip;<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"https:\/\/monodes.com\/predaelli\/2020\/12\/08\/is-apple-m1-the-new-amiga\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[12,80],"tags":[],"class_list":["post-7813","post","type-post","status-publish","format-standard","hentry","category-amiga","category-hardware"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6daft-221","jetpack-related-posts":[{"id":7858,"url":"https:\/\/monodes.com\/predaelli\/2020\/12\/14\/subtle-hostility\/","url_meta":{"origin":7813,"position":0},"title":"Subtle hostility","author":"Paolo Redaelli","date":"2020-12-14","format":false,"excerpt":"nilay patel @reckless 26 nov \u00a0 I haven't plugged the M1 MacBook Pro review unit in for three days. Have been using on and off this evening. Battery: 60 percent M.G. 
Siegler@mgsiegler\u00a026 nov \u00a0 Speed aside, this is a truly incredible difference that is causing me to change behavior\u2026 That's\u2026","rel":"","context":"In &quot;Hardware&quot;","block_context":{"text":"Hardware","link":"https:\/\/monodes.com\/predaelli\/category\/hardware\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/Apple-M1-battery-life.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/Apple-M1-battery-life.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/Apple-M1-battery-life.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/Apple-M1-battery-life.jpg?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/Apple-M1-battery-life.jpg?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2020\/12\/Apple-M1-battery-life.jpg?resize=1400%2C800&ssl=1 4x"},"classes":[]},{"id":7876,"url":"https:\/\/monodes.com\/predaelli\/2020\/12\/21\/will-its-magic-be-back\/","url_meta":{"origin":7813,"position":1},"title":"Will its magic be back?","author":"Paolo Redaelli","date":"2020-12-21","format":"link","excerpt":"Apple M1 foreshadows Rise of RISC-V I dearly hope that we will soon have a RISC-V based, massively parallel and with many specialized coprocessors. 
The magic of Amiga will soon be back, updated for the 21th century.","rel":"","context":"In &quot;Amiga&quot;","block_context":{"text":"Amiga","link":"https:\/\/monodes.com\/predaelli\/category\/amiga\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":7805,"url":"https:\/\/monodes.com\/predaelli\/2020\/12\/06\/un-chip-risc-v-promette-di-demolire-apple-m1-ecco-chi-ce-dietro-e-di-cosa-si-tratta\/","url_meta":{"origin":7813,"position":2},"title":"Un chip RISC-V promette di demolire Apple M1. Ecco chi c&#8217;\u00e8 dietro e di cosa si tratta","author":"Paolo Redaelli","date":"2020-12-06","format":false,"excerpt":"La californiana Micro Magic afferma di aver messo a punto un core basato su ISA RISC-V che non teme confronti, nemmeno l'interessantissimo Apple M1 su base ARM, grazie a frequenze intorno ai 5 GHz e consumi estremamente ridotti. Un chip RISC-V promette di demolire Apple M1. Ecco chi c'\u00e8 dietro\u2026","rel":"","context":"In &quot;Hardware&quot;","block_context":{"text":"Hardware","link":"https:\/\/monodes.com\/predaelli\/category\/hardware\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":475,"url":"https:\/\/monodes.com\/predaelli\/2015\/06\/25\/its-actually-over\/","url_meta":{"origin":7813,"position":3},"title":"It&#8217;s actually over","author":"Paolo Redaelli","date":"2015-06-25","format":false,"excerpt":"Google's corporate motto is \"Don't be evil\". 
For a convenient definition of \"evil\" it seems when you read news such as this: The default behavior of hotword, a new, black-box module in Chrome (and its free\/open cousin, Chromium) causes it to silently switch on your computer's microphone and send whatever\u2026","rel":"","context":"In &quot;Software Libero&quot;","block_context":{"text":"Software Libero","link":"https:\/\/monodes.com\/predaelli\/category\/software\/software-libero\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":8087,"url":"https:\/\/monodes.com\/predaelli\/2021\/01\/22\/bello-ma-non-e-questo-il-punto\/","url_meta":{"origin":7813,"position":4},"title":"Bello, ma non \u00e8 questo il punto.","author":"Paolo Redaelli","date":"2021-01-22","format":false,"excerpt":"Apprendo con piacere che Corellium ha rilasciato una versione \"completamente usabile\" di Linux per i Mac M1. Bella notizia, tutto sommato, ma non \u00e8 la notizia attesa da chi crede che il software debba essere libero per permettere alle persone di essere libere. 
\u00c8 notevole la velocit\u00e0 con cui han\u2026","rel":"","context":"In &quot;Ethics&quot;","block_context":{"text":"Ethics","link":"https:\/\/monodes.com\/predaelli\/category\/ethics\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":14501,"url":"https:\/\/monodes.com\/predaelli\/2025\/12\/21\/avoid-nintendo\/","url_meta":{"origin":7813,"position":5},"title":"Avoid Nintendo","author":"Paolo Redaelli","date":"2025-12-21","format":false,"excerpt":"\u00abFSF Says Nintendo's New DRM Allows Them to Remotely Render User's Device 'Permanently Unusuable'\u00bb that is why I will keep avoiding Nintendo hardware as I have done until now.","rel":"","context":"In &quot;Ethics&quot;","block_context":{"text":"Ethics","link":"https:\/\/monodes.com\/predaelli\/category\/ethics\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/7813","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/comments?post=7813"}],"version-history":[{"count":0,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/7813\/revisions"}],"wp:attachment":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/media?parent=7813"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/categories?post=7813"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/tags?post=7813"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}