Wednesday, 21 December 2016

Caches, Latency, RISC & CISC: How Does a 386 Beat a 1.6GHz Celeron?

So I just watched a video by WaybackTECH simply entitled 386DX-33 vs Celeron 420 1.6Ghz Faceoff and it got me thinking. A lot.

WaybackTECH has a lot of cool videos about old hardware and his repair guides are pretty fascinating. I have been frustrated at times, however, when he can't be bothered to explain some of his findings. But hey, I can't be bothered to benchmark a 386DX-33 against a Celeron 420 so we'll call this a team effort, shall we?

Because I used to teach computer science in schools, I had to understand what I was talking about. If you're one of those teachers who reads a book the night before and regurgitates it in class the next day, prepare to be exposed as a fraud, because those kids will ask you questions. And yes, it's happened to me.

I certainly didn't understand the results of the above video. In summary, a 1.6GHz CPU with its caches disabled shows performance equivalent to a 33MHz 386. Yes, really. He didn't use any 'special' benchmarks, didn't cripple any of the components or cheat in any way - all he did was run 16-bit code in DOS. There was some discussion in the comments about how this could be possible and I started writing my own reply. But it got long. Really long. So I've taken it to the blog instead.


The short version

New CPUs are designed to work with cache, whereas old CPUs like the 386 were not (although cache could still dramatically enhance their performance). It doesn't matter how fast a CPU is if it takes many clock cycles to complete an instruction. Example: if it takes 5 cycles for a 16MHz 386 to do a specific task (such as addition), then it can do 3.2 million additions per second. What's happening here is that, even though the Celeron is running 100x faster, if it takes 500 cycles to do an addition (without caches), then it can also only do 3.2 million additions per second. It's actually a lot more complicated than this, as you can see below, but it's a basic, crude way of explaining it.
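To put that arithmetic into code, here's a tiny sketch (the 5-cycle and 500-cycle figures are the illustrative numbers from above, not measured values):

```python
def instructions_per_second(clock_hz, cycles_per_instruction):
    """Effective throughput when every instruction costs a fixed number of cycles."""
    return clock_hz / cycles_per_instruction

print(instructions_per_second(16_000_000, 5))       # 16MHz 386, 5 cycles per add -> 3,200,000/s
print(instructions_per_second(1_600_000_000, 500))  # 1.6GHz Celeron, 500 cycles  -> 3,200,000/s
```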


The long version

So why the hell does a fast CPU sit around for so long with its caches disabled? First we need to talk about how caches and CPUs work. There are many articles out there explaining this along with the reasons why CPUs are so dependent on them, but many of these sources are overly technical or dull. I don't claim to be any less so - it's a dry subject, but I'll do my own summary here anyway for those who a) like to read plain English b) are interested in the history of computing and c) can't be bothered to read it elsewhere.

Fetch, Decode, Execute
When the IBM PC was first released (it was called the Model 5150 back then - no one called it a PC yet), the Intel 8088 CPU, the RAM and the 8-bit ISA bus (the path instructions and data travelled along between CPU and RAM) all ran at the same speed: 4.77MHz. This continued when the 80286 came along - later Intel versions of it ran at 8MHz and so did the 16-bit ISA bus. In both cases it would take approximately 5 cycles for a single instruction to be completed. So just because a CPU runs at 8MHz, it doesn't mean it can complete 8 million instructions per second. A 10MHz 286 apparently clocks in at around 1.5 million instructions per second. The reason is that performing a task from start to finish takes time. Let's take the case of a simple addition being carried out (there's a rough sketch of the cycle in code after the list):

The Fetch, Decode, Execute Cycle [source: BBC Education]
1. Retrieve Instruction: the CPU has to access the RAM and find out what instruction it's doing next. This instruction is remembered in a 'register' (a very small area of memory).

2. Data A: it then needs to find out what data to perform the instruction on. For addition we need two pieces of data, each stored in a different location. Each piece of data has a) an address and b) the data itself. On the 8088, the address and data shared the same bus, so the address had to go out before the data could come back - that's two cycles. The 286 separated the address and data buses so that the two could happen simultaneously. Once retrieved, the data is also stored in a register.

3. Data B: the second piece of data is retrieved and stored in a register.

4. Perform instruction: addition is performed on the data in question.

5. Store new data: the result of the addition is stored back in RAM.
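To make the five steps above concrete, here's a toy model of the cycle in Python. The memory layout, the instruction format and the one-cycle-per-step costing are all invented for illustration - a real 8088 or 286 is far more involved:

```python
# Toy fetch-decode-execute model: one ADD instruction, charged one cycle per step.
memory = {
    0x00: ("ADD", 0x10, 0x11, 0x12),  # instruction: add [0x10] and [0x11], store the result at [0x12]
    0x10: 7,                          # data A
    0x11: 5,                          # data B
    0x12: 0,                          # the result will go here
}

def run_one_instruction(pc):
    cycles = 0
    opcode, addr_a, addr_b, addr_out = memory[pc]; cycles += 1  # 1. retrieve the instruction
    assert opcode == "ADD"                                      #    (decode: only ADD exists in this toy)
    a = memory[addr_a]; cycles += 1                             # 2. retrieve data A
    b = memory[addr_b]; cycles += 1                             # 3. retrieve data B
    result = a + b; cycles += 1                                 # 4. perform the instruction
    memory[addr_out] = result; cycles += 1                      # 5. store the new data
    return cycles

print(run_one_instruction(0x00), memory[0x12])  # -> 5 12 (five cycles, result written back)
```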

The list and the sketch above are a very basic description of what happens, but they show why an instruction takes 5 cycles instead of 1. Modern CPUs work in much the same way, but with some extra bells and whistles such as branch prediction, parallel processing and pipelining. The thing about many of these modern features is that they were designed from day one to work with fast cache RAM. Which leads us on to...

Latency
The biggest factor to be aware of is latency. When the CPU and RAM run at the same speed, it takes one cycle for the RAM to respond to a request from the CPU. As CPU speeds increased, so did the speed of the bus, but only to a point. On faster 286 systems, such as those running at 20MHz, some expansion cards would not work correctly because they were never designed for such speeds. This led motherboard manufacturers to introduce a 'bus divider', which allowed the CPU to run at its normal speed while peripherals ran at a slower rate, e.g. 10MHz.

Meanwhile, the RAM remained connected directly to the CPU but, because faster RAM was extremely expensive, it continued to run at slower speeds in most consumer desktops. This introduced a problem: if the RAM was running half as fast as the CPU, for example, the CPU would have to wait twice as long for a request for data to be fulfilled. These idle cycles - 'wait states' - caused a performance hit, otherwise known as a bottleneck.

DRAM [source: Wikimedia]
The speed of RAM and of the CPU is typically given in multiples of Hz (cycles per second). Dividing 1 second by this figure gives the 'clock period' in nanoseconds - the length of a single clock cycle, and the smallest unit of time in which anything can happen. The faster a component is in MHz, the shorter each cycle, and the more it can do per second. In the 286 era, PC system memory was typically DRAM with a latency of around 120 nanoseconds, which works out at about 8MHz. SRAM of the time had a latency of only 10ns (100MHz) but was very expensive, so it was used to create a small area of memory sitting between the CPU and the main RAM - the cache. A 386DX motherboard with 64KB of cache made a huge difference to the performance of the system because frequently-used instructions and data stayed in the cache, meaning accesses to the slower main RAM were less frequent. I'm not going to go into any more detail about the caching process itself because it's not important here - especially not for a video where it wasn't used!
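The MHz-to-nanoseconds conversion that paragraph relies on, spelled out (using the 120ns and 10ns figures quoted above):

```python
def period_ns(freq_hz):
    """Length of one clock cycle, in nanoseconds."""
    return 1e9 / freq_hz

def equivalent_mhz(latency_ns):
    """Rough 'equivalent clock speed' for a given access latency."""
    return 1000 / latency_ns

print(equivalent_mhz(120))    # ~8.3  -> 120ns DRAM behaves like an ~8MHz part
print(equivalent_mhz(10))     # 100.0 -> 10ns SRAM behaves like a 100MHz part
print(period_ns(33_000_000))  # ~30.3 -> one cycle of a 386DX-33 lasts about 30ns
```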

Note: a 386DX with cache is significantly faster than a 386SX without cache for two reasons. 1) The SX had to wait for the RAM every time it requested some data and 2) the SX had a 16-bit external bus, meaning it took two cycles to retrieve 32-bit data or instructions from RAM instead of 1 cycle for the fully 32-bit DX.
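The bus-width half of that note is just this bit of arithmetic (a sketch assuming an ideal bus with no other overhead):

```python
import math

def bus_transfers(data_bits, bus_bits):
    """How many bus transfers it takes to move one piece of data."""
    return math.ceil(data_bits / bus_bits)

print(bus_transfers(32, 32))  # 386DX: 32-bit external bus -> 1 transfer
print(bus_transfers(32, 16))  # 386SX: 16-bit external bus -> 2 transfers
```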

Moving on, we now skip forward to more modern CPUs. Over time the speed gap between CPUs and RAM has widened significantly - RAM manufacturers focused on capacity, while CPU manufacturers focused on speed and efficiency. Proportional increases in cache size and cleverer CPUs have bridged this gap somewhat, but it is still clock cycles and wait states that determine how quickly a system can perform when cache isn't there to lubricate things. Remove the cache and you effectively negate the GHz advantage the CPU has over the RAM, because it sits there twiddling its thumbs most of the time. A 386 CPU takes maybe 1 or 2 cycles to access memory because it runs at a similar speed to the RAM - that's not much twiddling at all. The difference between CPU speed and RAM speed is the single biggest factor affecting performance, and this difference is pretty big in newer systems. Let's look at a 2GHz Pentium 4:

- The 2GHz CPU has a cycle time of 0.5ns.
- Its RAM runs at 400MHz, so the latency is 2.5ns.

2.5ns is the equivalent of 5 cycles, and that's how long the CPU has to wait between memory accesses, so it is effectively running at 400MHz too. It's only caching that prevents every memory access from taking this many cycles and allows the full speed of the CPU to be harnessed. But a 400MHz CPU should still kick a 386's butt, right? There's something else...
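Putting the Pentium 4 numbers above through the same kind of arithmetic - a crude model that just counts how many CPU cycles fit into one memory access:

```python
import math

def stall_cycles(cpu_hz, mem_latency_ns):
    """CPU cycles spent waiting for a single memory access."""
    cpu_period_ns = 1e9 / cpu_hz
    return math.ceil(mem_latency_ns / cpu_period_ns)

def effective_hz(cpu_hz, mem_latency_ns):
    """Crude 'effective clock' if every access has to go all the way out to RAM."""
    return cpu_hz / stall_cycles(cpu_hz, mem_latency_ns)

print(stall_cycles(2_000_000_000, 2.5))   # 5 cycles of waiting per access
print(effective_hz(2_000_000_000, 2.5))   # 400,000,000 -> effectively a 400MHz part
```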
CISC & RISC
WaybackTECH alluded to the other factor in his recent NexGen video: CISC and RISC. He declined the opportunity to explain the difference between the two, so, again, I'll give a short explanation (although this entire article is turning into a long explanation!).

Every CPU has what's called an 'instruction set'. The original Intel 386, the granddaddy of all modern x86 CPUs, was a CISC (Complex Instruction Set Computing) CPU. As most people know, a CPU is made up of millions of transistors, and the way these are wired together determines the range of tasks, or instructions, it can perform. Let's take addition again. To add two 1-bit numbers and carry a remainder you need around 18 transistors. Multiply this by 64 to be able to add two 64-bit numbers and that's quite a lot. Now, if you want to perform multiplication you can either use the adder logic to perform addition many times until you get your result, or you can build dedicated logic that will perform multiplication in one go. The latter approach is faster, but it also adds to the complexity and the cost of the CPU. Currently the x86-64 instruction set stands at over 2,000 instructions, according to Stefan Heule at least. If basic addition of 64-bit numbers uses over 1,000 transistors, you can see how nearly 2,000 further, more complex instructions will multiply that number many times over. Then you have to add transistors for cache and registers. More transistors increase the physical footprint of the CPU die, the amount of electrical power required, and the amount of heat produced.
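The adder-versus-dedicated-multiplier trade-off described above, expressed in code. The 'passes' counted here stand in for trips through the hardware logic, not real cycle counts:

```python
def multiply_by_repeated_addition(a, b):
    """Reuse the adder logic: cheap to build, but many steps to finish."""
    total, adder_passes = 0, 0
    for _ in range(b):
        total += a          # one more trip through the adder
        adder_passes += 1
    return total, adder_passes

def multiply_with_dedicated_logic(a, b):
    """Dedicated multiplier hardware: one step, many more transistors."""
    return a * b, 1

print(multiply_by_repeated_addition(7, 200))  # (1400, 200) -> 200 passes through the adder
print(multiply_with_dedicated_logic(7, 200))  # (1400, 1)   -> done in a single operation
```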

RISC (Reduced Instruction Set Computing) addresses some of these issues. This approach to CPU design doesn't reduce the actual number of instructions the CPU is capable of performing, as you might assume. It would more accurately be called 'Less Complex Logic Instruction Set Computing', because it's the physical transistor logic that is simplified, which means it takes less time to complete a given instruction (e.g. 1 cycle instead of 5). The direct result is more space on the chip for registers, less frequent trips to main memory, and faster performance overall. Again, this is a basic (hopefully not inaccurate) explanation.

Up until the Pentium, Intel CPUs and their clones used a pure CISC approach: dense instructions that made the most of available RAM but also required frequent access to main memory. It became clear that the RISC approach was better from an efficiency point of view, especially in a scenario where RAM is significantly slower than the CPU. Intel therefore developed a RISC processor core that still accepted x86 CISC instructions by translating them into RISC instructions internally. Extra cycles are required for this, which is fine when you have lots of spare cycles at your disposal. But in a cache-less system that is only running as fast as the RAM, these extra cycles divide the speed even further. I'd like to be able to provide actual proof of this but I have reached the limits of my technical insight. Maybe someone else can, but I'm going to let the evidence speak for itself - I'm just providing an explanation.
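I can't prove the exact numbers either, but the shape of the argument can be sketched. Every figure below is invented purely for illustration; the point is only that the translation overhead sits on top of whatever the memory makes you wait for:

```python
# Purely illustrative - none of these cycle counts come from a real CPU.
def cycles_per_x86_instruction(translate_cycles, uop_cycles, mem_accesses, wait_per_access):
    """Translate the CISC instruction, run its internal RISC micro-ops, pay the memory waits."""
    return translate_cycles + uop_cycles + mem_accesses * wait_per_access

# With caches: memory answers in a couple of cycles, so the translation overhead is lost in the noise.
print(cycles_per_x86_instruction(2, 3, mem_accesses=2, wait_per_access=2))    # 9 cycles
# Without caches: hundreds of cycles of waiting, with the same overhead stacked on top.
print(cycles_per_x86_instruction(2, 3, mem_accesses=2, wait_per_access=250))  # 505 cycles
```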

Feel free to comment below if you'd like to add / clarify / critique. Just be nice about it :)

References
The Gap between Processor and Memory Speeds (semanticscholar.org)
What's new in CPUs since the 80s and how does it affect programmers? (danluu.com)
Memory specs for Pentium processors (philrees.co.uk)
Why can't you have both high instructions per cycle and high clock speed? (superuser.com)
Single-Cycle Performance (washington.edu)
CPU cache (wikipedia.org)
Reduced instruction set computing (wikipedia.org)
Complex instruction set computing (wikipedia.org)
A Zero Wait State Secondary Cache for Intel’s Pentium (nxp.org)
System Timing (phatcode.net)
Why does Intel hide internal RISC core in their processors? (stackoverflow.com)