The Microarchitecture of Intel, AMD and VIA CPUs [pdf]

CalChris · on June 4, 2017

Less well known, but Torbjorn Granlund's "Instruction latencies and throughput for AMD and Intel x86 processors" has also been updated for Ryzen.

https://gmplib.org/~tege/x86-timing.pdf

glangdale · on June 4, 2017

I hadn't seen this before, but it looks a lot like it's less well known largely because it's very incomplete (almost no SIMD) and redundant with Agner Fog's work and InstLat.

CalChris · on June 4, 2017

It concentrates on integer ops and leaves out branches altogether. The Intel Optimization manual is pretty light on indirect branches as well.

What I like about it is that it makes it easy to compare how microarchitectures handle a given instruction over time. For example, the BT instructions were a little slow (2 cycles, 1 port) through Haswell. With Broadwell that changed to 1 cycle and 2 ports. Similarly, CMOV improved with Broadwell (remember Linus' rant about the evils of CMOV?) and is now 1 cycle latency.

I don't know InstLat. Do you have a link?

mulvya · on June 4, 2017

http://instlatx64.atw.hu/

glangdale · on June 4, 2017

I am very fond of this document, and am constantly amazed at how the commentariat, here and elsewhere, frequently like to theorize about what instructions "might be expensive" without bothering to look them up.

brianwawok · on June 4, 2017

Many programmers do things "because they are faster" without doing any work to test the theory. A little bit funny and a little bit sad.

Thankful the guy engineering bridges doesn't make up which material to use like software devs pick algos.

jcranmer · on June 4, 2017

> Thankful the guy engineering bridges doesn't make up which material to use like software devs pick algos.

Except that one time where the builder said "this joint is hard to build, can we modify it slightly?", the engineer looked at the change, said "sure", and 114 people died when said joint failed.

CalChris · on June 4, 2017

Agner Fog's microarchitecture document has been updated for AMD Ryzen.

Coding_Cat · on June 4, 2017

Does anyone know how Agner actually produces all this information? It can't be easy to determine all these parameters.

CalChris · on June 4, 2017

In a word, empirically. In a few more words, empirically, plus reading a ton of Intel, AMD and VIA documentation and, I'd posit, some of the patent and academic literature.

tmccrmck · on June 4, 2017

He explains his method in Instruction Tables [1], in the 'How the values were measured' section, and he even includes a zip of the code. I found this part particularly interesting:

> It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables. But in fact, the only value that makes sense to performance optimization is the sum of the write time and the read time.

[1] http://www.agner.org/optimize/instruction_tables.pdf
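To make the quoted point concrete, here is a minimal sketch of the only thing software can time: a store followed by a dependent load from the same address. This is not Agner Fog's test code. The function name `ns_per_store_load_round_trip` is made up for illustration, and it assumes a POSIX system with `clock_gettime`. It reports wall-clock nanoseconds rather than cycles, and each iteration also includes one integer add, so treat the number as an upper bound on the combined write+read latency.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Hypothetical helper: times a chain of store -> load -> add on one address.
 * Returns nanoseconds per iteration; *final receives the counter so the
 * caller can check that the chain really ran `iters` times. */
double ns_per_store_load_round_trip(uint64_t iters, uint64_t *final) {
    volatile uint64_t slot = 0;  /* volatile: every access is a real store or load */
    uint64_t x = 0;

    double t0 = now_ns();
    for (uint64_t i = 0; i < iters; i++) {
        slot = x;      /* store */
        x = slot + 1;  /* load from the same address; next store depends on it */
    }
    double t1 = now_ns();

    *final = x;
    return (t1 - t0) / (double)iters;
}
```

Calling it with something like 100 million iterations and multiplying the result by the core clock in GHz gives a rough cycles-per-round-trip figure. There is no way to split that figure into a "write" part and a "read" part from software, which is the point of the quote.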

magnat · on June 4, 2017

For Intel CPUs, it's most likely based on the Architectures Optimization Reference Manual published by Intel - https://www.intel.com/content/dam/www/public/us/en/documents...

acqq · on June 4, 2017

No. He measures the latencies with carefully written programs.
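A rough sketch of what such a program can look like, for anyone curious. This is not Agner Fog's actual test code, just the general idea: a dependent chain exposes latency, while several independent chains expose throughput. The function names are invented for illustration, and it assumes GCC or Clang (for the empty inline asm that stops the compiler from folding the loop), a 64-bit target, and POSIX `clock_gettime`. It reports nanoseconds, not cycles.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <stdint.h>
#include <time.h>

#define K 0x9E3779B97F4A7C15ull  /* arbitrary odd multiplier */

/* Monotonic wall-clock time in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Latency: every multiply needs the previous result, so they run serially. */
double ns_per_dependent_mul(uint64_t iters, uint64_t *sink) {
    uint64_t x = 3;
    double t0 = now_ns();
    for (uint64_t i = 0; i < iters; i++) {
        x *= K;
        __asm__ volatile("" : "+r"(x));  /* keep x live in a register each step */
    }
    double t1 = now_ns();
    *sink = x;
    return (t1 - t0) / (double)iters;
}

/* Throughput: four independent chains the CPU is free to overlap.
 * Pass a multiple of 4 for iters so the per-multiply figure is exact. */
double ns_per_independent_mul(uint64_t iters, uint64_t *sink) {
    uint64_t a = 3, b = 5, c = 7, d = 11;
    double t0 = now_ns();
    for (uint64_t i = 0; i < iters; i += 4) {
        a *= K; b *= K; c *= K; d *= K;
        __asm__ volatile("" : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    }
    double t1 = now_ns();
    *sink = a + b + c + d;
    return (t1 - t0) / (double)iters;
}
```

On a typical out-of-order core the second number should come out noticeably smaller than the first, since the multiplier is pipelined. Converting to cycles needs the actual core clock, which turbo and frequency scaling make fuzzy, so serious measurements usually count cycles directly instead of timing with a wall clock.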

throwaway-1209 · on June 5, 2017

Does anyone know of a similar document for ARM, and in particular for the various flavors of aarch64?

DamonHD · on June 4, 2017

Wow! This is a great doc! These days I'm targeting things other than x86 for the day job, but this level of insight, when also armed with -O3 -S assembly output from a compiler, is what really lets one go to town...