The Microarchitecture of Intel, AMD and VIA CPUs [pdf] (agner.org)
205 points by CalChris on June 4, 2017 | 15 comments


Less well known, but Torbjorn Granlund's "Instruction latencies and throughput for AMD and Intel x86 processors" has also been updated for Ryzen.

https://gmplib.org/~tege/x86-timing.pdf


I hadn't seen this before, but it looks like it's less well known largely because it's very incomplete (almost no SIMD) and redundant with Agner Fog's work and InstLat.


It concentrates on integer ops and leaves out branches altogether. The Intel Optimization manual is pretty light on indirect branches as well.

What I like about it is that it makes it easy to compare how microarchitectures handle a given instruction over time. For example, the BT instructions were a little slow, 2 cycles, 1 port through Haswell. With Broadwell that changed to 1 cycle and 2 ports. Similarly, CMOV improved with Broadwell (remember Linus' rant about the evils of CMOV?) and is now 1 cycle latency.

I don't know InstLat. Do you have a link?



I am very fond of this document, and am constantly amazed at how the commentariat, here and elsewhere, frequently like to theorize about what instructions "might be expensive" without bothering to look them up.


Many programmers do things "because they are faster" with 0 work testing the theories. A little bit funny and a little bit sad.

Thankful the guy engineering bridges doesn't make up which material to use like software devs pick algos.


> Thankful the guy engineering bridges doesn't make up which material to use like software devs pick algos.

Except that one time where the builder said "this joint is hard to build, can we modify it slightly?", the engineer looked at the change, said "sure", and 114 people died when said joint failed.


Agner Fog's microarchitecture document has been updated for AMD Ryzen.


Does anyone know how Agner actually produces all this information? It can't be easy to determine all these parameters.


In a word, empirically. In a few more words, empirically, plus reading a ton of Intel, AMD and VIA documentation and, I'd posit, some of the patent and academic literature.


He explains his method in Instruction Tables [1], in the 'How the values were measured' section, and he even includes a zip of the code. I found this part particularly interesting:

> It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables. But in fact, the only value that makes sense to performance optimization is the sum of the write time and the read time.

[1] http://www.agner.org/optimize/instruction_tables.pdf
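
To make the quoted point concrete, here is a rough sketch of a store-then-load chain in C. It is not the code from the zip; the function names are made up for illustration, and it relies on `volatile` to keep the compiler from removing the memory accesses. Each iteration stores a value, reloads it from the same address, and feeds the reloaded value into the next store, so the only thing that can be timed is the write-plus-read round trip (plus one add), never the write or the read alone.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: a chain of store -> load -> add through one address.
   Each load has to wait for the store before it, and each store needs the
   value produced by the load before it. */
uint64_t store_load_chain(uint64_t seed, uint64_t iters)
{
    volatile uint64_t slot = 0;   /* volatile: each store and load is really emitted */
    uint64_t x = seed;
    for (uint64_t i = 0; i < iters; i++) {
        slot = x;                 /* memory write */
        x = slot + 1;             /* memory read from the same address, then one add */
    }
    return x;
}

/* Hypothetical helper: average CPU time in nanoseconds per round trip.
   This is the combined write + read (+ add) cost; it cannot be split
   into a separate write latency and read latency. */
double ns_per_round_trip(uint64_t iters)
{
    assert(iters > 0);
    volatile uint64_t seed = 0;   /* read after t0 so the work cannot be hoisted */
    volatile uint64_t sink;       /* written before t1 so the work cannot be sunk */

    clock_t t0 = clock();
    sink = store_load_chain(seed, iters);
    clock_t t1 = clock();
    (void)sink;

    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Calling `ns_per_round_trip(100000000)` and multiplying by the core clock in GHz gives an approximate cycle count per iteration, assuming the clock frequency stays put during the run. Check the `-S` output before trusting the number, since the C language does not promise which instructions the loop compiles to.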


For Intel CPUs, it's most likely based on the Architectures Optimization Reference Manual published by Intel - https://www.intel.com/content/dam/www/public/us/en/documents...


No. He measures the latencies with carefully written programs.
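
For anyone wondering what such a program looks like in spirit, a minimal sketch in C follows. It is not Agner's code (his tests are linked from the instruction tables); the names are hypothetical, and it times a multiply-add step rather than a single instruction. The idea is a dependency chain: every step needs the result of the previous one, so the CPU cannot overlap them, and total time divided by step count approximates the latency of one step.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: `iters` multiply-adds with a loop-carried dependency.
   Arithmetic wraps modulo 2^64, which is well defined for unsigned types. */
uint64_t dependent_chain(uint64_t seed, uint64_t mul, uint64_t add, uint64_t iters)
{
    uint64_t x = seed;
    for (uint64_t i = 0; i < iters; i++)
        x = x * mul + add;        /* next step cannot start until x is ready */
    return x;
}

/* Hypothetical helper: average CPU time in nanoseconds per chained step. */
double ns_per_step(uint64_t iters)
{
    assert(iters > 0);
    volatile uint64_t seed = 1;   /* read after t0 so the work cannot be hoisted */
    volatile uint64_t sink;       /* written before t1 so the work cannot be sunk */

    clock_t t0 = clock();
    sink = dependent_chain(seed, 6364136223846793005ULL,
                           1442695040888963407ULL, iters);
    clock_t t1 = clock();
    (void)sink;

    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

With a large count such as `ns_per_step(100000000)`, loop overhead mostly hides behind the chain, so the figure is roughly the multiply latency plus the add latency. Measuring throughput instead would take several independent chains running side by side, and pinning down one specific instruction needs assembly rather than C.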


Does anyone know of a similar document for ARM, and in particular for the various flavors of aarch64?


Wow! This is a great doc! These days I'm targeting things other than x86 for the day job, but this level of insight, when also armed with -O3 -S assembly output from a compiler, is what really lets one go to town...



