I hadn't seen this before, but it looks like it's less well known largely because it's very incomplete (almost no SIMD) and redundant with Agner Fog's work and InstLat.
It concentrates on integer ops and leaves out branches altogether. The Intel Optimization manual is pretty light on indirect branches as well.
What I like about it is that it makes it easy to compare how microarchitectures handle a given instruction over time. For example, the BT instructions were a little slow, 2 cycles, 1 port through Haswell. With Broadwell that changed to 1 cycle and 2 ports. Similarly, CMOV improved with Broadwell (remember Linus' rant about the evils of CMOV?) and is now 1 cycle latency.
I am very fond of this document, and am constantly amazed at how the commentariat, here and elsewhere, frequently like to theorize about what instructions "might be expensive" without bothering to look them up.
> Thankful the guy engineering bridges doesn't make up which material to use like software devs pick algos.
Except that one time where the builder said "this joint is hard to build, can we modify it slightly?", the engineer looked at the change, said "sure", and 114 people died when said joint failed.
In a word, empirically. In a few more words, empirically, plus reading a ton of Intel, AMD and VIA documentation and, I'd posit, some of the patent and academic literature.
He explains his method in Instruction Tables [1], in the 'How the values were measured' section, and he even includes a zip of the code. I found this part particularly interesting:
> It is not possible to measure the latency of a memory read or write instruction with software methods. It is only possible to measure the combined latency of a memory write followed by a memory read from the same address. What is measured here is not actually the cache access time, because in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again. The latency of this store forwarding process is arbitrarily divided into a write latency and a read latency in the tables. But in fact, the only value that makes sense to performance optimization is the sum of the write time and the read time.
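To make the "write followed by a read from the same address" idea concrete, here is a rough sketch in C of what such a round-trip measurement could look like. This is not Agner's test code (his is in the zip); the names `store_load_chain` and `ns_per_round_trip` are made up for illustration, and I'm assuming a POSIX system with `clock_gettime`. The number it produces is wall-clock nanoseconds per iteration, not cycles, and each iteration also includes one integer add, so treat it as a ballpark for the combined store-plus-load time only.

```c
#define _POSIX_C_SOURCE 199309L  /* assumption: POSIX system, needed for clock_gettime */
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: store a value, then load it straight back from the
   same address. The loaded value feeds the next store, so every iteration
   has to wait for the previous store/load round trip. `volatile` stops the
   compiler from deleting the memory accesses. */
static uint64_t store_load_chain(uint64_t iters)
{
    volatile uint64_t slot = 0;
    uint64_t v = 0;
    for (uint64_t i = 0; i < iters; i++) {
        slot = v + 1;   /* store (plus one add on the dependency chain) */
        v = slot;       /* dependent load from the same address */
    }
    return v;           /* equals iters */
}

/* Hypothetical helper: average wall-clock nanoseconds per store+load
   round trip over `iters` iterations (iters must be nonzero). */
static double ns_per_round_trip(uint64_t iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint64_t sink = store_load_chain(iters);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;         /* keep the result alive so the loop is not optimized away */
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Call `ns_per_round_trip` with a large iteration count from your own `main` and multiply by your clock frequency in GHz to get an approximate cycle count. As the quote says, only that combined figure means anything; there is no way to split it into a separate write and read latency from software.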
Wow! This is a great doc! These days I'm targeting things other than x86 for the day job, but this level of insight, when also armed with -O3 -S assembly output from a compiler, is what really lets one go to town...
https://gmplib.org/~tege/x86-timing.pdf