uProf is decent, but the instruction-based sampling (IBS) approach is the only way to get pinpoint accuracy on instructions.

> I use ECC on a (consumer) Ryzen chip/board and edac-util seems to give me the same information that it does on Intel - what's missing?

Event-based sampling on Intel is accurate to the instruction level, while event-based sampling on AMD is less accurate. On AMD, you're forced to use the more complicated IBS metrics if you want instruction-level accuracy of events.
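To make "instruction-level accuracy" concrete, here's a toy model of sampling skid. It's only a sketch under my own assumptions: `attribute_samples` and the fixed `skid` value are made up for illustration, and real hardware skid varies from sample to sample. The idea is just that an imprecise sample gets blamed on an instruction a little after the one that actually caused the event.

```python
from collections import Counter

def attribute_samples(event_pcs, skid):
    """Histogram of the instruction index each sample is blamed on.

    event_pcs: instruction indices where the counted event really occurred.
    skid: how many instructions late the sample lands (hypothetical fixed
          value; 0 models precise sampling, >0 models imprecise sampling).
    """
    return Counter(pc + skid for pc in event_pcs)

# Toy program: instruction 3 is a load that misses the cache 100 times.
true_events = [3] * 100

precise = attribute_samples(true_events, skid=0)
imprecise = attribute_samples(true_events, skid=2)

print("precise:  ", dict(precise))    # every sample lands on instruction 3
print("imprecise:", dict(imprecise))  # every sample lands on instruction 5
```

With skid, the profile points at instruction 5, which did nothing wrong. That's the kind of misattribution you avoid with precise sampling.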

Intel also stores branch-history data. Super useful for some developer tools, but I forget which tools those were...

-----------

I think AMD uProf is certainly usable, and the price is good (free). But Intel VTune is just light-years ahead.

AMD vs CUDA, on the other hand, is... closer than I think most people realize. CUDA has a bunch of libraries (Thrust, TensorFlow support, etc.), which help. But if you're doing high-performance coding, you'll likely have to write your own specialized data structures. At least, that's the approach I'm taking with some GPU hobby code I'm writing.

TensorFlow (due to Tensor Cores) and BLAS are solidly NVIDIA advantages. But general-purpose libraries (ex: Thrust) are more of a convenience.

AMD's main disadvantage is documentation, but the tools are actually quite usable. AMD documents the lowest level well (the ISA), but their HIP / HCC / etc. documentation is lacking and difficult for beginners to follow.

AMD should work on updating their beginner guides (their OpenCL guides) for their ROCm framework. Even if it's ROCm OpenCL 2.0 stuff, it's important to get beginners to use their platform. Or at least, update their beginner guides to reference GPUs that have come out within the past 5 years...