The output needs a bit of manual review (see the mixing of 32-bit and 16-bit accesses to the same register in the second example), but it was definitely able to find something non-obvious in a sequence far too long for exhaustive search.
Assembler optimization is starting to sound like the ASIC tooling I've been reading about. That's exciting given what's been accomplished with that. :)
I'm not sure what your background is or how much you know about this stuff. Unlike software, almost every aspect of the tooling for hardware is Mega-HARD. Also, I collected a ton of shit on that this year in hopes academics or FOSS types will make OSS alternatives to the Big Three. So, here are some intro and example links for you to go through to get an idea of the processes and techniques involved. Only then, if you're still committed to learning, will I try to organize and send you some links on specific techniques for achieving them.
Note: I was going to post a custom flow w/ example of a MIPS processor, but the damned thing is totally gone from Google and DuckDuckGo. And my filesharing service is down lol. Put an email address, public or disposable, in your profile and I'll send it to you manually. Or just email the address in my profile with your HN name in the subject or body. It's too good to pass up. Do read the two intros first so you appreciate the work that level requires.
I've finished reading the links. The first one was great. Thanks, I learnt something. It's not my field, but I'm vaguely familiar with the workflow, having friends who design ASICs, work at Xilinx, etc. I've encountered some algorithms in advanced compiler texts.
But what I was specifically curious about was applications of stochastic optimization to EDA workflow. It sounded like you'd seen papers on that topic.
My interest is purely in applications of stochastic optimization. I'm not about to jump into EDA.
Ok, now that I know your focus, I basically just used my fast googling technique and searched through recent archives to see if anything popped up. Here are various things in hardware, software, and architecture benefiting from SO to learn from or improve on in your own work.
Note: I don't think this one uses stochastic methods; it uses a number of other techniques. It's significant because it specifically mentions that modern, typically large, designs avoid stochastic optimization for global issues because it's cost-prohibitive compared to geometric methods. That might have significant implications outside EDA, too. Stochastic methods might only be good for local optimizations that are tied together some other way.
Took a brief look. Seems like a cool idea, but it smells like early days.
The example listed on GitHub (population count of 1s in a 64-bit register) I thought was a little underwhelming. It took a poorly optimized loop and, for some reason, came to the conclusion that 0 inputs would be more common than they would be with a uniform distribution of inputs, as you'd get in a compressed stream of bytes, for example (did I miss something there?). So it added some overhead to special case 0, even though 0 only takes one iteration through the loop anyway, and the loop still required up to 64 iterations if you have the high bit set. If you're doing population counts on compressed data (most of what traverses the internet), odds are you'll generally have at least one bit set in every byte, so it'll usually be doing a lot of iterations.
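For reference, the kind of naive loop under discussion is presumably something like this (my own sketch, not the repo's actual code):

```c
#include <stdint.h>
#include <stddef.h>

/* Naive bit-at-a-time population count: the loop runs once per bit
   position up to the highest set bit, so a value with bit 63 set
   costs 64 iterations, while x == 0 falls through right away. */
static size_t popcnt_naive(uint64_t x) {
    size_t n = 0;
    while (x) {
        n += x & 1;
        x >>= 1;
    }
    return n;
}
```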
If the code matters, then on an x86_64 arch, you can definitely burn a few cache lines to use lookup tables for considerably more speed. 256 byte lookup (4 cache lines) to bring the loop down to 8 iterations, 64k (1k cache lines) to bring it down to 4, assuming you're confident you can warm that latter table up in the dcache with enough activity.
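A sketch of the 256-byte-table version described above (my own illustration, again not code from the repo): eight byte-wide lookups per 64-bit word instead of up to 64 shift-and-mask iterations.

```c
#include <stdint.h>
#include <stddef.h>

/* 256-entry table: lut[b] = number of set bits in byte b.
   At 64 bytes per cache line on x86_64, that's 4 cache lines. */
static uint8_t lut[256];

static void init_lut(void) {
    for (int i = 0; i < 256; i++) {
        int c = 0;
        for (int b = i; b; b >>= 1)
            c += b & 1;
        lut[i] = (uint8_t)c;
    }
}

/* Count set bits one byte at a time: exactly 8 lookups per word,
   regardless of which bits are set. */
static size_t popcnt_lut(uint64_t x) {
    size_t n = 0;
    for (int i = 0; i < 8; i++) {
        n += lut[x & 0xff];
        x >>= 8;
    }
    return n;
}
```

The 64k variant is the same idea with a `uint8_t lut[65536]` indexed 16 bits at a time, halving the lookups at the cost of a much larger dcache footprint.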
I guess it isn't making guesses with lookup tables yet, but that's when I'd get all nipply.
> so it added some overhead to special case 0 even though 0 only takes one iteration through the loop anyway, and the loop still required up to 64 iterations if you have the high bit set.
Didn't it simply replace the body of `size_t popcnt(uint64_t x)` with the `popcnt` instruction? Or am I looking at the wrong test?
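For comparison, on GCC or Clang the entire body can be written as a builtin that, when compiled with `-mpopcnt` (or `-march=native` on a CPU that supports it), lowers to the single `popcnt` instruction:

```c
#include <stdint.h>
#include <stddef.h>

/* With -mpopcnt, GCC and Clang compile this builtin down to one
   popcnt instruction; without it, they emit a library/software
   fallback. */
static size_t popcnt_builtin(uint64_t x) {
    return (size_t)__builtin_popcountll(x);
}
```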
It seems crazy that stochastic optimization written by a couple of grad students is better than 30 years of the very best programmers writing optimizing compilers by hand.
Take something like Inria's certified C compiler (CompCert), then run STOKE (with loop optimizations available) and you've got a provably correct optimizing compiler that's as fast as or faster than GCC, Clang, VS, and Intel's, but without the compilation bugs.
There are a lot of compiler optimizations that use global analyses/optimizations that are not covered by optimizing a sequence of assembly instructions. Thus, I doubt that CompCert + STOKE would reach the code quality of production compilers.
Nevertheless, I agree that superoptimizers often find optimizations that are not considered by the compiler engineers. Another good example is the Optgen tool [1], which also identified missing (machine-independent) optimizations in all production compilers.
I would really like the authors to discuss the effects of power-saving features, for example dynamic clocking of the CPU.
Indeed, ATLAS will often fail to converge or arrive at the wrong tuning when automatic power management is not disabled. In a modern system it is becoming increasingly difficult to manage concurrent power-saving features, for example things like Intel's Turbo Boost or the clock governor on your mobile device.
Do you have a reference for this problem? I would love to hear more. Is it thought to be a hardware bug, like this recent snafu with arithmetic (but for which power tuning was not mentioned as contributing) http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-cau... ?
EDIT: Took a look at the issues for the repo, and here are links that work:
https://cs.stanford.edu/people/eschkufz/docs/asplos291-schku...
https://cs.stanford.edu/people/eschkufz/docs/oopsla011-sharm...
https://cs.stanford.edu/people/eschkufz/docs/pldi52-schkufza...
In the issue he also mentioned this paper that's linked from his page:
https://cs.stanford.edu/people/eschkufz/docs/cove.pdf
Going to start reading through these now; seems pretty interesting, and I've never come across something like this before.