STOKE – a stochastic optimizer for x86_64 assembly

anp · on Jan 19, 2016

I was very curious to skim the papers that (I presume) covered results on non-toy programs. Unfortunately all three 404 :(.

EDIT: Took a look at the issues for the repo, and here are links that work:

https://cs.stanford.edu/people/eschkufz/docs/asplos291-schku...

https://cs.stanford.edu/people/eschkufz/docs/oopsla011-sharm...

https://cs.stanford.edu/people/eschkufz/docs/pldi52-schkufza...

In the issue he also mentioned this paper that's linked from his page:

https://cs.stanford.edu/people/eschkufz/docs/cove.pdf

Going to start reading through these now, seems pretty interesting and I've never come across something like this before.

viraptor · on Jan 19, 2016

> I've never come across something like this before.

Have a look at Souper. I believe it's got slightly different approach, but the same idea / result. (https://github.com/google/souper)

Actually I'm a bit disappointed that none of the papers references it (did a quick text search only).

UniQP · on Jan 19, 2016

That's because a) the initial STOKE paper (2013) is older than Souper (initial commit 2014) and b) there is no Souper paper so far

derf_ · on Jan 19, 2016

I applied this to some pieces of Daala's entropy coder: https://review.xiph.org/763/#msg2

The output needs a bit of manual review (see the mixing of 32-bit and 16-bit accesses to the same register in the second example), but it was definitely able to find something non-obvious in a sequence far too long for exhaustive search.

nly · on Jan 19, 2016

> This will produce a directory named bins that contains the text of every function contained in the binary ./a.out.

This is pretty cool functionality in itself!

nickpsecurity · on Jan 19, 2016

"Stochastic optimization," "equivalence checking..."

Assembler optimization is starting to sound like the ASIC tooling I've been reading about. That's exciting given what's been accomplished with that. :)

RossBencina · on Jan 19, 2016

Can you recommend related reading in the ASIC tooling literature?

nickpsecurity · on Jan 19, 2016

I'm not sure what your background is or how much you know about this stuff. Unlike software, almost every aspect of the tooling for hardware is Mega-HARD. Also, I collected a ton of shit on that this year in hopes academics or FOSS types will make OSS alternatives to Big Three. So, here's some intro and example links for you to go through to get an idea of processes and techniques involved. Only then, if you're still committed to learning, will I try to organize and send you some links on specific techniques for achieving them.

ASIC flow and EDA techniques intro http://cc.ee.ntu.edu.tw/~ywchang/Courses/PD/EDA_Chapter1.pdf

Another covering most flows http://www.csl.cornell.edu/courses/ece5745/handouts/ece5745-...

Modern example of a Standard Cell flow http://www.cs.berkeley.edu/~yunsup/papers/riscv-esscirc2014....

A flow that's a lot closer to the metal with some custom work http://lejpt.academicdirect.org/A19/133_143.pdf

Note: I was going to post a custom flow w/ example of MIPS processor but damned thing is totally gone from Google and DuckDuckGo. And my filesharing service is down lol. Put an email address, public or disposable, in your profile then I'll send it to you manually. Or just email the address in my profile with your HN name in subject or body. It's too good to pass up. Do read the two intro's first so you appreciate the work that level requires.

RossBencina · on Jan 27, 2016

I've finished reading the links. The first one was great. Thanks, I learnt something. It's not my field, but I'm vaguely familiar with the workflow, having friends who design ASICs, work at Xilinx, etc. I've encountered some algorithms in advanced compiler texts.

But what I was specifically curious about was applications of stochastic optimization to EDA workflow. It sounded like you'd seen papers on that topic.

My interest is purely in applications of stochastic optimization. I'm not about to jump into EDA.

nickpsecurity · on Jan 29, 2016

Ok, now that I know your focus, I basically just used my fast googling technique and searched through recent archives to see if anything popped up. Here's various things in hardware, software, and architecture benefiting from SO to learn from or improve on in your own work.

Really detailed write-up of how cells can be modeled and transformed with stochastic tie in http://www.eecs.berkeley.edu/~keutzer/classes/244fa2004/pdf/...

Parallel, stochastic, FGPA placement with great speedup http://www.eecg.utoronto.ca/~vaughn/papers/fccm2014_parallel...

Multi-optimization scheme for nanometer https://repositories.lib.utexas.edu/handle/2152/3688

Note: This does not do stochastic I don't think but a number of other methods. It's significant because it specifically mentions modern, typically large, designs avoid stochastic optimization for global issues because it's cost-prohibitive compared to geometric methods. That might have significant implications outside EDA, too. Might only be good for local optimizations that are tied together some other way.

Synthesis and control of system design for industrial project (software) http://www.scielo.org.ar/pdf/laar/v40n2/v40n2a07.pdf

RTL verification by minimizing multiway decision graphs (2006) http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=D9E...

Use in clock-tree synthesis for multi-FPGA's with focus on thermal properties http://thescipub.com/PDF/ajassp.2013.1604.1615.pdf

Synthesis of high-performance analog circuits (1996) https://users.ece.cmu.edu/~rutenbar/pdf/rutenbar-aotcad96.pd...

System levle hardware and software partitioning with simulated annealing and tabu search http://www.gstitt.ece.ufl.edu/courses/fall08/eel4930_5934/re...

Simulated annealing for aiding genetic algorithms in software architecture synthesis http://www.sis.uta.fi/~tp54752/pub/sa/SA_aiding_GA_acta.pdf

Synthesize implementations of hard real-time stuff http://link.springer.com/article/10.1007/BF00872288?no-acces...

Humie awards from Genetic Programming. Most of these could probably be ported to Stochastic Optimization in general. http://www.genetic-programming.org/combined.html

Accelerate floating point problems https://web.stanford.edu/~sharmar/pubs/fp.pdf

Stochastic super-optimization of loop-free binaries https://web.stanford.edu/~sharmar/pubs/asplos291-schkufza.pd...

Note: I'm keeping that one for sure. :)

tslug · on Jan 19, 2016

Took a brief look. Seems like a cool idea, but it smells like early days.

The example listed on github (population count of 1s in a 64bit register) I thought was a little underwhelming. It took a poorly optimized loop, for some reason came to the conclusion that 0's would be more popular than they would be with a uniform distribution of inputs as you'd get in a compressed stream of bytes, for example (did I miss something there?), so it added some overhead to special case 0 even though 0 only takes one iteration through the loop anyway, and the loop still required up to 64 iterations if you have the high bit set. If doing population counts on compressed data (most of what traverses the internet), odds are you'll generally have at least one bit set in every byte, so it'll usually be doing a lot of iterations.

If the code matters, then on an x86_64 arch, you can definitely burn a few cache lines to use lookup tables for considerably more speed. 256 byte lookup (4 cache lines) to bring the loop down to 8 iterations, 64k (1k cache lines) to bring it down to 4, assuming you're confident you can warm that latter table up in the dcache with enough activity.

I guess it isn't making guesses with lookup tables yet, but that's when I'd get all nipply.

barsonme · on Jan 19, 2016

> I thought was a little underwhelming.

I agree, but...

> so it added some overhead to special case 0 even though 0 only takes one iteration through the loop anyway, and the loop still required up to 64 iterations if you have the high bit set.

Didn't it simply replace the body of `size_t popcnt(uint64_t x)` with the `popcnt` instruction? Or am I looking at the wrong test?

joosters · on Jan 19, 2016

Didn't it optimise the whole function into the special purpose popcnt instruction? Are you confusing the initial gcc code with the output?

erichocean · on Jan 20, 2016

It seems crazy that stochastic optimization written by a couple of grad students is better than 30 years of the very best programmers writing optimizing compilers by hand.

Take something like Inria's certified C compiler, then run STOKE (with loop optimizations available) and you've got a provably correct optimizing compiler that's as fast or faster than GCC, Clang, VS, and Intel but without the compilation bugs.

Really cool stuff.

UniQP · on Jan 20, 2016

There are a lot of compiler optimizations that uses global analyses/optimizations that are not covered by optimizing a sequence of assembly instructions. Thus, I doubt that CompCert + STOKE would reach the code quality of production compilers.

Nevertheless, I agree that superoptimizers ofter find optimizations that are not considered by the compiler engineers. Another good example is the Optgen tool [1] that also identified missing (machine-independent) optimizations in all production compilers.

[1] http://pp.ipd.kit.edu/optgen/

pklausler · on Jan 19, 2016

Reminds me of the "superoptimizer" from the 80's.

akkartik · on Jan 19, 2016

Ah, thanks for the reminder. This is useful background: http://blog.regehr.org/archives/923

daguava · on Jan 19, 2016

its sandboxing behavior is a really nice touch. This is a really cool project, interested to see if this continues.

frozenport · on Jan 19, 2016

This is great and reminds me of the ATLAS BLAS[1]

I would really like the authors to discuss the effects of power-saving features, for example dynamic clocking of the CPU.

Indeed ATLAS will often fail to converge or get the wrong tuning when automatic power-tuning is not disabled. In a modern system is becoming increasingly difficult to manage concurrent power saving features, for example things like Intel's Turbo Boost or the clock on your mobile device.

[1]http://math-atlas.sourceforge.net/

jepler · on Jan 19, 2016

Do you have a reference for this problem? I would love to hear more. Is it thought to be a hardware bug, like this recent snafu with arithmetic (but for which power tuning was not mentioned as contributing) http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-cau... ?

frozenport · on Jan 19, 2016

So, modern ATLAS has a CPU throttling check, for example: http://stackoverflow.com/questions/14592401/atlas-install-re...

zump · on Jan 19, 2016

Seems a bit involved to get a result. It should be like american fuzzy lop.