These are indeed very good articles. Even if you are all-in on *nix/BSD, reading them can be quite eye-opening. While at the lowest level the Windows kernel is basically a black box compared to a Unix kernel, in some places there are unique design choices that lead to unique problems. Sadly we do not usually get to see their resolution or the impact they had on MS's customers.
Intel is a year late on delivering Jasper Lake (the Gemini Lake refresh); Tremont is also way too late. Elkhart Lake (the Xeon-branded Atom) is delayed indefinitely. Skyhawk Lake is said to arrive two to three quarters late, next year.
Mercury Lake seems certain to be cancelled, as early buyers are getting refunds after two years of repeated shipping delays.
And yes, Lakefield, "Intel's saviour", is only just coming out of the fabs as we speak.
Chinese laptop OEMs are scratching their heads very hard now. They all want to jump ship to AMD, but Intel holds them hostage through completely anticompetitive "preferred partner" agreements.
Things are so bad among OEMs in China that people are now making laptops with desktop chips
This makes me worried about the future of processors. I know bugs have always existed, and sometimes they were very serious, but they seem to have increased in frequency in the last few years.
I've met with Intel execs; one of them told me, not in as much detail as that anonymous engineer expounds, but simply that they reduced their emphasis on validation. So there's no need to speculate here (pun intended).
Seems to be related to the financialization of everything, where financial engineering supplants actual engineering and financial culture replaces engineering culture at the C-level. Here’s a good article on Boeing about that:
I have an essentially identical question: factory owners do not want their factories to explode and lose ten million dollars, by virtue of profit-seeking capitalism. Yet being profit-oriented while neglecting basic safety and preventative maintenance is a very common phenomenon - it happens all the time, it is seen as the evil of "business people", and it happens even in the most powerful companies. I don't understand it.
Is it simply cognitive bias at work, as LessWrong often says? I think there must be deeper reasons than that. Has anyone written good books on this subject? A sociology, psychology, management and decision-making, or economics perspective would all be welcome.
It's short-termism, which is a common threat to society at large. Think of cities that build on sites vulnerable to earthquakes or floods, for example. After the event hits, everyone says, "Oh, we will never do that again," and it is true for approximately one generation. Then, unless the culture has evolved to value metrics coinciding with these kinds of long-term sustainability issues, they immediately start to relax any preventative policies.
In business, all the cycles are shorter, there's always a "new thing", and next to nobody has the kind of deep institutional experience you would see in a big topic like city planning. Thus the valuation metrics of all agents will quickly fall to current market pricing, and competition may act to hold those metrics in place - if you are the only restaurant in town that doesn't cut corners, customers will complain about how overpriced you are. Regulation, labor and consumers exercising their power all have a role in changing what a business can or can't do by raising the floors on acceptable practice.
But let's say you are exceptionally good at operating a restaurant on your own - quality everywhere - and grow to have a chain. Now you need to hire managers, and the viable hiring pool consists of the people who were cutting corners before - because there is literally nobody else out there. Good luck retraining them!
Plus, at the corporate scale, you end up with fiefdoms and power struggles leading to metrics that agree with the current internal political situation, not industry or marketplace factors, and certainly not sustainability metrics. A business is a "machinery of people" and needs periodic tune-ups and reprogramming to go in vaguely the right direction.
In a lot of ways what it all comes down to is one of my personal favorite phrases, "fix ordinary things." Most of the time, we don't. We have a habit of putting off fixing all sorts of little things in our lives, even if our intentions are good, so of course we're caught by surprise by the disasters.
Does the difference in expected value (from preventing rare, expensive issues) actually exceed the cost of adding preventative maintenance? It is not obvious that this is the case.
That's why it's not done. However, that means something disastrous happens every few years. On average the company is still better off: the increased profit in the good years outweighs the losses in the occasional bad year. This is where black swan theory comes in: how do you calculate the probability of a rare event?
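To make that trade-off concrete, here's a minimal worked sketch in C. All the numbers are made up for illustration (none come from any real plant):

    #include <stdio.h>

    int main(void) {
        /* Illustrative assumptions: skipping maintenance saves $1M/year
           but adds a 2%/year chance of a $40M disaster. */
        double savings_per_year = 1e6;
        double disaster_cost    = 40e6;
        double disaster_prob    = 0.02;

        double expected_loss = disaster_prob * disaster_cost; /* $0.8M/yr */
        printf("expected annual loss: $%.1fM\n", expected_loss / 1e6);
        printf("net expected gain:    $%.1fM/yr\n",
               (savings_per_year - expected_loss) / 1e6);
        /* With these numbers, skipping maintenance is +EV on paper -
           right up until the tail event lands. And the 2% is itself a
           guess: rare-event probabilities are exactly what we are
           worst at estimating, which is the black swan problem. */
        return 0;
    }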
"In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that?"
Far from an expert, but applying general high-level engineering principles, I would guess it's related to the accelerating complexity needed to get linear performance gains, especially as the lower-hanging hardware fruit has been picked.
I know a bit about the Meltdown/Spectre bugs, and that seemingly was the case there: essentially a hack to get more performance out of literally the same hardware. It was done to increase performance, but of course there were unintended consequences that nobody foresaw or cared to look for. It's almost obvious in hindsight - no free lunch, etc.
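For reference, the widely published Spectre v1 "bounds check bypass" pattern looks roughly like this in C. This is a sketch of the gadget shape from the original paper, not a working exploit, and the array and function names here are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    size_t  array1_size = 16;
    uint8_t probe_array[256 * 512];

    /* If the branch predictor guesses the bounds check will pass, the
       out-of-bounds load and the dependent probe_array load execute
       speculatively, leaving a cache footprint a timing side channel
       can later recover, even though the architectural result is
       discarded. */
    void victim(size_t x) {
        if (x < array1_size) {
            uint8_t secret = array1[x];        /* may read out of bounds,
                                                  speculatively */
            volatile uint8_t t =
                probe_array[secret * 512];     /* encodes the byte into
                                                  the cache */
            (void)t;
        }
    }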
I look at Spectre as proof that the low-hanging fruit of software bugs is finally getting picked. Theoretical knowledge of these kinds of bugs goes back decades, but nobody bothered to look for them in practice. Why would you, when most programs contain enough invalid pointers to hack them at will? But the state of the art has advanced, so now people are starting to look at other stuff. Integer overflows are another example of a class of bugs ignored for a long time.
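A minimal sketch of that long-ignored class, assuming a 32-bit size_t; copy_items is a hypothetical helper, not from any real codebase:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* BUG: with a 32-bit size_t, count * item_size can wrap around,
       so malloc allocates a tiny buffer while the loop writes the
       full count of items past its end. */
    void *copy_items(const uint8_t *src, size_t count, size_t item_size) {
        uint8_t *dst = malloc(count * item_size);   /* product may wrap */
        if (!dst)
            return NULL;
        for (size_t i = 0; i < count; i++)
            memcpy(dst + i * item_size, src + i * item_size, item_size);
        /* The safe pattern rejects the overflow up front:
           if (item_size && count > SIZE_MAX / item_size) return NULL; */
        return dst;
    }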
If there is an increase, you could probably attribute some of it to the increasing cost of a re-spin. A bug that would have been fixed in previous generations may not be deemed bad enough to fix today.
I remember in the days of the "sig11 FAQ" [1] some of the gcc maintainers felt the (apparently hardware-related) problems occurred when using gcc more frequently than could be explained just by "compiling is the most stressful thing you're doing with your computer".
So I wouldn't be surprised if there was a processor bug or two around that time that nobody got to the bottom of.
As per the other responses, we also have to consider the possibility that we are just getting better at finding these bugs, and that similar bugs would have been found in previous processor generations if we had looked as hard and as proficiently as we do now.
Centralised crash reporting has been around on Windows since 1998 - I would imagine any bug that causes a crash, even in very rare circumstances, would be very obvious there.
I worked on Windows crash reporting during its first several years. One of the very hard problems was figuring out the "circumstances" that cause a crash. Actually, even defining what "a crash" is turns out to be hard. A single bug can manifest in all sorts of ways, especially if it's something very low-level like a processor or code-generation bug. Inferring the commonalities from a bunch of stack dumps, to make it visible to humans that there really is just one bug there, is probably still a largely unsolved problem.
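As a rough illustration of why bucketing is hard, here is the kind of naive signature hash a crash backend might start from. This is purely hypothetical code, not Windows Error Reporting's actual algorithm; crash_bucket and the frame names are made up:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hash the top few symbolized frames (64-bit FNV-1a) so dumps with
       the same signature land in the same bucket. Everything this
       ignores is the hard part: skipping allocator/runtime frames,
       normalizing offsets, and recognizing that one low-level bug can
       surface under many different top frames. */
    uint64_t crash_bucket(const char *frames[], size_t n_frames) {
        uint64_t h = 14695981039346656037ULL;      /* FNV offset basis */
        size_t top = n_frames < 3 ? n_frames : 3;  /* top 3 frames only */
        for (size_t i = 0; i < top; i++)
            for (const char *p = frames[i]; *p; p++) {
                h ^= (uint8_t)*p;
                h *= 1099511628211ULL;             /* FNV prime */
            }
        return h;
    }

    int main(void) {
        const char *stack[] = { "ntdll!RtlFreeHeap",
                                "app!Widget::close", "app!main" };
        printf("bucket: %016llx\n",
               (unsigned long long)crash_bucket(stack, 3));
        return 0;
    }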
Maybe it just gets more coverage?
I've seen AMD defects from Zen that were fixed in hardware later - there were what, 100-150 defects?
BTW, on the same topic, GPU drivers have many bugs; the WebRender team maintains a wiki of the bugs that affect them.
Vulkan drivers might have fewer bugs, as they seem to be smaller.
Yeah, I suspect this is mostly due to (1) the increase in the number of samples that software (like Chrome) is receiving, and (probably more importantly) (2) improvements to the software that captures information about crashes.
Intel FPU bugs from the '80s/'90s. SPARC E-cache data parity errors (cost cutting) in the '90s/2000s. No thermal protections on AMD CPUs (Athlons?) in the 2000s, to cut costs.
Thinking back to the 70's, some of my early programming was on the 6502, which famously had a bug in its memory indirect jump instruction if the memory address given was at the very end of a 256-byte page (0x??FF).
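That bug is simple enough to model directly. Here's a small C sketch of the buggy address fetch; the 64 KiB array and the names are just illustrative stand-ins:

    #include <stdint.h>
    #include <stdio.h>

    /* The 6502 JMP ($xxFF) bug: the high byte of the jump target is
       fetched from the start of the SAME page rather than the next
       one, because only the low byte of the pointer is incremented. */
    static uint8_t mem[0x10000];               /* stand-in 64 KiB bus */

    uint16_t jmp_indirect_6502(uint16_t ptr) {
        uint8_t  lo      = mem[ptr];
        uint16_t hi_addr = (uint16_t)((ptr & 0xFF00) | ((ptr + 1) & 0x00FF));
        return (uint16_t)(lo | (mem[hi_addr] << 8));
    }

    int main(void) {
        mem[0x02FF] = 0x34;   /* low byte of the intended target $1234 */
        mem[0x0300] = 0x12;   /* where a correct fetch reads the high byte */
        mem[0x0200] = 0x56;   /* where the buggy 6502 actually reads it */
        printf("jumps to $%04X instead of $1234\n",
               jmp_indirect_6502(0x02FF));     /* prints $5634 */
        return 0;
    }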
Many of those "bugs" were probably just accepted engineering tradeoffs for reduced number of transistors.
When you have 3218 transistors, you don't make perfect features. The same thing happens when you have a very limited total instruction count to implement software.
Which Intel FPU bugs are you talking about? There was the late-90s sqrt bug (IIRC an incorrectly initialized lookup table, unfortunately in ROM), and the lower-than-specified precision of the x87 transcendentals.
Those FPU bugs weren’t due to design complexity - the former was presumably a mask bug, and the latter is “maths is hard”.
There was an erratum for some of the ARM Thumb-2 CPU designs that led to incorrect branching when a jump instruction spanned a page boundary, which is the only CPU bug I ever encountered.
I remember the 80386, where the pushad/popad instructions could go wrong, up to locking up the whole CPU, but only if the next instruction did something specific with eax. So at least the i386 had enough complexity to make the next instruction relevant to the previous one, even if the thing had almost no pipelining.
[edit, my reply was to a comment I made up in my head by conflating a bunch of other comments. Going to call it a comment prediction failure and blame my own CPU ;) ]
Oh, the one I was thinking of may have been FDIV rather than fsqrt. I mean, the basic problem was that the lookup table for the first NR guess was zeroed in a bunch of places, so the required number of NR iterations was incorrect and sadness ensued.
But my point was those bugs weren’t due to complexity.
I think things like the f00f bug mentioned in another reply fall better into the "complexity is hard" area. The comment I was replying to claimed these were complexity-driven bugs, and I was trying to say that a number of the bugs we see aren't in particularly complicated portions of the CPU - seriously, the FPU bugs were all in some of the most sane portions of the CPU, vs. f00f, Spectre, etc., which are a fairly direct result of complex interactions between different parts of the CPU.
I think part of it is that so much of computing now happens on closed and semi-closed platforms (GPUs, mobile SoCs, etc.) whose driver and software teams silently work around bugs.
If this is really a CPU bug, why do they make this change Windows-only? It's probably the only platform where they have enough users to be able to measure the crash, but shouldn't the fix be applied when building for x86, independent of the OS? And the patch currently has an effect on Windows/ARM, where this CPU bug won't exist.
As far as I can tell it isn't a generic fix. They had two functions that always crashed on a misaligned read from __security_cookie, so they added a patch that forces the alignment for those functions. Since __security_cookie seems to be a Windows-specific stack-protection mechanism, it makes no sense to apply the workaround on all systems. Someone correct me if I got that wrong.
It being triggered certainly depends on very specific microarchitectural conditions that happen to be created by the instructions generated for those two functions, and the 16-byte alignment also happens to be one of the required triggers.
Applying this fix to all other functions would certainly bloat the binary due to unnecessary padding, and likely reduce performance for everyone, since the shipped binary is the same for everyone.
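For illustration only, here's a hedged sketch of what forcing a specific function's alignment can look like. This is not the actual patch, and check_stack_cookie is a made-up stand-in; the attribute shown is the GCC/Clang spelling, and MSVC would need its own mechanism:

    /* Not the actual patch - a generic sketch of pinning one
       function's alignment so the instructions that load the stack
       cookie never land on the troublesome address pattern. */
    #if defined(__GNUC__) || defined(__clang__)
    #  define FORCE_FN_ALIGN(n) __attribute__((aligned(n)))
    #else
    #  define FORCE_FN_ALIGN(n) /* MSVC needs a different mechanism */
    #endif

    FORCE_FN_ALIGN(16)                 /* hypothetical stand-in function */
    void check_stack_cookie(void) {
        /* ... reads the __security_cookie-style guard value ... */
    }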
This is a great one to start with, "24-core CPU and I can’t type an email": https://randomascii.wordpress.com/2018/08/16/24-core-cpu-and...