These are indeed very good articles. Even if you are all-in on *nix/BSD, reading them can be quite eye-opening. While at the lowest level the Windows kernel is basically a black box compared to a Unix kernel, in some places there are unique design choices that lead to unique problems. Sadly we do not usually get to see their resolution or the impact they had on MS's customers.
Intel is a year late on delivering Jasper Lake (the Gemini Lake refresh); Tremont is also way too late. Elkhart Lake (the Xeon-branded Atom) is delayed indefinitely. Skyhawk Lake is said to arrive two to three quarters late, next year.
Mercury Lake seems certain to be cancelled, as early buyers are getting refunds after two years of repeated shipping delays.
And yes, Lakefield, "Intel's saviour", is only just coming out of the fabs as we speak.
Chinese laptop OEMs are scratching their heads very hard now. They all want to jump ship to AMD, but Intel holds them hostage through completely anticompetitive "preferred partner" agreements.
Things are so bad among OEMs in China that people are now making laptops with desktop chips
This makes me worried about the future of processors. I know bugs have always existed, and sometimes they were very serious, but they seem to have increased in frequency in the last few years.
I've met with Intel execs; one of them told me, not in as much detail as that anonymous engineer expounds, but simply that they reduced their emphasis on validation. So there's no need to speculate here (pun intended).
Seems to be related to the financialization of everything, where financial engineering supplants actual engineering and financial culture replaces engineering culture at the C-level. Here’s a good article on Boeing about that:
I have an essentially identical question: factory owners do not want their factories to explode and lose ten million dollars, by virtue of profit-seeking capitalism. Yet being profit-oriented while neglecting basic safety and preventative maintenance is a very common phenomenon - it happens all the time, it is seen as the evil of "business people", and it happens even in the most powerful companies. I don't understand it.
Is it simply cognitive bias at work, as LessWrong often says? I think there must be deeper reasons than that. Has anyone written good books on this subject? A sociology, psychology, management and decision-making, or economics perspective would all be welcome.
It's short-termism, which is a common threat to society at large. Think of cities that build on sites vulnerable to earthquakes or floods, for example. After the event hits, everyone says, "Oh, we will never do that again," and it is true for approximately one generation. Then, unless the culture has evolved to value metrics coinciding with these kinds of long-term sustainability issues, they immediately start to relax any preventative policies.
In business, all the cycles are shorter, there's always a "new thing", and next to nobody has the kind of deep institutional experience you would see in a big topic like city planning. Thus the valuation metrics of all agents will quickly fall to current market pricing, and competition may act to hold those metrics in place - if you are the only restaurant in town that doesn't cut corners, customers will complain about how overpriced you are. Regulation, labor and consumers exercising their power all have a role in changing what a business can or can't do by raising the floors on acceptable practice.
But let's say you are exceptionally good at operating a restaurant on your own - quality everywhere - and grow to have a chain. Now you need to hire managers, and the viable hiring pool consists of the people who were cutting corners before - because there is literally nobody else out there. Good luck retraining them!
Plus, at the corporate scale, you end up with fiefdoms and power struggles leading to metrics that agree with the current internal political situation, not industry or marketplace factors, and certainly not sustainability metrics. A business is a "machinery of people" and needs periodic tune-ups and reprogramming to go in vaguely the right direction.
In a lot of ways what it all comes down to is one of my personal favorite phrases, "fix ordinary things." Most of the time, we don't. We have a habit of putting off fixing all sorts of little things in our lives, even if our intentions are good, so of course we're caught by surprise by the disasters.
Does the difference in expected value (from preventing rare, expensive issues) actually exceed the cost of adding preventative maintenance? It is not obvious that this is the case.
That's why it's not done. However, that means something disastrous happens every few years. On average the company is still better off: the increased profit in the good years outweighs the losses in the occasional bad year. This is where black swan theory comes in: how do you calculate the probability of a rare event?
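To make that trade-off concrete, here's a minimal worked sketch in C. All the numbers are made up for illustration (none come from any real plant):

    #include <stdio.h>

    int main(void) {
        /* Illustrative assumptions: skipping maintenance saves $1M/year
           but adds a 2%/year chance of a $40M disaster. */
        double savings_per_year = 1e6;
        double disaster_cost    = 40e6;
        double disaster_prob    = 0.02;

        double expected_loss = disaster_prob * disaster_cost; /* $0.8M/yr */
        printf("expected annual loss: $%.1fM\n", expected_loss / 1e6);
        printf("net expected gain:    $%.1fM/yr\n",
               (savings_per_year - expected_loss) / 1e6);
        /* With these numbers, skipping maintenance is +EV on paper -
           right up until the tail event lands. And the 2% is itself a
           guess: rare-event probabilities are exactly what we are
           worst at estimating, which is the black swan problem. */
        return 0;
    }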
"In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that?"
Far from an expert, but applying general high-level engineering principles, I would guess it's related to the accelerating complexity needed to get linear performance gains, especially as the lower-hanging hardware fruit has been picked.
I know a bit about the Meltdown/Spectre bugs, and that seemingly was the case there: essentially a hack to get more performance out of literally the same hardware. It was done to increase performance, but of course there were unintended consequences that nobody foresaw or cared to look for. It's almost obvious in hindsight - no free lunch, etc.
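For reference, the widely published Spectre v1 "bounds check bypass" pattern looks roughly like this in C. This is a sketch of the gadget shape from the original paper, not a working exploit, and the array and function names here are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    size_t  array1_size = 16;
    uint8_t probe_array[256 * 512];

    /* If the branch predictor guesses the bounds check will pass, the
       out-of-bounds load and the dependent probe_array load execute
       speculatively, leaving a cache footprint a timing side channel
       can later recover, even though the architectural result is
       discarded. */
    void victim(size_t x) {
        if (x < array1_size) {
            uint8_t secret = array1[x];        /* may read out of bounds,
                                                  speculatively */
            volatile uint8_t t =
                probe_array[secret * 512];     /* encodes the byte into
                                                  the cache */
            (void)t;
        }
    }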
I look at Spectre as proof that the low-hanging fruit of software bugs is finally getting picked. Theoretical knowledge of these kinds of bugs goes back decades, but nobody bothered to look for them in practice. Why would you, when most programs contain enough invalid pointers to hack them at will? But the state of the art has advanced, so now people are starting to look at other stuff. Integer overflows are another example of a class of bugs ignored for a long time.
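A minimal sketch of that long-ignored class, assuming a 32-bit size_t; copy_items is a hypothetical helper, not from any real codebase:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* BUG: with a 32-bit size_t, count * item_size can wrap around,
       so malloc allocates a tiny buffer while the loop writes the
       full count of items past its end. */
    void *copy_items(const uint8_t *src, size_t count, size_t item_size) {
        uint8_t *dst = malloc(count * item_size);   /* product may wrap */
        if (!dst)
            return NULL;
        for (size_t i = 0; i < count; i++)
            memcpy(dst + i * item_size, src + i * item_size, item_size);
        /* The safe pattern rejects the overflow up front:
           if (item_size && count > SIZE_MAX / item_size) return NULL; */
        return dst;
    }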
If there is an increase, you could probably attribute some of it to the increasing cost of a re-spin. A bug that would have been fixed in previous generations may not be deemed bad enough to fix today.
I remember in the days of the "sig11 FAQ" [1] some of the gcc maintainers felt the (apparently hardware-related) problems occurred when using gcc more frequently than could be explained just by "compiling is the most stressful thing you're doing with your computer".
So I wouldn't be surprised if there was a processor bug or two around that time that nobody got to the bottom of.
As per the other responses, we also have to consider the possibility that we are just getting better at finding these bugs, and that similar bugs would have been found in previous processor generations if we had looked as hard and as proficiently as we do now.
Centralised crash reporting has been around on Windows since 1998 - I would imagine any bug that causes a crash, even in very rare circumstances, would be very obvious there.
I worked on Windows crash reporting during its first several years. One of the very hard problems was figuring out the "circumstances" that cause a crash. Actually, even defining what "a crash" is turns out to be hard. A single bug can manifest in all sorts of ways, especially if it's something very low-level like a processor or code-generation bug. Inferring the commonalities from a bunch of stack dumps, to make it visible to humans that there really is just one bug there, is probably still a largely unsolved problem.
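As a rough illustration of why bucketing is hard, here is the kind of naive signature hash a crash backend might start from. This is purely hypothetical code, not Windows Error Reporting's actual algorithm; crash_bucket and the frame names are made up:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hash the top few symbolized frames (64-bit FNV-1a) so dumps with
       the same signature land in the same bucket. Everything this
       ignores is the hard part: skipping allocator/runtime frames,
       normalizing offsets, and recognizing that one low-level bug can
       surface under many different top frames. */
    uint64_t crash_bucket(const char *frames[], size_t n_frames) {
        uint64_t h = 14695981039346656037ULL;      /* FNV offset basis */
        size_t top = n_frames < 3 ? n_frames : 3;  /* top 3 frames only */
        for (size_t i = 0; i < top; i++)
            for (const char *p = frames[i]; *p; p++) {
                h ^= (uint8_t)*p;
                h *= 1099511628211ULL;             /* FNV prime */
            }
        return h;
    }

    int main(void) {
        const char *stack[] = { "ntdll!RtlFreeHeap",
                                "app!Widget::close", "app!main" };
        printf("bucket: %016llx\n",
               (unsigned long long)crash_bucket(stack, 3));
        return 0;
    }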
Maybe it just gets more coverage?
I've seen AMD defects from Zen that were fixed in hardware later - there were what, 100-150 defects?
BTW, on the same topic, GPU drivers have many bugs; the WebRender team maintains a wiki of the bugs that affect them.
Vulkan drivers might have fewer bugs, as they seem to be smaller.
Yeah, I suspect this is mostly due to (1) the increase in the number of samples that software (like Chrome) is receiving, and (probably more importantly) (2) improvements to the software that captures information about crashes.
Intel FPU bugs from the '80s/'90s. SPARC E-cache data parity errors (cost cutting) in the '90s/2000s. No thermal protections on AMD CPUs (Athlons?) in the 2000s, to cut costs.
Thinking back to the 70's, some of my early programming was on the 6502, which famously had a bug in its memory indirect jump instruction if the memory address given was at the very end of a 256-byte page (0x??FF).
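That bug is simple enough to model directly. Here's a small C sketch of the buggy address fetch; the 64 KiB array and the names are just illustrative stand-ins:

    #include <stdint.h>
    #include <stdio.h>

    /* The 6502 JMP ($xxFF) bug: the high byte of the jump target is
       fetched from the start of the SAME page rather than the next
       one, because only the low byte of the pointer is incremented. */
    static uint8_t mem[0x10000];               /* stand-in 64 KiB bus */

    uint16_t jmp_indirect_6502(uint16_t ptr) {
        uint8_t  lo      = mem[ptr];
        uint16_t hi_addr = (uint16_t)((ptr & 0xFF00) | ((ptr + 1) & 0x00FF));
        return (uint16_t)(lo | (mem[hi_addr] << 8));
    }

    int main(void) {
        mem[0x02FF] = 0x34;   /* low byte of the intended target $1234 */
        mem[0x0300] = 0x12;   /* where a correct fetch reads the high byte */
        mem[0x0200] = 0x56;   /* where the buggy 6502 actually reads it */
        printf("jumps to $%04X instead of $1234\n",
               jmp_indirect_6502(0x02FF));     /* prints $5634 */
        return 0;
    }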
Many of those "bugs" were probably just accepted engineering tradeoffs for reduced number of transistors.
When you have 3218 transistors, you don't make perfect features. The same thing happens when you have a very limited total instruction count to implement software.
Which Intel FPU bugs are you talking about? There was the late-90s sqrt bug (IIRC an incorrectly initialized lookup table, unfortunately in ROM), and the lower-than-specified precision of the x87 transcendentals.
Those FPU bugs weren’t due to design complexity - the former was presumably a mask bug, and the latter is “maths is hard”.
There was an erratum for some of the ARM Thumb-2 CPU designs that led to incorrect branching when a jump instruction spanned a page boundary, which is the only CPU bug I ever encountered.
I remember the 80386, where the pushad/popad instructions could go wrong, up to locking up the whole CPU, but only if the next instruction did something specific with eax. So at least the i386 had enough complexity to make the next instruction relevant to the previous one, even if the thing had almost no pipelining.
[edit, my reply was to a comment I made up in my head by conflating a bunch of other comments. Going to call it a comment prediction failure and blame my own CPU ;) ]
Oh, the one I was thinking of may have been FDIV rather than fsqrt. I mean, the basic problem was that the lookup table for the first NR guess was zeroed in a bunch of places, so the required number of NR iterations was incorrect and sadness ensued.
But my point was those bugs weren’t due to complexity.
I think things like the f00f bug mentioned in another reply fall better into the "complexity is hard" area. The comment I was replying to claimed these were complexity-driven bugs, and I was trying to say that a number of the bugs we see aren't in particularly complicated portions of the CPU - seriously, the FPU bugs were all in some of the most sane portions of the CPU, vs. f00f, Spectre, etc., which are a fairly direct result of complex interactions between different parts of the CPU.
I think part of it is that so much of computing now happens on closed and semi-closed platforms (GPUs, mobile SoCs, etc.) whose driver and software teams silently work around bugs.
If this is really a CPU bug, why do they make this change Windows-only? It's probably the only platform where they have enough users to be able to measure the crash, but shouldn't the fix be applied when building for x86, independent of the OS? And the patch currently has an effect on Windows/ARM, where this CPU bug won't exist.
As far as I can tell it isn't a generic fix. They had two functions that always crashed on a misaligned read from __security_cookie, so they added a patch that forces the alignment for those functions. Since __security_cookie seems to be a Windows-specific stack-protection mechanism, it makes no sense to apply the workaround on all systems. Someone correct me if I got that wrong.
It being triggered certainly depends on very specific microarchitectural conditions that happen to be created by the instructions generated for those two functions, and the 16-byte alignment also happens to be one of the required triggers.
Applying this fix to all other functions would certainly bloat the binary due to unnecessary padding, and likely reduce performance for everyone, since the shipped binary is the same for everyone.
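For illustration only, here's a hedged sketch of what forcing a specific function's alignment can look like. This is not the actual patch, and check_stack_cookie is a made-up stand-in; the attribute shown is the GCC/Clang spelling, and MSVC would need its own mechanism:

    /* Not the actual patch - a generic sketch of pinning one
       function's alignment so the instructions that load the stack
       cookie never land on the troublesome address pattern. */
    #if defined(__GNUC__) || defined(__clang__)
    #  define FORCE_FN_ALIGN(n) __attribute__((aligned(n)))
    #else
    #  define FORCE_FN_ALIGN(n) /* MSVC needs a different mechanism */
    #endif

    FORCE_FN_ALIGN(16)                 /* hypothetical stand-in function */
    void check_stack_cookie(void) {
        /* ... reads the __security_cookie-style guard value ... */
    }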
This is a great one to start with, "24-core CPU and I can’t type an email": https://randomascii.wordpress.com/2018/08/16/24-core-cpu-and...