


That looks more like the effect of a bad compiler.

On most modern CPUs, including Intel/AMD and ARM-based designs, memory accesses that use stack-relative addressing perform at least as well as, and often much better than, memory accesses through pointers into the heap. The stack is also normally kept in cache automatically by the memory prefetcher.

So whatever caused your performance increase, it was not an inefficiency of the stack accesses but some other difference in the code generated by the compiler. Such an unexpectedly large performance difference would normally be caused only by a compiler bug.

One special case where stack-allocated variables can lead to low performance is when a huge number of variables or many arrays are allocated but only sparsely used, and the initial stack is much smaller. Then any new allocation of an array or a large batch of variables may exceed the current stack size, causing a page fault, after which the operating system grows the stack by one memory page.

This kind of transparent memory allocation via page faults can be much slower than the explicit memory allocation done by malloc or new. This is why big arrays should normally be allocated either statically or on the heap.


What they are saying is that the stackalloc approach causes no GC pressure. You can run it in a tight / infinite loop with essentially no downside. Using a regular heap-allocated array in the same situation will hammer the GC.

C# doesn't do escape analysis and automatically put things on the stack like (for example) Go does. However, the stack size is limited in C#, so I wouldn't suggest going too crazy with stackalloc and deep stacks / recursion. You don't want a stack overflow!


> C# doesn't do escape analysis and automatically put things on the stack

There has been ongoing work in this direction since at least .NET 9, but the effect is currently very limited. The following code, however, has no allocations at runtime, despite containing an object creation:

https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQcDMACR2DC2A3ut...


> C# doesn't do escape analysis

In .NET 9 they started on that, and in .NET 10 they have improved it; you can check Stephen Toub's performance blog post for .NET 10 for more info.


I do not understand this comment. A stack access should never be more expensive than an access to malloc'd data. This is easy to see: malloc gives you a pointer. An address computation to find a variable on the stack also gives you a pointer, but is much cheaper than malloc. If the compiler recomputes the address of a stack variable, it must be because that is deemed cheaper than wasting a register to cache the pointer. Nowadays compilers can transform malloc calls into stack allocations in certain cases.


Not trying to mock here, but did you program assembly in the '80s/'90s and/or look at compilers for CPUs of comparable complexity to the 68000 series or older?

Yes, relative addressing on older machines like that does hurt a ton (I was coding a jam game for the Game Boy recently and had to re-orient some of my code away from stack usage, since I've gotten a tad "lazy" over the years and didn't care to program 100% assembly).

On modern machines (since roughly the Pentium-generation CPUs) that can do one-cycle multiplications, memory-offset addressing is often irrelevant to the performance of access operations.

More importantly, at around 500 MHz-1 GHz the sheer latency of main-memory accesses vs. cached accesses starts to become your main worry, since cache misses approach _hundreds of cycles_ (this is why the entire data-oriented-design philosophy is growing), while on the other hand modern CPUs can pipeline enough instructions to make even the impact of bounds checks more or less insignificant compared to cache misses.


> Just a few days ago I changed stack access to heap access in a larger generated perfect hash and got 100x better performance

Link me the code so that I can show you what you did wrong.


https://github.com/rurban/cmph/commit/4bb99882cc21fd44e86602...

I did nothing wrong :)

And now I remember better. I changed stack access to immediate global access to const arrays. And what was insane was the compilation time, not so much the runtime: -O2 timed out, and when it did compile, the result was 10x slower at runtime. Now compilation is immediate.

This compiles large perfect hashes to C code, like gperf, just better and for huge arrays.


Thank you for linking.

In this case, as I understand it, you switched from a local array that is re-initialized with values upon each function invocation to a global array that is pre-initialized before the program even starts. So my understanding is you're measuring the difference between repeated re-initialization of your local array and no initialization at all, not between "relative" stack access and "absolute" heap or global memory access.

As I understand it, to do it the way you originally wanted, your local array must be `static const` instead of just `const` (so that there's a single copy of it, essentially making it a global variable as well). To be honest, I am not sure why the C compiler doesn't optimize it this way automatically; likely I don't know the C spec well enough to know the reasons.


Super interesting! Was this C# or something? Is there a write-up/mini-blogpost about this somewhere?



