I'm curious about a couple of points, though. First, it is basically just
adding cache colouring to the stack, right? In that case, why do only older
HT CPUs have bad performance without it? And wouldn't it make even non-HT
CPUs slightly more efficient WRT caching the stacks of multiple processes?
It's a win on more than older HT CPUs; it's just that those suffer the
most... (there you have 2 "cpus" sharing the cache, meaning you get double
the aliasing)
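To make the aliasing concrete, here is a small sketch of how a set-associative cache picks a set from an address. The cache geometry (64-byte lines, 64 sets, 8 ways, i.e. a 32 KiB cache) and the stack addresses are assumptions for illustration only. With those numbers the set index comes entirely from the low 12 bits, which lie inside the 4 KiB page offset, so virtual and physical addresses agree on them: stacks that start at the same offset within a page all compete for the same set, and two hyperthreads sharing one cache put twice as many stacks into it.

```python
# Assumed cache geometry (illustrative only): 64-byte lines, 64 sets, 8 ways.
LINE_SIZE = 64
NUM_SETS = 64
WAYS = 8

def set_index(addr: int) -> int:
    """Return the cache set an address maps to: drop the line-offset bits,
    then keep log2(NUM_SETS) bits of what remains."""
    return (addr // LINE_SIZE) % NUM_SETS

# Hypothetical stack tops of four processes: different pages, but the
# same offset (0xf00) within each 4 KiB page.
stack_tops = [0x7ffd0000f00 + i * 0x1000 for i in range(4)]

# Every one of them lands in the same set.
print([set_index(a) for a in stack_tops])  # -> [60, 60, 60, 60]
```

Once more than `WAYS` hot stack lines share one set, they start evicting each other even though the rest of the cache may be idle.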
Second, do you remember on what workloads performance suffers? I wonder
if natural variations in the stack pointer as the program runs would mitigate
the effect of this on all but microbenchmarks?
One of the problem cases I remember is network daemons all waiting in
accept() for connections, basically all from the same code path.
Randomizing the stack pointer is a gain for that on all CPUs that have
finite associativity in their caches.
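The accept() case can be sketched as a tiny simulation: each daemon process sits at the same call depth below its stack top, so its hot stack line has the same page offset as everyone else's. The cache geometry, the per-process stack addresses, the 0x100 call depth and the "subtract up to 8 KiB, keep 16-byte alignment" randomization are all hypothetical parameters, not taken from the patch.

```python
import random
from collections import Counter

# Assumed cache geometry (illustrative only).
LINE_SIZE = 64
NUM_SETS = 64
WAYS = 8
PAGE = 4096

def set_index(addr: int) -> int:
    """Cache set for an address: drop line-offset bits, keep the set bits."""
    return (addr // LINE_SIZE) % NUM_SETS

def max_set_load(num_procs: int, randomize: bool, seed: int = 0) -> int:
    """Place one hot stack line per process and return the largest number
    of those lines that land in a single cache set."""
    rng = random.Random(seed)
    sets = Counter()
    for i in range(num_procs):
        # Hypothetical page-aligned stack top for process i.
        top = 0x7ffd0000000 + i * 16 * PAGE
        if randomize:
            # Hypothetical randomization: move down by up to 8 KiB,
            # then round down to a 16-byte boundary.
            top = (top - rng.randrange(8192)) & ~0xF
        # Same code path (blocked in accept()) -> same depth below the top.
        hot = top - 0x100
        sets[set_index(hot)] += 1
    return max(sets.values())

fixed = max_set_load(16, randomize=False)
spread = max_set_load(16, randomize=True)
print("fixed:", fixed, "randomized:", spread, "ways:", WAYS)
```

Without randomization all 16 hot lines pile into one set, twice the assumed 8 ways, so they thrash; with a random offset they are spread across many sets.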
But even if that were so, it seems simple enough that I don't have any
real problem with keeping it, of course.
The reason my patch randomizes it much more is that it makes it a step harder
to write exploits for stack buffer overflows.
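The exploit-hardening effect is mostly arithmetic: if the initial stack pointer can take any of N aligned positions, an exploit that hardcodes a stack address hits only about 1 in N times. The sketch below assumes an 8 KiB random range and 16-byte alignment; both values, the base address and the function names are hypothetical and are not claimed to be what the patch uses.

```python
import random

STACK_RANDOM_RANGE = 8192  # assumed: up to 8 KiB of random offset
ALIGN_MASK = ~0xF          # keep the stack pointer 16-byte aligned

def randomized_sp(base: int, rng: random.Random) -> int:
    """Hypothetical sketch: move the initial stack pointer down by a random
    amount, then round it down to a 16-byte boundary."""
    return (base - rng.randrange(STACK_RANDOM_RANGE)) & ALIGN_MASK

def distinct_positions(base: int) -> int:
    """Count every stack pointer value randomized_sp() can produce."""
    return len({(base - off) & ALIGN_MASK for off in range(STACK_RANDOM_RANGE)})

base = 0x7ffffffff000  # hypothetical 16-byte-aligned stack top
n = distinct_positions(base)
print(n, "possible stack tops; a hardcoded guess hits about 1 in", n)  # n == 513
```

A wider range gives proportionally more positions (and more cache-set spreading), at the cost of a little extra stack address space per process.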