Nick Piggin <nickpiggin@xxxxxxxxxxxx> writes:
> even non HT CPUs possibly slightly more efficient WRT caching the stacks of
> multiple processes?
Not on x86, no, because they normally have physically indexed caches
(except for L1, but that is not really preserved over a context switch).
HT is just a special case because the two threads essentially share the cache.
In theory it could help on non-x86 CPUs with virtually indexed caches,
but they would probably need more advanced forms of cache colouring anyway.
> Second, on what workloads does performance suffer, can you remember? I wonder
> if natural variations in the stack pointer as the program runs would mitigate
> the effect of this on all but micro benchmarks?
IIRC on lots of different workloads that run code on both virtual
CPUs at the same time. Without it you would get L1 cache thrashing,
which can slow things down quite a lot.
And yes, it made a real difference. The P4 caches have some peculiarities
("64K aliasing") that made the problem worse.
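
To make the aliasing concrete, here is a rough sketch in C. It assumes a simplified model in which two addresses conflict in L1 when they fall on the same 64-byte line modulo 64 KB. The helper names, the line size, and the pid-based offset (64 slots of 128 bytes) are illustrative assumptions, not the actual kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64u       /* assumed L1 line size in bytes */
#define ALIAS_SPAN 0x10000u  /* 64 KB, assumed aliasing period */

/*
 * Simplified model: two addresses "alias" when they select the same
 * cache line modulo 64 KB, i.e. address bits 6..15 are identical.
 */
static inline bool aliases_64k(uintptr_t a, uintptr_t b)
{
    uintptr_t mask = (uintptr_t)(ALIAS_SPAN - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    return ((a ^ b) & mask) == 0;
}

/*
 * Hypothetical per-process stack offset: move the initial stack
 * pointer down by one of 64 slots of 128 bytes, keyed on the pid,
 * then round down to keep 16-byte alignment.
 */
static inline uintptr_t offset_stack(uintptr_t sp, unsigned pid)
{
    sp -= (uintptr_t)(pid % 64u) << 7;
    return sp & ~(uintptr_t)0xf;
}
```

Under this model, two stacks whose tops sit a multiple of 64 KB apart alias, while the same two stacks offset for pids 1 and 2 no longer do. Two pids that are equal modulo 64 still get the same slot, so the scheme only reduces the chance of a conflict between sibling threads.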