On 31/01/2024 14:29, David Hildenbrand wrote:
Note that regarding NUMA effects, I mean cases where memory access within the
same socket is faster/slower even with only a single node. On AMD EPYC that's
possible, depending on which core you are running on and which memory
controller the memory you want to access sits behind. IIUC, if the two are in
different quadrants, the access latency will be different.
I've configured NUMA to bring only the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary system because it's so much slower than the M2. Let me
move over to it and see if everything looks more straightforward there.
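
(FWIW, I did that by onlining only one socket's CPUs and RAM at boot. A rough
run-time approximation with libnuma would look like the sketch below - node 0
and the exact calls are illustrative, not what I actually ran. And as you say,
it wouldn't help with intra-socket quadrant effects, since those exist within
a single node.)

/*
 * Sketch only: pin execution and allocation to one NUMA node so that
 * cross-node placement can't skew the timings. Assumes libnuma
 * (build with -lnuma); node 0 is just an example.
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	struct bitmask *nodes;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not supported\n");
		return 1;
	}

	/* Run only on node 0's CPUs... */
	numa_run_on_node(0);

	/* ...and satisfy all allocations from node 0's memory. */
	nodes = numa_parse_nodestring("0");
	numa_set_membind(nodes);
	numa_bitmask_free(nodes);

	/* ... benchmark body would go here ... */
	return 0;
}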
Better to use a system that people will actually run Linux production
workloads on, even if it is slower :)
[...]
I'll continue to mess around with it until the end of the day. But if I'm not
making any headway by then, I'll change tack: I'll just measure the performance
of my contpte changes using your fork/zap stuff as the baseline and post based
on that.
You should likely not focus on M2 results. Just pick a representative
bare-metal machine where you get consistent, explainable results.
Nothing in the code is fine-tuned for a particular architecture so far; only
order-0 handling is kept separate.
BTW: I see the exact same speedups for dontneed that I see for munmap. For
example, for order-9, it goes from 0.023412s -> 0.009785s, so -58%. So I'm
curious why you see a speedup for munmap but not for dontneed.
Ugh... ok, coming up.
Hopefully you were just staring at the wrong numbers (e.g., only with the fork
patches), because both (munmap/pte-dontneed) use the exact same code path.
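
(Roughly, as I understand the current code, both paths funnel into
zap_pte_range():

  madvise(MADV_DONTNEED)
    -> madvise_dontneed_single_vma()
      -> zap_page_range_single()
        -> unmap_single_vma() -> ... -> zap_pte_range()

  munmap()
    -> unmap_region()
      -> unmap_vmas()
        -> unmap_single_vma() -> ... -> zap_pte_range()

so any batching there should benefit both equally.)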
Ahh... I'm doing pte-dontneed, which is the only option in your original
benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
benchmark has an additional "dontneed" option that covers the whole range in
one shot (see the sketch below). Which option are you running? Assuming the
latter, I think that explains it: a per-page madvise() only ever gives the
kernel a single PTE to zap per call, so there's nothing to batch.
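
For concreteness, a minimal sketch of the two variants as I understand them
(not the actual benchmark code; NR_PAGES and the fault-in memsets are just
illustrative):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_PAGES 512 /* one order-9 block on a 4K-page system */

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)NR_PAGES * psz;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, len); /* fault the pages in */

	/* "pte-dontneed": one syscall per page, one PTE zapped per call */
	for (size_t i = 0; i < NR_PAGES; i++)
		madvise(buf + i * psz, psz, MADV_DONTNEED);

	memset(buf, 1, len); /* fault the pages back in */

	/* "dontneed": one syscall for the whole range; the kernel can
	 * zap the 512 PTEs in batches */
	madvise(buf, len, MADV_DONTNEED);

	munmap(buf, len);
	return 0;
}

With the one-shot variant the kernel sees all 512 PTEs in a single zap and can
batch them; with the per-page variant each call zaps exactly one PTE, so the
batching never kicks in.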