Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

From: David Hildenbrand
Date: Wed Jan 31 2024 - 09:30:13 EST

In reply to: Ryan Roberts: "Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP"
Next in thread: Ryan Roberts: "Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP"

Note that regarding NUMA effects, I mean when some memory access within the same
socket is faster/slower even with only a single node. On AMD EPYC that's
possible, depending on which core you are running and on which memory controller
the memory you want to access is located. If both are in different quadrants
IIUC, the access latency will be different.

I've configured the NUMA to only bring the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary because it's so much slower than the M2. Let me move
over to it and see if everything looks more straightforward there.

Better to use a system that people will actually run Linux production workloads on, even if it is slower :)

[...]

I'll continue to mess around with it until the end of the day. But if I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on
that.

You should likely not focus on M2 results. Just pick a representative bare metal
machine where you get consistent, explainable results.

Nothing in the code is fine-tuned for a particular architecture so far, only
order-0 handling is kept separate.

BTW: I see the exact same speedups for dontneed that I see for munmap. For
example, for order-9, it goes from 0.023412s -> 0.009785s, so -58%. So I'm
curious why you see a speedup for munmap but not for dontneed.
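
For reference, the -58% figure is the relative change in runtime between the two order-9 timings. A minimal sketch of that arithmetic follows; the helper name relative_change is hypothetical and is not part of any benchmark tooling mentioned in this thread.

```python
def relative_change(before: float, after: float) -> float:
    """Return (after - before) / before; a negative value means less time taken."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before


# Order-9 timings quoted above, in seconds.
before_s = 0.023412
after_s = 0.009785

print(f"{relative_change(before_s, after_s):+.0%}")  # prints -58%
```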

Ugh... ok, coming up.

Hopefully you were just staring at the wrong numbers (e.g., only with fork patches). Because both (munmap/pte-dontneed) are using the exact same code path.

--
Cheers,

David / dhildenb
