Re: [RFC PATCH v1 0/2] Swap-out small-sized THP without splitting

From: Ryan Roberts
Date: Fri Oct 13 2023 - 12:32:18 EST


On 11/10/2023 07:37, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>
> [...]
>
>> Finally on testing, I've run the mm selftests and see no regressions, but I
>> don't think there is anything in there specifically aimed towards swap? Are
>> there any functional or performance tests that I should run? It would certainly
>> be good to confirm I haven't regressed PMD-size THP swap performance.
>
> I have used the swap sub-test case of vm-scalability to test.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

I ended up using `usemem`, which is the core of this test suite, but deviated
from the pre-canned test case so that I could use anonymous memory and get
numbers for small-sized THP. (This is a very useful tool; thanks for pointing
it out!)

I ran the tests on an Ampere Altra, set up with a 35G block RAM device as the
swap device, from inside a memcg limited to 40G of memory. I then ran `usemem`
with 70 processes (each on its own core), each allocating and writing 1G of
memory. I repeated everything 5 times and took the mean and stdev:


Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |    remove-huge-flag | swap-file-small-thp |
|            |  v6.6-rc4+anonfolio |           + patch 1 |           + patch 2 |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                0.0% |                2.3% |                9.1% |
| 64K THP    |              -44.1% |              -46.3% |               30.6% |
| 2M THP     |               56.0% |               54.2% |               60.1% |


Standard Deviation as Percentage of Mean

| alloc size |            baseline |    remove-huge-flag | swap-file-small-thp |
|            |  v6.6-rc4+anonfolio |           + patch 1 |           + patch 2 |
|:-----------|--------------------:|--------------------:|--------------------:|
| 4K Page    |                3.4% |                7.1% |                1.7% |
| 64K THP    |                1.9% |                5.6% |                7.7% |
| 2M THP     |                1.9% |                2.1% |                3.2% |
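
For anyone wanting to reproduce the two tables from raw runs, the metrics can be
computed roughly as below. This is a minimal sketch, not the script that was
actually used: the function names and the sample throughput values are
hypothetical placeholders, and it assumes that higher throughput is better, that
improvement is taken relative to the mean of the 4K/baseline runs, and that
stdev is the sample standard deviation.

```python
from statistics import mean, stdev


def improvement_pct(samples, baseline_samples):
    """Percent change of mean(samples) relative to mean(baseline_samples)."""
    base = mean(baseline_samples)
    return (mean(samples) - base) / base * 100.0


def stdev_pct_of_mean(samples):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(samples) / mean(samples) * 100.0


# Hypothetical throughput samples from 5 repeats (NOT the measured data).
baseline_4k = [100.0, 102.0, 98.0, 101.0, 99.0]
thp_64k = [55.0, 56.0, 57.0, 55.0, 57.0]

print(f"64K THP vs 4K/baseline: {improvement_pct(thp_64k, baseline_4k):.1f}%")
print(f"4K stdev as % of mean:  {stdev_pct_of_mean(baseline_4k):.1f}%")
```

With a spread of a few percent per configuration, differences of a similar size
between two columns are hard to tell apart from run-to-run noise.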


I don't see any meaningful performance cost to removing the HUGE flag, so
hopefully this gives us confidence to move forward with patch 1.

You can indeed see the performance regression in the baseline when THP is
configured to allocate small-sized THP only (in this case 64K). And you can see
that the regression is fixed by patch 2, which avoids splitting the THP and thus
avoids the extra TLBIs. This correlates with what I saw in the kernel
compilation workload.

Huang Ying, based on these results, do you still want me to pursue a per-cpu
solution to avoid potential contention on the swap info lock? If so, I proposed
in the thread against patch 2 to do this in the swap_slots layer rather than in
swapfile.c directly (I'm not sure how your original proposal would actually
work?). But based on these results, it's not obvious to me that there is a
definite problem here, and it might be simpler to avoid the complexity?

Thanks,
Ryan

>
> --
> Best Regards,
> Huang, Ying