Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

From: Baolin Wang
Date: Fri May 31 2024 - 06:13:28 EST




On 2024/5/31 17:35, David Hildenbrand wrote:
On 30.05.24 04:04, Baolin Wang wrote:
Anonymous pages already support multi-size THP (mTHP) allocation
through commit 19eaf44954df, which allows THP to be configured through the
sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.

However, anonymous shmem ignores the anonymous mTHP rule configured
through the sysfs interface and can only use PMD-mapped THP, which is not
reasonable. Many applications implement anonymous page sharing through mmap(MAP_SHARED |
MAP_ANONYMOUS), especially in database usage scenarios. Users therefore expect
a unified mTHP strategy for anonymous pages, including
anonymous shared pages, in order to enjoy the benefits of mTHP: for example,
lower latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP, and
contiguous PTEs on ARM architecture to reduce TLB misses.
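
For reference, the kind of mapping discussed above can be sketched from user space as follows. This is only an illustration in Python, not code from the patchset; the function name and the 2 MiB default length are assumptions chosen for the example.

```python
import mmap
import os


def share_via_anon_shmem(payload: bytes, length: int = 2 * 1024 * 1024) -> bytes:
    """Write in a child process, read back in the parent, through one mapping.

    The mapping is created with MAP_SHARED | MAP_ANONYMOUS, so the kernel
    backs it with anonymous shmem rather than private anonymous memory.
    """
    buf = mmap.mmap(
        -1,
        length,
        flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    pid = os.fork()
    if pid == 0:
        # Child: the store is visible to the parent because the mapping is shared.
        buf[: len(payload)] = payload
        os._exit(0)
    os.waitpid(pid, 0)
    data = bytes(buf[: len(payload)])
    buf.close()
    return data
```

Whether the faults on such a mapping are served with large folios depends on the shmem_enabled settings, which is what this series makes configurable per size.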

The primary strategy is similar to supporting anonymous mTHP. Introduce
a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
which accepts all the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', plus a new
"inherit" option. By default all sizes are set to "never"
except PMD size, which is set to "inherit". This ensures backward compatibility
with the top-level anonymous shmem setting, while also allowing
independent control of anonymous shmem for each mTHP size.
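
A minimal sketch of how the per-size settings could be inspected from user space, assuming the knobs follow the usual THP sysfs convention of marking the active value with square brackets. The helper names are hypothetical, and the glob pattern "hugepage-*" is an assumption about the per-size directory names.

```python
import re
from pathlib import Path

# Top-level THP directory as named in the cover letter.
THP_ROOT = Path("/sys/kernel/mm/transparent_hugepage")


def parse_selected(text):
    """Return the bracketed (active) value of a sysfs toggle string, or None."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_shmem_enabled(root=THP_ROOT):
    """Map each per-size directory name to its active shmem_enabled value.

    Returns an empty dict when the directory or the knobs do not exist,
    for example on a kernel without this series.
    """
    result = {}
    if not root.is_dir():
        return result
    for knob in sorted(root.glob("hugepage-*/shmem_enabled")):
        try:
            result[knob.parent.name] = parse_selected(knob.read_text())
        except OSError:
            continue  # knob not readable; skip it
    return result
```

With the defaults described above, one would expect every size to report "never" except the PMD size, which would report "inherit".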

I used the page fault latency tool to measure the performance of 1G anonymous shmem
with 32 threads on my machine (ARM64 architecture, 32 cores,
125G memory):
base: mm-unstable
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.04s        3.10s         83516.416                  2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.02s        3.14s         82936.359                  2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.08s        0.31s         678630.231                 17082522.495

From the data above, the patchset has minimal impact when
mTHP is not enabled (some fluctuation was observed during testing). With 64K
mTHP enabled, page fault latency improves significantly.
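
To put numbers on that, the ratios can be worked out directly from the three result rows above. The snippet only does arithmetic on the reported figures; the variable names are made up for the example.

```python
# Figures copied from the three result tables above.
base = {"sys_time": 3.10, "faults_per_sec": 2669684.890}
disabled = {"sys_time": 3.14, "faults_per_sec": 2630746.027}
mthp_64k = {"sys_time": 0.31, "faults_per_sec": 17082522.495}

# Throughput gain with 64K mTHP enabled, relative to mm-unstable.
throughput_gain = mthp_64k["faults_per_sec"] / base["faults_per_sec"]

# Relative throughput drop with the patchset applied but mTHP disabled.
disabled_drop = 1 - disabled["faults_per_sec"] / base["faults_per_sec"]

# Reduction in system time with 64K mTHP enabled.
sys_time_ratio = base["sys_time"] / mthp_64k["sys_time"]

print(f"64K mTHP throughput gain: {throughput_gain:.2f}x")
print(f"mTHP disabled, throughput drop: {disabled_drop:.2%}")
print(f"sys time reduction: {sys_time_ratio:.1f}x")
```

That is roughly a 6.4x gain in faults per second and a 10x drop in system time with 64K mTHP, against a drop of about 1.5% with mTHP disabled.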

Let me summarize the takeaway from the bi-weekly MM meeting as I understood it, that includes Hugh's feedback on per-block tracking vs. mTHP:

Thanks David for the summarization.

(1) Per-block tracking

Per-block tracking is currently considered unwarranted complexity in shmem.c. We should try to get it done without that. For any test cases that fail, we should consider if they are actually valid for shmem.

To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
is not possible at fallocate() time, detecting zeropages later and
retrying to split+free might be an option, without per-block tracking.

(2) mTHP controls

As a default, we should not be using large folios / mTHP for any shmem, just like we did with THP via shmem_enabled. This is what this series currently does, and is part of the whole mTHP user-space interface design.

Further, the mTHP controls should control all of shmem, not only "anonymous shmem".

Yes, that's what I thought and in my TODO list.


Also, we should properly fallback within the configured sizes, and not jump "over" configured sizes. Unless there is a good reason.
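
One way to read the fallback rule is as an ordered list of candidate sizes: try the largest enabled size that fits, then each smaller enabled size in turn, and only then a base page. The sketch below is a user-space illustration of that ordering, not kernel code; the function name, the sizes in kB, and the 4 kB base page are assumptions.

```python
def fallback_sizes(enabled_kb, largest_fit_kb, base_page_kb=4):
    """Return the sizes to try, largest first, ending with the base page.

    enabled_kb:     set of mTHP sizes (in kB) enabled by the admin
    largest_fit_kb: largest size (in kB) that fits at the faulting address
    """
    candidates = sorted(
        (size for size in enabled_kb if base_page_kb < size <= largest_fit_kb),
        reverse=True,
    )
    # No enabled size is skipped on the way down.
    return candidates + [base_page_kb]


print(fallback_sizes({64, 256, 2048}, 2048))  # [2048, 256, 64, 4]
print(fallback_sizes({64, 256, 2048}, 512))   # [256, 64, 4]
```

Sizes that are not enabled never appear in the list, so a failed 2048 kB allocation falls back to 256 kB here rather than going straight to a base page.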

(3) khugepaged

khugepaged needs to handle larger folios properly as well. Until fixed, using smaller THP sizes as fallback might prohibit collapsing a PMD-sized THP later. But really, khugepaged needs to be fixed to handle that.

(4) force/disable

These settings are rather testing artifacts from the old ages. We should not add them to the per-size toggles. We might "inherit" it from the global one, though.

Sorry, I missed this. So I should remove the 'force' and 'deny' options for each mTHP, right?
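
To make sure we mean the same thing, here is a sketch of one possible reading: the per-size toggles would no longer accept "force" or "deny" themselves, but a size set to "inherit" would still pick them up from the global setting. This is an assumption about the outcome, not settled behaviour; the set of per-size values and the function name are made up for the illustration, and the sketch ignores any overriding semantics the global "deny" and "force" may have.

```python
# Values accepted by the top-level shmem_enabled knob.
GLOBAL_VALUES = {"always", "within_size", "advise", "never", "deny", "force"}

# Assumed per-size values once "force" and "deny" are dropped.
PER_SIZE_VALUES = {"always", "inherit", "within_size", "advise", "never"}


def effective_policy(per_size, global_value):
    """Resolve the policy for one mTHP size, following "inherit" if set."""
    if global_value not in GLOBAL_VALUES:
        raise ValueError(f"unknown global value: {global_value!r}")
    if per_size not in PER_SIZE_VALUES:
        raise ValueError(f"not accepted per size: {per_size!r}")
    return global_value if per_size == "inherit" else per_size


print(effective_policy("inherit", "force"))  # force
print(effective_policy("never", "always"))   # never
```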


"within_size" might have value, and especially for consistency, we should have them per size.



So, this series only tackles anonymous shmem, which is a good starting point. Ideally, we'd get support for other shmem (especially during fault time) soon afterwards, because we won't be adding separate toggles for that from the interface POV, and having inconsistent behavior between kernel versions would be a bit unfortunate.


@Baolin, this series likely does not consider (4) yet. And I suggest we have to take a lot of the "anonymous thp" terminology out of this series, especially when it comes to documentation.

Sure. I will remove the "anonymous thp" terminology from the documentation, but I still want to keep it in the commit message, because I want to start from anonymous shmem.


@Daniel, Pankaj, what are your plans regarding that? It would be great if we could get an understanding on the next steps on !anon shmem.