Re: [RFC PATCH v2 0/9] mm: support zswap-backed large folio swapin

From: Fujunjie

Date: Sun May 31 2026 - 08:33:35 EST

On 5/30/2026 2:06 AM, Nhat Pham wrote:
> On Fri, May 29, 2026 at 5:17 AM fujunjie <fujunjie1@xxxxxx> wrote:
>>
>> Hi,
>>
>> This RFC explores large-folio swapin for ranges that are still fully backed
>> by zswap.
>>
>> Large swapin is currently disabled once zswap is in the picture. Anonymous
>> faults stop considering large orders after zswap has ever been enabled,
>> shmem does the same, and zswap_load() refuses large swapcache folios. That
>> keeps mixed zswap/disk cases safe, but it also loses the dense case where
>> every slot in an aligned 64K range is still resident in zswap.
>>
>> The series keeps the policy in common swapin code:
>>
>> - zswap reports backend facts and provides the large-folio load helper.
>> - swapin_sync() filters candidate orders by backend range.
>> - all-disk and zeromap ranges keep the existing Kairui large-swapin path.
>> - mixed zswap/disk ranges stay order-0.
>> - all-zswap ranges may use a 64K folio after locality admission.
>> - anon provides locality evidence from VMA hints and PTE young density.
>> - shmem starts with explicit VMA-hint evidence only.
>> - swap readahead uses its existing VMA/cluster window as locality
>> evidence; it does not also run the anon PTE-young rule.
>>
>> The backend range probe is only a snapshot. If the backend changes after a
>> fresh large swapcache folio is allocated, the common path drops that folio
>> and falls back to order-0. zswap_load() can also return -EAGAIN for the
>> same retry path. If a late fault retry keeps the large folio in swapcache
>> instead of deleting it, the cgroup v1 memsw swap owner is committed before
>> returning.
>>
>> This is mTHP/large-folio swapin. The mappings installed by do_swap_page()
>> are still PTE mappings, not PMD mappings. The expected win is fewer faults,
>> batched PTE/rmap work, and preserving the large folio across zswapin
>> instead of rebuilding the working set as order-0 pages.
>>
>> Prior art: Usama Arif posted a related RFC on 2024-10-18:
>>
>> mm: zswap: add support for zswapin of large folios
>> https://lore.kernel.org/linux-mm/20241018105026.2521366-1-usamaarif642@xxxxxxxxx/
>>
>> This RFC keeps the same broad goal, but moves admission into common swapin
>> code. zswap does not decide the policy. Mixed zswap/disk ranges are
>> rejected before large IO, and the first cap is 64K.
>>
>> This is a rewrite of the RFC posted on 2026-05-08:
>>
>> [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin
>> https://lore.kernel.org/linux-mm/tencent_8B437BE4F586C162950BF71954316C1EDB05@xxxxxx/
>>
>> The v1 series was anonymous-only and kept too much of the policy near the
>> anon fault and zswap paths. This version is rebuilt on top of Kairui Song's
>> common swapin infrastructure. It keeps admission in common swapin code,
>> rejects mixed zswap/disk large ranges, and adds separate locality producers
>> for anon, shmem and swap readahead.
>>
>> Performance and behavior
>> ========================
>>
>> The A/B tables are 10-run measurements. Elapsed values are seconds,
>> shown as mean +/- sample standard deviation. "phase" or "refault" is the
>> measured refault subphase. "zswpin" counts zswap loads. "pswpin" counts
>> swap-ins from the real swap device; pswpin=0 means the refaults were served
>> by zswap even when a disk swap device was configured. "RFC 64K" is the mean
>> number of successful 64K swapins.
>>
>> The numbers below show where the large path is used and where it is
>> rejected.
>>
>> zram-backed zswap microbench, 64K mTHP, 8G guest:
>>
>> +-----------------+----------------+----------------+--------+--------+--------+----------+
>> | workload        | base elapsed   | RFC elapsed    | delta  | phase  | zswpin | RFC 64K  |
>> +-----------------+----------------+----------------+--------+--------+--------+----------+
>> | usama_1g        | 11.260+/-0.235 | 10.301+/-0.140 |  -8.5% | -22.2% | 1.000x |  16381.1 |
>> | nohint_seq64    |  4.398+/-0.085 |  4.025+/-0.022 |  -8.5% | -21.1% | 1.000x |   6221.1 |
>> | seqhint_seq64   |  4.283+/-0.060 |  3.948+/-0.062 |  -7.8% | -20.6% | 1.000x |   6223.5 |
>> | stride64_sparse |  3.095+/-0.051 |  3.086+/-0.025 |  -0.3% |  +5.8% | 1.002x |      1.0 |
>> | random64_sparse |  3.095+/-0.046 |  3.076+/-0.016 |  -0.6% |  +0.7% | 1.001x |      0.0 |
>> | random64_full   |  4.423+/-0.067 |  4.405+/-0.018 |  -0.4% |  +0.1% | 1.000x |      0.0 |
>> +-----------------+----------------+----------------+--------+--------+--------+----------+
>>
>> The usama_1g row follows the shape of the 2024 RFC benchmark: allocate 1G,
>> fill it with compressible per-page data, reclaim it through memory.reclaim,
>> then time the full integrity-check refault. The seq64 rows use a 512M
>> target and 768M of pressure. "sparse" touches one 4K page per 64K region, while
>> "full" touches every 4K page. "seqhint" uses MADV_SEQUENTIAL; "nohint" does
>> not.
>>
>> Virtio-block swap device present, zswap enabled:
>>
>> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
>> | workload        | base elapsed  | RFC elapsed   | delta  | refault | pswpin | zswpin | RFC 64K |
>> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
>> | seq64           | 4.399+/-0.100 | 4.279+/-0.216 |  -2.7% |  -10.5% |      0 | 1.000x |  3110.7 |
>> | stride64_sparse | 3.103+/-0.047 | 3.119+/-0.086 |  +0.5% |   +6.2% |      0 | 0.999x |     0.0 |
>> | random64_sparse | 3.142+/-0.112 | 3.097+/-0.030 |  -1.4% |   -2.2% |      0 | 0.999x |     0.1 |
>> | random64_full   | 4.473+/-0.147 | 4.445+/-0.088 |  -0.6% |   +0.9% |      0 | 1.000x |     0.4 |
>> +-----------------+---------------+---------------+--------+---------+--------+--------+---------+
>>
>> This run uses a real block swap device, but the refaulted data stayed in
>> zswap. It covers the all-zswap hit path with disk swap configured, not disk
>> read IO.
>>
>> Virtio-block pressure/mixed run, zswap max_pool_percent=1,
>> low-compressibility full fill:
>>
>> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
>> | workload                      | base elapsed  | RFC elapsed   | delta  | refault | pswpin base/RFC | RFC zswpin | RFC 64K | fallback |
>> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
>> | seq64_full_pressure           | 5.908+/-0.195 | 5.790+/-0.235 |  -2.0% |   +3.0% |     90258/99038 |      20327 |     0.0 |     3730 |
>> | random64_sparse_full_pressure | 5.104+/-0.069 | 5.068+/-0.090 |  -0.7% |   -9.1% |       6201/6196 |       1297 |     0.0 |        0 |
>> +-------------------------------+---------------+---------------+--------+---------+-----------------+------------+---------+----------+
>>
>> This run reaches the disk-backed path: pswpin is non-zero in both base and
>> RFC. It is mainly fallback coverage. The RFC does not install 64K folios
>> for these disk/mixed-heavy ranges.
>
> OK, these results above look good. Basically, if we don't have spatial
> locality in access patterns, we don't do THP zswapin. Nice.
>
>>
>> Policy matrix, virtio-block swap device present:
>>
>> +------------------------------+----+------+--------+--------+-------+----------+
>> | case                         | pc | hint | pswpin | zswpin | zswpwb| 64K in   |
>> +------------------------------+----+------+--------+--------+-------+----------+
>> | pc0_seq                      |  0 | none |      0 |  99559 |     0 |        0 |
>> | pc3_seq                      |  3 | none |      0 |  99498 |     0 |        0 |
>> | pc4_seq                      |  4 | none |      0 |  99512 |     0 |     3109 |
>> | pc5_seq                      |  5 | none |      0 |  99657 |     0 |     3113 |
>> | hint_none_random_sparse      |  5 | none |      0 |   6265 |     0 |        0 |
>> | hint_random_seq              |  5 | rand |      0 |  99488 |     0 |        0 |
>> | mixed_seq_full               |  5 | none |  97725 |  20147 |    84 |      569 |
>> | mixed_random_sparse_full     |  5 | none |   6230 |   1302 |     0 |        0 |
>> +------------------------------+----+------+--------+--------+-------+----------+
>>
>> The pc rows show the readahead-window gate. The hint_random_seq row shows
>> the explicit random hint veto. The mixed rows use a small zswap pool to
>> force disk/zswap split backing; most mixed ranges are rejected, while any
>> remaining 64K successes were all-zswap at the time of the fault.
>>
>> Kbuild pressure, zram swap, 384M memcg:
>>
>> +----------------------+----------+----------+--------+--------+----------+
>> | setup                | base     | RFC      | delta  | zswpin | RFC 64K  |
>> +----------------------+----------+----------+--------+--------+----------+
>> | zram swap, 384M memcg| 2060.323 | 2047.516 |  -0.6% | 0.991x |     2797 |
>> +----------------------+----------+----------+--------+--------+----------+
>>
>> This is a single-run zram pressure smoke. It did not show Kbuild
>> regression, and the RFC run installed 64K zswap-backed folios. The result
>> should not be read as a tuned-performance claim.
>>
>> Kbuild pressure, virtio-block swap device, 512M memcg:
>>
>> +-------------------------+----------+----------+--------+--------+----------+
>> | setup                   | base     | RFC      | delta  | pswpin | RFC 64K  |
>> +-------------------------+----------+----------+--------+--------+----------+
>> | disk swap, 512M memcg   | 1420.671 | 1409.263 |  -0.8% |      0 |     7497 |
>> +-------------------------+----------+----------+--------+--------+----------+
>>
>> This is a single-run pressure smoke. The disk-swap Kbuild run also stayed
>> on the all-zswap hit path, so it is pressure coverage with a disk swap device
>> present rather than proof of disk-read large swapin.
>
> Why a single-run?

I did run Kbuild a few times while debugging the series and did not see a
significant difference either way. Because of that I only kept one fresh run
with the final tree before sending the RFC, so this should be read only as a
smoke test, not as performance evidence.

For the next version I will rerun Kbuild properly with multiple fresh
iterations and report the results, so they can serve as a more reliable
performance comparison instead of just smoke coverage.
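
A minimal sketch of how several fresh runs could be reduced to the same
"mean +/- sample standard deviation" form used in the A/B tables. This is
illustrative only: struct run_summary, summarize_runs() and approx_sqrt()
are hypothetical names that are not part of the series, and the square
root is a fixed number of Newton steps purely so the example does not
depend on libm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one benchmark configuration. */
struct run_summary {
	double mean;	/* arithmetic mean of the elapsed times */
	double stddev;	/* sample standard deviation (n - 1 divisor) */
};

/* Newton iteration for sqrt(v); returns 0 for non-positive input. */
static double approx_sqrt(double v)
{
	double x = v > 1.0 ? v : 1.0;

	if (v <= 0.0)
		return 0.0;
	for (int i = 0; i < 60; i++)
		x = 0.5 * (x + v / x);
	return x;
}

/* Welford's single-pass update of the mean and squared deviations. */
static struct run_summary summarize_runs(const double *elapsed, size_t n)
{
	struct run_summary s = { 0.0, 0.0 };
	double m2 = 0.0;

	assert(n >= 2);	/* a sample deviation needs at least two runs */
	for (size_t i = 0; i < n; i++) {
		double d = elapsed[i] - s.mean;

		s.mean += d / (double)(i + 1);
		m2 += d * (elapsed[i] - s.mean);
	}
	s.stddev = approx_sqrt(m2 / (double)(n - 1));
	return s;
}
```

Feeding the per-run elapsed seconds for base and RFC through
summarize_runs() would give values that can be compared directly, and the
spread shows whether a sub-1% delta is distinguishable from noise.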

>
>>
>> Shmem smoke, tmpfs huge=always, 64K shmem mTHP:
>>
>> +----------------------------+---------------+---------+-------------+----------+
>> | case                       | refault hint  | touched | 64K shmem   | 64K in   |
>> +----------------------------+---------------+---------+-------------+----------+
>> | nohint_seq                 | none          |   65536 |        4096 |        0 |
>> | seq_refault_hint           | sequential    |   65536 |        4096 |     4096 |
>> | random_refault_hint_sparse | random        |    4096 |        4096 |        0 |
>> +----------------------------+---------------+---------+-------------+----------+
>>
>> That matches the current shmem producer: explicit sequential refault hints
>> allow large zswap swapin; no hint and random hints do not.
>>
>> What this RFC does not establish
>> ================================
>>
>> The 64K cap is deliberate, but it is not tuned. The anon PTE-young rule is
>> only anon evidence. Shmem has the framework and explicit VMA hints in this
>> RFC, not a page-cache locality producer. For larger orders, the anon
>> producer should probably use bounded sampling instead of walking every PTE
>> in a 1M or larger candidate range. The mixed-backend tests cover fallback
>> behavior, but this series does not add mixed zswap/disk large IO.
>
> The mixed IO can be deferred, but I think we should figure out a rule
> to extend this hint to arbitrarily sized ranges, and preferably shmem
> too.

That makes sense.

The current 64K cap was intentionally conservative, but the locality rule is
too closely tied to that size. For v3 I will look at making the admission
rule order-independent, probably with bounded sampling rather than walking
every PTE for larger ranges.
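
A rough user-space model of what bounded sampling could look like, only to
make the idea concrete. Nothing below is from the series: SAMPLE_BUDGET,
struct density, sample_density() and admit_large_range() are hypothetical
names, the bool array stands in for whatever per-entry locality evidence a
producer would read, and the num/den threshold is an arbitrary placeholder
rather than a tuned value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap on entries inspected per candidate range. */
#define SAMPLE_BUDGET 16u

struct density {
	unsigned int hits;	/* sampled entries that carried evidence */
	unsigned int sampled;	/* entries actually inspected */
};

/*
 * Inspect at most SAMPLE_BUDGET evenly spaced entries of a 2^order range.
 * Both the range size and the budget are powers of two, so the stride
 * divides the range exactly and the cost is bounded for any order.
 */
static struct density sample_density(const bool *evidence, unsigned int order)
{
	size_t nr = (size_t)1 << order;
	size_t nsamples = nr < SAMPLE_BUDGET ? nr : SAMPLE_BUDGET;
	size_t stride = nr / nsamples;
	struct density d = { 0, 0 };

	for (size_t i = 0; i < nsamples; i++) {
		if (evidence[i * stride])
			d.hits++;
		d.sampled++;
	}
	return d;
}

/* Admit when hits / sampled >= num / den, using integer arithmetic. */
static bool admit_large_range(const bool *evidence, unsigned int order,
			      unsigned int num, unsigned int den)
{
	struct density d = sample_density(evidence, order);

	return d.hits * den >= d.sampled * num;
}
```

With this shape an order-4 (64K) range is still examined in full, while an
order-8 (1M) range is examined at 16 points, so the same ratio applies at
every order without the walk growing with the range.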

For shmem, this RFC only uses explicit VMA hints, so it does not yet have a
real page-cache locality producer. I will think through how to add a shmem
producer with similar semantics, so the rule is not anon-only.

Thanks,
Fujunjie