[RFC PATCH 0/2] mm: filemap: add filemap_grab_folios
From: Nikita Kalyazin
Date: Fri Jan 10 2025 - 10:47:25 EST
Based on David's suggestion for speeding up guest_memfd memory
population [1] made at the guest_memfd upstream call on 5 Dec 2024 [2],
this series adds `filemap_grab_folios`, which grabs multiple folios at a time.
Motivation
When profiling guest_memfd population and comparing the results with
population of anonymous memory via UFFDIO_COPY, I observed that the
former was up to 20% slower, mainly due to adding newly allocated pages
to the pagecache. As far as I can see, the two main contributors to it
are pagecache locking and tree traversals needed for every folio. The
RFC attempts to partially mitigate those by adding multiple folios at a
time to the pagecache.
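As a rough illustration of the locking side of this argument only, here is a
userspace toy model (not kernel code; every name in it is hypothetical) that
counts lock round trips when entries are added one by one versus in batches.
It does not model the tree traversal cost, and the batch size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: counts lock round trips instead of taking a real lock. */
struct toy_mapping {
    unsigned long lock_acquisitions;
    unsigned long nr_entries;
};

static void toy_lock(struct toy_mapping *m)   { m->lock_acquisitions++; }
static void toy_unlock(struct toy_mapping *m) { (void)m; }

/* One lock round trip per entry, as when folios are added one by one. */
static void toy_add_one_by_one(struct toy_mapping *m, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        toy_lock(m);
        m->nr_entries++;
        toy_unlock(m);
    }
}

/* One lock round trip per batch of up to batch_size entries. */
static void toy_add_batched(struct toy_mapping *m, size_t nr, size_t batch_size)
{
    assert(batch_size > 0);
    for (size_t done = 0; done < nr; ) {
        size_t n = nr - done < batch_size ? nr - done : batch_size;

        toy_lock(m);
        m->nr_entries += n;
        toy_unlock(m);
        done += n;
    }
}
```

In this model, adding N entries costs N lock round trips one by one and
ceil(N / batch_size) round trips when batched.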
Testing
With the change applied, I was able to observe a 10.3% (708 to 635 ms)
speedup in a selftest that populated 3GiB guest_memfd and a 9.5% (990 to
904 ms) speedup when restoring a 3GiB guest_memfd VM snapshot using a
custom Firecracker version, both on Intel Ice Lake.
Limitations
While `filemap_grab_folio` handles THP/large folios internally and
deals with reclaim artifacts in the pagecache (shadows), the RFC
supports neither, for simplicity: it demonstrates the optimisation as
applied to guest_memfd, which only uses small folios and does not
support reclaim at the moment.
Implementation
I am aware of existing filemap APIs operating on folio batches, however
I was not able to find one for the use case in question. I was also
thinking about making use of the `folio_batch` struct, but was not able
to convince myself that it was useful. Instead, a plain array of folio
pointers is allocated on stack and passed down the callchain. A bitmap
is used to keep track of indexes whose folios were already present in
the pagecache to prevent allocations. This does not look very clean to
me and I am more than open to hearing about better approaches.
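The array-plus-bitmap bookkeeping described above can be sketched in
userspace as follows. This is a toy model under stated assumptions, not the
code from the patch: `toy_folio`, `toy_cache`, `toy_grab_batch` and the batch
limit of 64 are all hypothetical, a flat array stands in for the pagecache,
and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BATCH_MAX   64    /* hypothetical limit: one 64-bit word serves as the bitmap */
#define CACHE_SLOTS 1024  /* toy stand-in for the pagecache */

struct toy_folio { size_t index; };

static struct toy_folio *toy_cache[CACHE_SLOTS];

/*
 * Fill out[0..nr) with the entries for indexes start..start+nr-1.
 * Pass 1 looks every index up and records hits in the *present bitmap.
 * Pass 2 allocates and inserts only for the misses.
 * Returns nr on success, or the position of the first failed allocation.
 */
static size_t toy_grab_batch(size_t start, size_t nr,
                             struct toy_folio **out, uint64_t *present)
{
    size_t i;

    assert(nr <= BATCH_MAX && start + nr <= CACHE_SLOTS);
    *present = 0;

    for (i = 0; i < nr; i++) {
        out[i] = toy_cache[start + i];
        if (out[i])
            *present |= UINT64_C(1) << i;
    }

    for (i = 0; i < nr; i++) {
        if (*present & (UINT64_C(1) << i))
            continue;               /* already cached: no allocation needed */
        out[i] = malloc(sizeof(*out[i]));
        if (!out[i])
            return i;
        out[i]->index = start + i;
        toy_cache[start + i] = out[i];
    }
    return nr;
}
```

A caller would declare `struct toy_folio *batch[BATCH_MAX]` and a `uint64_t`
bitmap on its own stack and pass both down, mirroring the on-stack array
mentioned above. After the call, the bitmap tells the caller which entries
were pre-existing and which were freshly allocated.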
Not being an expert in xarray, I do not know an idiomatic way to advance
the index if `xas_next` is called directly after instantiation of the
state that was never walked, so I used a call to `xas_set`.
While the series focuses on optimising _adding_ folios to the pagecache,
I was also experimenting with batching of pagecache _querying_.
Specifically, I tried to make use of `filemap_get_folios` instead of
`filemap_get_entry`, but I could not observe any visible speedup.
The series is applied on top of [1]. The 1st patch implements
`filemap_grab_folios`, while the 2nd patch makes use of it in
guest_memfd's write syscall as a first user.
Questions:
- Does the approach look reasonable in general?
- Can the API be kept specialised to the non-reclaim-supported case or
does it need to be generic?
- Would it be sensible to add a specialised small-folio-only version of
`filemap_grab_folios` at the beginning and extend it to large folios
later on?
- Are there better ways to implement batching or even achieve the
optimisation goal in another way?
[1]: https://lore.kernel.org/kvm/20241129123929.64790-1-kalyazin@xxxxxxxxxx/T/
[2]: https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAosPOk/edit?tab=t.0
Thanks
Nikita
Nikita Kalyazin (2):
  mm: filemap: add filemap_grab_folios
  KVM: guest_memfd: use filemap_grab_folios in write

 include/linux/pagemap.h |  31 +++++
 mm/filemap.c            | 263 ++++++++++++++++++++++++++++++++++++++++
 virt/kvm/guest_memfd.c  | 176 ++++++++++++++++++++++-----
 3 files changed, 437 insertions(+), 33 deletions(-)

base-commit: 643cff38ebe84c39fbd5a0fc3ab053cd941b9f94
--
2.40.1