Re: [PATCH RFC v2 2/5] mm: guest_memfd: Allow folios to be accessible to host

From: Vishal Annapurve
Date: Wed Oct 09 2024 - 16:14:44 EST


On Fri, Aug 30, 2024 at 3:55 AM Elliot Berman <quic_eberman@xxxxxxxxxxx> wrote:
>
> Memory given to a confidential VM sometimes needs to be accessible by
> Linux. Before the VM starts, Linux needs to load some payload into
> the guest memory. While the VM is running, the guest may make some of
> the memory accessible to the host, e.g. to share virtqueue buffers. We
> choose less-used terminology here to avoid confusion with other terms
> (i.e. private). Memory is considered "accessible" when Linux (the host)
> can read/write to that memory. It is considered "inaccessible" when
> reads/writes aren't allowed by the hypervisor.
>
> Careful tracking of when the memory is supposed to be "inaccessible" and
> "accessible" is needed because hypervisors will fault Linux if we
> incorrectly access memory which shouldn't be accessed. On arm64 systems,
> this is a translation fault. On x86 systems, this could be a machine
> check.
>
> After discussion in [1], we are using 3 counters to track the state of a
> folio: the general folio_ref_count, a "safe" ref count, and an
> "accessible" counter.

This is a long-overdue response after the discussion at LPC. In order
to support hugepages with guest_memfd, the current direction is to
split hugepages on memory conversion [1]. During the LPC session [2],
we discussed the need to reconstruct split hugepages before giving the
memory back to the hugepage allocator. After the session I discussed
this topic with David H, and I think the best way to handle
reconstruction would be to get a callback from folio_put() when the
last refcount on the pages backing guest shared memory is dropped. The
reason is that get/pin_user_pages* don't take an inode refcount, so the
guest_memfd inode can get cleaned up while the backing memory is still
pinned, e.g. by VFIO pinning memory ranges to set up IOMMU pagetables
(a minimal sketch of that kind of pin follows below).
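
For illustration only (not VFIO's actual code path, just the shape of
the problem): a long-term pin elevates folio refcounts, but nothing in
this sequence takes a reference on the guest_memfd file or inode. Here
uaddr is assumed to be a userspace mapping of the shared range:

	struct page *pages[1];
	unsigned long uaddr;	/* userspace mapping of the shared range */
	long pinned;

	/* Elevates folio refcounts only; no file/inode reference is taken. */
	pinned = pin_user_pages(uaddr, 1, FOLL_WRITE | FOLL_LONGTERM, pages);
	if (pinned == 1) {
		/* ... program IOMMU page tables with page_to_phys(pages[0]) ... */
		unpin_user_pages(pages, 1);
	}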

If such a callback is supported, I believe we don't need to implement
the "safe refcount" logic. Handling of shared -> private conversion
(truncation should be similar) by guest_memfd could look like:
1) Drop all the guest_memfd internal refcounts on pages backing the
converted ranges.
2) Tag such pages so that core-mm invokes a callback implemented by
guest_memfd when the last refcount gets dropped.
At this point I feel that the mechanism to achieve step 2 above should
be a small, generic modification to core-mm logic, without
piggy-backing on the ZONE_DEVICE memory handling that already carries
similar logic [3]. A rough sketch of what such a hook could look like
follows below.
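
Purely as a sketch (folio_test_guestmem() and guest_memfd_last_put()
are placeholder names, not existing APIs, and the rest of
__folio_put() is elided), the hook could sit next to the existing
ZONE_DEVICE special case in the folio release path:

	void __folio_put(struct folio *folio)
	{
		if (unlikely(folio_is_zone_device(folio))) {
			free_zone_device_folio(folio);
			return;
		}

		/* new, hypothetical: pages tagged in step 2 above */
		if (unlikely(folio_test_guestmem(folio))) {
			/* hand the page back for hugepage reconstruction */
			guest_memfd_last_put(folio);
			return;
		}

		/* ... existing freeing paths ... */
	}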

Private memory doesn't need such a special callback, since the desired
policy we discussed at LPC is:
1) guest_memfd owns all long-term refcounts on private memory.
2) Any short-term refcounts distributed outside guest_memfd should be
protected by folio locks (see the sketch below).
3) On truncation/conversion, guest_memfd private memory users will be
notified to unmap/refresh the memory mappings.

i.e. after a private -> shared conversion, it would be guaranteed that
there are no active users of guest private memory.
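
For item 2, a minimal sketch of the pattern I have in mind
(illustrative only; inode and index come from whatever context the
user already has):

	struct folio *folio;

	/* Short-term access: take the folio lock along with the refcount. */
	folio = filemap_lock_folio(inode->i_mapping, index);
	if (!IS_ERR(folio)) {
		/*
		 * ... short-lived access; truncation/conversion serializes
		 * against this via the folio lock ...
		 */
		folio_unlock(folio);
		folio_put(folio);	/* drop the short-term ref */
	}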

[1] Linux MM Alignment Session:
https://lore.kernel.org/all/20240712232937.2861788-1-ackerleytng@xxxxxxxxxx/
[2] https://lpc.events/event/18/contributions/1764/
[3] https://elixir.bootlin.com/linux/v6.11.2/source/mm/swap.c#L117

>
> Transition between accessible and inaccessible is allowed only when:
> 0. The folio is locked
> 1. The "accessible" counter is at 0.
> 2. The "safe" ref count equals the folio_ref_count.
> 3. The hypervisor allows it.
>
> The accessible counter can be used by Linux to guarantee the page stays
> accessible, without elevating the general refcount. When the accessible
> counter decrements to 0, we attempt to make the page inaccessible. When
> the accessible counter increments to 1, we attempt to make the page
> accessible.
>
> We expect the folio_ref_count to be nearly zero. The "nearly" amount is
> determined by the "safe" ref count value. The safe ref count isn't a
> signal whether the folio is accessible or not, it is only used to
> compare against the folio_ref_count.
>
> The final condition to transition between (in)accessible is whether the
> ->prepare_accessible or ->prepare_inaccessible guest_memfd_operation
> passes. In arm64 pKVM/Gunyah terms, the fallible "prepare_accessible"
> check is needed to ensure that the folio is unlocked by the guest and
> thus accessible to the host.
>
> When grabbing a folio, the client can either request for it to be
> accessible or inaccessible. If the folio already exists, we attempt to
> transition it to the state, if not already in that state. This will
> allow KVM or userspace to access guest_memfd *before* it is made
> inaccessible because KVM and userspace will use
> GUEST_MEMFD_GRAB_ACCESSIBLE.
>
> [1]: https://lore.kernel.org/all/a7c5bfc0-1648-4ae1-ba08-e706596e014b@xxxxxxxxxx/
>
> Signed-off-by: Elliot Berman <quic_eberman@xxxxxxxxxxx>
> ---
> include/linux/guest_memfd.h | 10 ++
> mm/guest_memfd.c | 238 +++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 236 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h
> index 8785b7d599051..66e5d3ab42613 100644
> --- a/include/linux/guest_memfd.h
> +++ b/include/linux/guest_memfd.h
> @@ -22,17 +22,27 @@ struct guest_memfd_operations {
> int (*invalidate_begin)(struct inode *inode, pgoff_t offset, unsigned long nr);
> void (*invalidate_end)(struct inode *inode, pgoff_t offset, unsigned long nr);
> int (*prepare_inaccessible)(struct inode *inode, struct folio *folio);
> + int (*prepare_accessible)(struct inode *inode, struct folio *folio);
> int (*release)(struct inode *inode);
> };
>
> +enum guest_memfd_grab_flags {
> + GUEST_MEMFD_GRAB_INACCESSIBLE = (0UL << 0),
> + GUEST_MEMFD_GRAB_ACCESSIBLE = (1UL << 0),
> +};
> +
> enum guest_memfd_create_flags {
> GUEST_MEMFD_FLAG_CLEAR_INACCESSIBLE = (1UL << 0),
> };
>
> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags);
> +void guest_memfd_put_folio(struct folio *folio, unsigned int accessible_refs);
> +void guest_memfd_unsafe_folio(struct folio *folio);
> struct file *guest_memfd_alloc(const char *name,
> const struct guest_memfd_operations *ops,
> loff_t size, unsigned long flags);
> bool is_guest_memfd(struct file *file, const struct guest_memfd_operations *ops);
> +int guest_memfd_make_accessible(struct folio *folio);
> +int guest_memfd_make_inaccessible(struct folio *folio);
>
> #endif
> diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c
> index c6cd01e6064a7..62cb576248a9d 100644
> --- a/mm/guest_memfd.c
> +++ b/mm/guest_memfd.c
> @@ -4,9 +4,33 @@
> */
>
> #include <linux/anon_inodes.h>
> +#include <linux/atomic.h>
> #include <linux/falloc.h>
> #include <linux/guest_memfd.h>
> #include <linux/pagemap.h>
> +#include <linux/wait.h>
> +
> +#include "internal.h"
> +
> +static DECLARE_WAIT_QUEUE_HEAD(safe_wait);
> +
> +/**
> + * struct guest_memfd_private - private per-folio data
> + * @accessible: number of kernel users expecting folio to be accessible.
> + * When zero, the folio converts to being inaccessible.
> + * @safe: number of "safe" references to the folio. Each reference is
> + * aware that the folio can be made (in)accessible at any time.
> + */
> +struct guest_memfd_private {
> + atomic_t accessible;
> + atomic_t safe;
> +};
> +
> +static inline int base_safe_refs(struct folio *folio)
> +{
> + /* 1 for filemap */
> + return 1 + folio_nr_pages(folio);
> +}
>
> /**
> * guest_memfd_grab_folio() -- grabs a folio from the guest memfd
> @@ -35,21 +59,56 @@
> */
> struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags)
> {
> - unsigned long gmem_flags = (unsigned long)file->private_data;
> + const bool accessible = flags & GUEST_MEMFD_GRAB_ACCESSIBLE;
> struct inode *inode = file_inode(file);
> struct guest_memfd_operations *ops = inode->i_private;
> + struct guest_memfd_private *private;
> + unsigned long gmem_flags;
> struct folio *folio;
> int r;
>
> /* TODO: Support huge pages. */
> - folio = filemap_grab_folio(inode->i_mapping, index);
> + folio = __filemap_get_folio(inode->i_mapping, index,
> + FGP_LOCK | FGP_ACCESSED | FGP_CREAT | FGP_STABLE,
> + mapping_gfp_mask(inode->i_mapping));
> if (IS_ERR(folio))
> return folio;
>
> - if (folio_test_uptodate(folio))
> + if (folio_test_uptodate(folio)) {
> + private = folio_get_private(folio);
> + atomic_inc(&private->safe);
> + if (accessible)
> + r = guest_memfd_make_accessible(folio);
> + else
> + r = guest_memfd_make_inaccessible(folio);
> +
> + if (r) {
> + atomic_dec(&private->safe);
> + goto out_err;
> + }
> +
> + wake_up_all(&safe_wait);
> return folio;
> + }
>
> - folio_wait_stable(folio);
> + private = kmalloc(sizeof(*private), GFP_KERNEL);
> + if (!private) {
> + r = -ENOMEM;
> + goto out_err;
> + }
> +
> + folio_attach_private(folio, private);
> + /*
> + * 1 for us
> + * 1 for unmapping from userspace
> + */
> + atomic_set(&private->accessible, accessible ? 2 : 0);
> + /*
> + * +1 for us
> + */
> + atomic_set(&private->safe, 1 + base_safe_refs(folio));
> +
> + gmem_flags = (unsigned long)inode->i_mapping->i_private_data;
>
> /*
> * Use the up-to-date flag to track whether or not the memory has been
> @@ -57,19 +116,26 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> * storage for the memory, so the folio will remain up-to-date until
> * it's removed.
> */
> - if (gmem_flags & GUEST_MEMFD_FLAG_CLEAR_INACCESSIBLE) {
> + if (accessible || (gmem_flags & GUEST_MEMFD_FLAG_CLEAR_INACCESSIBLE)) {
> unsigned long nr_pages = folio_nr_pages(folio);
> unsigned long i;
>
> for (i = 0; i < nr_pages; i++)
> clear_highpage(folio_page(folio, i));
> -
> }
>
> - if (ops->prepare_inaccessible) {
> - r = ops->prepare_inaccessible(inode, folio);
> - if (r < 0)
> - goto out_err;
> + if (accessible) {
> + if (ops->prepare_accessible) {
> + r = ops->prepare_accessible(inode, folio);
> + if (r < 0)
> + goto out_free;
> + }
> + } else {
> + if (ops->prepare_inaccessible) {
> + r = ops->prepare_inaccessible(inode, folio);
> + if (r < 0)
> + goto out_free;
> + }
> }
>
> folio_mark_uptodate(folio);
> @@ -78,6 +144,8 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> * unevictable and there is no storage to write back to.
> */
> return folio;
> +out_free:
> + kfree(private);
> out_err:
> folio_unlock(folio);
> folio_put(folio);
> @@ -85,6 +153,132 @@ struct folio *guest_memfd_grab_folio(struct file *file, pgoff_t index, u32 flags
> }
> EXPORT_SYMBOL_GPL(guest_memfd_grab_folio);
>
> +/**
> + * guest_memfd_put_folio() - Drop safe and accessible references to a folio
> + * @folio: the folio to drop references to
> + * @accessible_refs: number of accessible refs to drop, 0 if holding a
> + * reference to an inaccessible folio.
> + */
> +void guest_memfd_put_folio(struct folio *folio, unsigned int accessible_refs)
> +{
> + struct guest_memfd_private *private = folio_get_private(folio);
> +
> + WARN_ON_ONCE(atomic_sub_return(accessible_refs, &private->accessible) < 0);
> + atomic_dec(&private->safe);
> + folio_put(folio);
> + wake_up_all(&safe_wait);
> +}
> +EXPORT_SYMBOL_GPL(guest_memfd_put_folio);
> +
> +/**
> + * guest_memfd_unsafe_folio() - Demotes the current folio reference to "unsafe"
> + * @folio: the folio to demote
> + *
> + * Decrements the number of safe references to this folio. The folio will not
> + * transition to inaccessible until the folio_ref_count is also decremented.
> + *
> + * This function does not release the folio reference count.
> + */
> +void guest_memfd_unsafe_folio(struct folio *folio)
> +{
> + struct guest_memfd_private *private = folio_get_private(folio);
> +
> + atomic_dec(&private->safe);
> + wake_up_all(&safe_wait);
> +}
> +EXPORT_SYMBOL_GPL(guest_memfd_unsafe_folio);
> +
> +/**
> + * guest_memfd_make_accessible() - Attempt to make the folio accessible to host
> + * @folio: the folio to make accessible
> + *
> + * Makes the given folio accessible to the host. If the folio is currently
> + * inaccessible, attempts to convert it to accessible. Otherwise, returns with
> + * EBUSY.
> + *
> + * This function may sleep.
> + */
> +int guest_memfd_make_accessible(struct folio *folio)
> +{
> + struct guest_memfd_private *private = folio_get_private(folio);
> + struct inode *inode = folio_inode(folio);
> + struct guest_memfd_operations *ops = inode->i_private;
> + int r;
> +
> + /*
> + * If we already know the folio is accessible, then no need to do
> + * anything else.
> + */
> + if (atomic_inc_not_zero(&private->accessible))
> + return 0;
> +
> + r = wait_event_timeout(safe_wait,
> + folio_ref_count(folio) == atomic_read(&private->safe),
> + msecs_to_jiffies(10));
> + if (!r)
> + return -EBUSY;
> +
> + if (ops->prepare_accessible) {
> + r = ops->prepare_accessible(inode, folio);
> + if (r)
> + return r;
> + }
> +
> + atomic_inc(&private->accessible);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(guest_memfd_make_accessible);
> +
> +/**
> + * guest_memfd_make_inaccessible() - Attempt to make the folio inaccessible
> + * @folio: the folio to make inaccessible
> + *
> + * Makes the given folio inaccessible to the host. If the folio is currently
> + * accessible, attempts to convert it to inaccessible. Otherwise, returns with
> + * EBUSY.
> + *
> + * Conversion to inaccessible is allowed when ->accessible decrements to zero,
> + * the folio safe counter == folio reference counter, the folio is unmapped
> + * from the host, and ->prepare_inaccessible indicates it is ready to do so.
> + *
> + * This function may sleep.
> + */
> +int guest_memfd_make_inaccessible(struct folio *folio)
> +{
> + struct guest_memfd_private *private = folio_get_private(folio);
> + struct inode *inode = folio_inode(folio);
> + struct guest_memfd_operations *ops = inode->i_private;
> + int r;
> +
> + r = atomic_dec_if_positive(&private->accessible);
> + if (r < 0)
> + return 0;
> + else if (r > 0)
> + return -EBUSY;
> +
> + unmap_mapping_folio(folio);
> +
> + r = wait_event_timeout(safe_wait,
> + folio_ref_count(folio) == atomic_read(&private->safe),
> + msecs_to_jiffies(10));
> + if (!r) {
> + r = -EBUSY;
> + goto err;
> + }
> +
> + if (ops->prepare_inaccessible) {
> + r = ops->prepare_inaccessible(inode, folio);
> + if (r)
> + goto err;
> + }
> +
> + return 0;
> +err:
> + atomic_inc(&private->accessible);
> + return r;
> +}
> +EXPORT_SYMBOL_GPL(guest_memfd_make_inaccessible);
> +
> static long gmem_punch_hole(struct file *file, loff_t offset, loff_t len)
> {
> struct inode *inode = file_inode(file);
> @@ -229,10 +423,12 @@ static int gmem_error_folio(struct address_space *mapping, struct folio *folio)
>
> static bool gmem_release_folio(struct folio *folio, gfp_t gfp)
> {
> + struct guest_memfd_private *private = folio_get_private(folio);
> struct inode *inode = folio_inode(folio);
> struct guest_memfd_operations *ops = inode->i_private;
> off_t offset = folio->index;
> size_t nr = folio_nr_pages(folio);
> + unsigned long val, expected;
> int ret;
>
> ret = ops->invalidate_begin(inode, offset, nr);
> @@ -241,14 +437,32 @@ static bool gmem_release_folio(struct folio *folio, gfp_t gfp)
> if (ops->invalidate_end)
> ops->invalidate_end(inode, offset, nr);
>
> + expected = base_safe_refs(folio);
> + val = atomic_read(&private->safe);
> + WARN_ONCE(val != expected, "folio[%x] safe ref: %d != expected %d\n",
> + folio_index(folio), val, expected);
> +
> + folio_detach_private(folio);
> + kfree(private);
> +
> return true;
> }
>
> +static void gmem_invalidate_folio(struct folio *folio, size_t offset, size_t len)
> +{
> + WARN_ON_ONCE(offset != 0);
> + WARN_ON_ONCE(len != folio_size(folio));
> +
> + if (offset == 0 && len == folio_size(folio))
> + filemap_release_folio(folio, 0);
> +}
> +
> static const struct address_space_operations gmem_aops = {
> .dirty_folio = noop_dirty_folio,
> .migrate_folio = gmem_migrate_folio,
> .error_remove_folio = gmem_error_folio,
> .release_folio = gmem_release_folio,
> + .invalidate_folio = gmem_invalidate_folio,
> };
>
> static inline bool guest_memfd_check_ops(const struct guest_memfd_operations *ops)
> @@ -291,8 +505,7 @@ struct file *guest_memfd_alloc(const char *name,
> * instead of reusing a single inode. Each guest_memfd instance needs
> * its own inode to track the size, flags, etc.
> */
> - file = anon_inode_create_getfile(name, &gmem_fops, (void *)flags,
> - O_RDWR, NULL);
> + file = anon_inode_create_getfile(name, &gmem_fops, NULL, O_RDWR, NULL);
> if (IS_ERR(file))
> return file;
>
> @@ -303,6 +516,7 @@ struct file *guest_memfd_alloc(const char *name,
>
> inode->i_private = (void *)ops; /* discards const qualifier */
> inode->i_mapping->a_ops = &gmem_aops;
> + inode->i_mapping->i_private_data = (void *)flags;
> inode->i_mode |= S_IFREG;
> inode->i_size = size;
> mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>
> --
> 2.34.1
>
>