Re: [RFC PATCH v2 1/6] KVM: gmem: Truncate pages on punch hole

From: Sean Christopherson
Date: Thu Oct 05 2023 - 19:49:01 EST


On Thu, Oct 05, 2023, Michael Roth wrote:
> On Thu, Sep 21, 2023 at 02:34:46PM -0700, Sean Christopherson wrote:
> > On Thu, Sep 21, 2023, Sean Christopherson wrote:
> > > > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> > > > index a819367434e9..01fb4ca861d0 100644
> > > > --- a/virt/kvm/guest_mem.c
> > > > +++ b/virt/kvm/guest_mem.c
> > > > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> > > > static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > > > {
> > > > struct list_head *gmem_list = &inode->i_mapping->private_list;
> > > > + struct address_space *mapping = inode->i_mapping;
> > > > pgoff_t start = offset >> PAGE_SHIFT;
> > > > pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > > > struct kvm_gmem *gmem;
> > > >
> > > > + /*
> > > > + * punch hole may result in zeroing partial area. As pages can be
> > > > + * encrypted, prohibit zeroing partial area.
> > > > + */
> > > > + if (offset & ~PAGE_MASK || len & ~PAGE_MASK)
> > > > + return -EINVAL;
> > >
> > > This should be unnecessary, kvm_gmem_fallocate() does
> > >
> > > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > return -EINVAL;
> > >
> > > before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test
> > > fails, then that code needs to be fixed. I'll run your test to double-check,
> > > but AFAICT this is unnecesary.
> >
> > I confirmed that the testcase passes without the extra checks. Just to close the
> > loop, what prompted adding more checks to kvm_gmem_punch_hole()?
>
> I don't know if it's the same issue that Isaku ran into, but for SNP we
> hit a similar issue with the truncate_inode_pages_range(lstart, lend) call.
>
> The issue in that case was a bit more subtle:
>
> - userspace does a hole-punch on a 4K range of its gmem FD, which happens
> to be backed by a 2MB folio.
> - truncate_inode_pages_range() gets called for that 4K range
> - truncate_inode_pages_range() does special handling on the folios at the
> start/end of the range in case they are partial and passes these to
> truncate_inode_partial_folio(folio, lstart, lend). In this case, there's
> just the 1 backing folio. But it *still* gets the special treatment, and
> so gets passed to truncate_inode_partial_folio().
> - truncate_inode_partial_folio() will then zero that 4K range, even though
> it is page-aligned, based on the following rationale in the comments:
>
> /*
> * We may be zeroing pages we're about to discard, but it avoids
> * doing a complex calculation here, and then doing the zeroing
> * anyway if the page split fails.
> */
> folio_zero_range(folio, offset, length);
>
> - after that, .invalidate_folio callback is issued, then the folio is split,
> and the caller (truncate_inode_pages_range()) does another pass through
> the whole range and can free the now-split folio then .free_folio callbacks
> are issued.
>
> Because of that, we can't rely on .invalidate_folio/.free_folio to handle
> putting the page back into a normal host-accessible state, because the
> zero'ing will happen beforehand.

Argh, and that causes an RMP violation #PF.

FWIW, I don't *think* zeroing would be problematic for TDX. The page would get
poisoned, but KVM would re-zero the memory with MOVDIR64B and flush the cache.

> That's why we ended up needing to do this for SNP patches to make sure
> arch-specific invalidation callbacks are issued before the truncation occurs:
>
> https://github.com/mdroth/linux/commit/4ebcc04b84dd691fc6daccb9b7438402520b0704#diff-77306411fdaeb7f322a1ca756dead9feb75363aa6117b703ac118576153ddb37R233
>
> I'd planned to post those as a separate RFC to discuss, but when I came across
> this it seemed like it might be relevant to what the TDX folks might ran into
> here.
>
> If not for the zero'ing logic mentioned above, for SNP at least, the
> .free_folio() ends up working pretty nicely for both truncation and fput(),
> and even plays nicely with live update use-case where the destination gmem
> instance shares the inode->i_mapping, since iput() won't trigger the
> truncate_inode_pages_final() until the last reference goes away so we don't
> have to do anything special in kvm_gmem_release() to determine when we
> should/shouldn't issue the arch-invalidations to clean up things like the
> RMP table.
>
> It seems like the above zero'ing logic could be reworked to only zero non-page
> aligned ranges (as the comments above truncate_inode_pages_range() claim
> should be the case), which would avoid the issue for the gmem use-case. But I
> wonder if some explicit "dont-zero-these-pages" flag might be more robust.
>
> Or maybe there's some other way we should be going about this?

Skipping the write seems like the obvious solution. An address_space flag,
e.g. AS_INACCESSIBLE, would be the easiest thing to implement. Or maybe even
make it AS_DONT_ZERO_ON_TRUNCATE_DAMMIT (mostly joking).

Or a hook in address_space_operations to zero the folio, which conceptually is
better in many ways, but feels like overkill.