Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads

From: Andrea Arcangeli
Date: Thu Dec 03 2020 - 23:11:57 EST


Hi Peter,

On Thu, Dec 03, 2020 at 09:30:51PM -0500, Peter Xu wrote:
> I'm just afraid there's no space left for a migration entry, because migration
> entries fills in the pfn information into swp offset field rather than a real
> offset (please refer to make_migration_entry())? I assume PFN can use any bit.
> Or did I miss anything?
>
> I went back to see the original proposal from Hugh:
>
> IIUC you only need a single value, no need to carve out another whole
> swp_type: could probably be swp_offset 0 of any swp_type other than 0.
>
> Hugh/Andrea, sorry if this is a stupid swap question: could you help explain
> why swp_offset=0 won't be used by any swap device? I believe it's correct,
> it's just that I failed to figure out the reason myself. :(
>

Hugh may want to review if I got it wrong, but there are basically
three ways.

swp_type would mean adding one more reserved value in addition to
SWP_MIGRATION_READ and SWP_MIGRATION_WRITE (kind of increasing
SWP_MIGRATION_NUM to 3).

swp_offset = 0 works in combination with SWP_MIGRATION_WRITE and
SWP_MIGRATION_READ if we enforce that pfn 0 is never used by the kernel
(I'd feel safer with pfn value -1UL truncated to the bits of the swp
offset, since the swp_entry format is common code).
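
To make the "-1UL truncated to the bits of the swp offset" variant
concrete, here is a minimal user-space sketch. It is not kernel code:
the layout (5 type bits on top of a 64-bit word), the type value 30 and
every model_* / MODEL_* name are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: top MODEL_TYPE_BITS hold the type, the rest the offset. */
#define MODEL_TYPE_BITS       5
#define MODEL_OFFSET_BITS     (64 - MODEL_TYPE_BITS)
#define MODEL_OFFSET_MASK     ((UINT64_C(1) << MODEL_OFFSET_BITS) - 1)

/* Hypothetical type value standing in for SWP_MIGRATION_READ. */
#define MODEL_MIGRATION_READ  30u

/* Reserved offset: all ones, i.e. -1UL truncated to the offset width. */
#define MODEL_RESERVED_OFFSET (UINT64_MAX & MODEL_OFFSET_MASK)

/* Pack a type and an offset into one word; the offset is truncated. */
static uint64_t model_swp_entry(unsigned type, uint64_t offset)
{
    assert(type < (1u << MODEL_TYPE_BITS));
    return ((uint64_t)type << MODEL_OFFSET_BITS) |
           (offset & MODEL_OFFSET_MASK);
}

static unsigned model_swp_type(uint64_t entry)
{
    return (unsigned)(entry >> MODEL_OFFSET_BITS);
}

static uint64_t model_swp_offset(uint64_t entry)
{
    return entry & MODEL_OFFSET_MASK;
}

/* A migration-type entry carrying the reserved offset is the marker,
 * not a real pfn. This only works if no real pfn can ever equal it. */
static int model_is_reserved_marker(uint64_t entry)
{
    return model_swp_type(entry) == MODEL_MIGRATION_READ &&
           model_swp_offset(entry) == MODEL_RESERVED_OFFSET;
}
```

The point of the sketch is only that the reserved value lives inside
the common swp_entry encoding, so every user of migration entries would
have to know about it.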

The bit I was suggesting is just one more bit like _PAGE_SWP_UFFD_WP
from the pte, one that cannot ever be set in any swp entry today. I
assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
you may want to verify it...

It'd be set on the pte (not in the swap entry), then it doesn't matter
much what's inside the swp_entry anymore. The pte value would be
generated with this:

pte_swp_uffd_wp_unmap(swp_entry_to_pte(swp_entry(SWP_MIGRATION_READ, 0)))

(maybe SWP_MIGRATION_READ could also be 0, and then it'd be enough to
set just that single bit in the pte and nothing else, all other bits
zero)

We never store a raw swp entry in the pte (the raw swp entry is stored
in the xarray, it's the index of the swapcache).

To solve our unmap issue we only deal with pte storage (no xarray
index storage). This is why it can also be in the arch specific pte
representation of the swp entry, it doesn't need to be a special value
defined in the swp entry common code.

Since the swap entry to pte conversion is arch dependent, such a bit
needs to be defined by each arch (reserving an offset or type value in
the swp entry would solve it in the common code).

#define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)

All bits below PROTNONE are available for software use, and we already
use bit 1 (soft dirty) and bit 2 (uffd_wp). The protnone bit 8 itself
(the global bit) must not be set or the pte will look protnone and
pte_present will be true. Bit 7 is PSE, so it's also not available
because pte_present checks that too.
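
The constraints above can be checked mechanically. This is a small
user-space sketch using the x86 bit numbers quoted in this mail; the
MODEL_* and model_* names are hypothetical, and excluding bit 0 (the
present bit) is my addition, since the pte must stay non-present.

```c
#include <assert.h>
#include <stdint.h>

/* x86 pte bit numbers as described in the text (names are illustrative). */
#define MODEL_BIT_PRESENT          0  /* must stay clear in a swap pte */
#define MODEL_BIT_SWP_SOFT_DIRTY   1  /* already used: soft dirty */
#define MODEL_BIT_SWP_UFFD_WP      2  /* already used: uffd_wp */
#define MODEL_BIT_PSE              7  /* must stay clear */
#define MODEL_BIT_PROTNONE         8  /* global bit, must stay clear */
#define MODEL_SWP_OFFSET_FIRST_BIT (MODEL_BIT_PROTNONE + 1)

#define MODEL_BIT(n) (UINT64_C(1) << (n))

/* Mask of the low pte bits that are neither reserved nor already in use. */
static uint64_t model_free_swp_pte_bits(void)
{
    uint64_t low = MODEL_BIT(MODEL_SWP_OFFSET_FIRST_BIT) - 1; /* bits 0..8 */
    uint64_t taken = MODEL_BIT(MODEL_BIT_PRESENT) |
                     MODEL_BIT(MODEL_BIT_SWP_SOFT_DIRTY) |
                     MODEL_BIT(MODEL_BIT_SWP_UFFD_WP) |
                     MODEL_BIT(MODEL_BIT_PSE) |
                     MODEL_BIT(MODEL_BIT_PROTNONE);
    return low & ~taken;
}

/* Nonzero if the given bit number is a candidate for a new swap-pte flag. */
static int model_bit_is_free(unsigned bit)
{
    assert(bit < 64);
    return (model_free_swp_pte_bits() & MODEL_BIT(bit)) != 0;
}
```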

It appears you can pick any of bits 3, 4, 5 or 6 and it doesn't look
like we're running out of those yet (if we were, there would be a
bigger incentive to encode it as part of the swp entry format).
Example:

#define _PAGE_SWP_UFFD_WP_UNMAP _PAGE_PWT

If that bit is set and pte_present is false, then everything else in
that pte is meaningless and it means uffd wrprotected pte_none.
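
As a rough user-space model of that rule (not the real x86 helpers):
the names below are hypothetical, the bit values are the x86 ones
mentioned above, and the "present" test simply follows the description
in this mail (present, PSE or protnone set).

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_PRESENT   (UINT64_C(1) << 0)
#define MODEL_PAGE_PWT       (UINT64_C(1) << 3)
#define MODEL_PAGE_PSE       (UINT64_C(1) << 7)
#define MODEL_PAGE_PROTNONE  (UINT64_C(1) << 8)

/* Assumed choice from the example: reuse the PWT bit for the marker. */
#define MODEL_PAGE_SWP_UFFD_WP_UNMAP MODEL_PAGE_PWT

/* "Present" as described in the text: any of these bits set. */
static int model_pte_present(uint64_t pte)
{
    return (pte & (MODEL_PAGE_PRESENT | MODEL_PAGE_PSE |
                   MODEL_PAGE_PROTNONE)) != 0;
}

static int model_pte_none(uint64_t pte)
{
    return pte == 0;
}

/* Set the marker on a non-present pte value. */
static uint64_t model_mk_uffd_wp_unmap(uint64_t pte)
{
    assert(!model_pte_present(pte));
    return pte | MODEL_PAGE_SWP_UFFD_WP_UNMAP;
}

/* Marker set and pte not present: treat as "uffd wrprotected pte_none"
 * and ignore all other bits. */
static int model_is_uffd_wp_unmap(uint64_t pte)
{
    return !model_pte_present(pte) &&
           (pte & MODEL_PAGE_SWP_UFFD_WP_UNMAP) != 0;
}
```

Note that the marker pte is not pte_none() in this model, which is what
lets the fault path tell it apart from a hole that was never
wrprotected.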

So in the migration-entry/swapin page fault path, you could go one
step back and check the pte for that bit: if it's set, it's not a
migration entry.

If there's a read access it should fill the page with shmem_fault,
keep the pte wrprotected and then set _PAGE_UFFD_WP on the pte. If
there's a write access it should invoke handle_userfault.
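
The dispatch described in the last two paragraphs could look roughly
like this. It is a decision-only sketch in user space: the enum values
and function name are invented, the marker bit is the assumed PWT
choice from above, and nothing here calls the real shmem_fault or
handle_userfault.

```c
#include <assert.h>
#include <stdint.h>

#define FLT_PAGE_PRESENT        (UINT64_C(1) << 0)
#define FLT_PAGE_UFFD_WP_UNMAP  (UINT64_C(1) << 3)  /* assumed marker bit */

enum flt_access { FLT_READ, FLT_WRITE };

enum flt_action {
    FLT_MIGRATION_OR_SWAPIN,  /* marker clear: regular swp entry handling */
    FLT_FILL_AND_WRPROTECT,   /* read: fill page, keep wrprotected, set uffd_wp */
    FLT_HANDLE_USERFAULT      /* write: deliver the fault to userland */
};

/* Decide what the non-present fault path should do for this pte. */
static enum flt_action flt_decide(uint64_t pte, enum flt_access access)
{
    /* This path is only reached for non-present ptes. */
    assert(!(pte & FLT_PAGE_PRESENT));

    /* Check the marker first: if it is set, the rest of the pte is
     * meaningless and must not be decoded as a migration entry. */
    if (!(pte & FLT_PAGE_UFFD_WP_UNMAP))
        return FLT_MIGRATION_OR_SWAPIN;

    return access == FLT_WRITE ? FLT_HANDLE_USERFAULT
                               : FLT_FILL_AND_WRPROTECT;
}
```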

If there's any reason why the swp_entry reservation is simpler,
that's ok too. You'll see a lot more details once you try to
implement it, so you'll be better able to judge later. I'm greatly
simplifying everything but this is not a simple feat...

Thanks,
Andrea