Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

From: Gregory Price
Date: Sat Sep 09 2023 - 12:26:24 EST


On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
> On Thu, Sep 7, 2023, at 09:54, Gregory Price wrote:
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 2d0b1bd866ea..25db6d71af0c 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -457,3 +457,4 @@
> > 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
> > 451 i386 cachestat sys_cachestat
> > 452 i386 fchmodat2 sys_fchmodat2
> > +454 i386 move_phys_pages sys_move_phys_pages
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 1d6eee30eceb..9676f2e7698c 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -375,6 +375,7 @@
> > 451 common cachestat sys_cachestat
> > 452 common fchmodat2 sys_fchmodat2
> > 453 64 map_shadow_stack sys_map_shadow_stack
> > +454 common move_phys_pages sys_move_phys_pages
>
> Doing only x86 is fine for review purposes, but note that
> once there is consensus on actually merging it and the syscall
> number, you should do the same for all architectures. I think
> most commonly we do one patch to add the code and another
> patch to hook it up to all the syscall.tbl files and the
> include/uapi/asm-generic/unistd.h file.
>

I'd looked through a few prior patches and that's what I observed so I
tried to follow that. For the RFC i think it only made sense to
integrate it with the system i'm actually testing on, but I'll
eventually need to test it on ARM and such.

Noted.

> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 22bc6bc147f8..6860675a942f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned
> > long nr_pages,
> > const int __user *nodes,
> > int __user *status,
> > int flags);
> > +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
> > + const void __user * __user *pages,
> > + const int __user *nodes,
> > + int __user *status,
> > + int flags);
>
> The prototype looks good from a portability point of view,
> i.e. I made sure this is portable across CONFIG_COMPAT
> architectures etc.
>
> What I'm not sure about is whether the representation of physical
> memory pages as a 'void *' is a good choice, I can see potential
> problems with this:
>
> - it's not really a pointer, but instead a shifted PFN number
> in the implementation
>
> - physical addresses may be wider than pointers on 32-bit
> architectures with CONFIG_PHYS_ADDR_T_64BIT
>

Hm, good points.

I tried to keep the code close to move_pages for the sake of
familiarity and ease of review, but the physical address length
is not something i'd considered, and I'm not sure how exactly
we would handle CONFIG_PHYS_ADDR_T_64BIT. If you have an idea,
I'm happy to run with it.

on address vs pfn:

Would using PFNs cause issues with portability of userland code? User
code that gets access to a physical address would have to convert
that to a PFN, which would be kernel-defined. That could be easy
enough if the kernel exposed the shift size somewhere.

I can see arguments for PFN as opposed to physical address for
portability sake. This doesn't handle the 64-bit phys address
configuration issue, of course.

In the case where a user already has a PFN (e.g. via mm/page_idle),
defining it as a PFN interface would certainly make more sense.

Mayhaps handling both cases would be useful, depending on the source
information (see below for more details).

> I'm also not sure where the user space caller gets the
> physical addresses from, are those not intentionally kept
> hidden from userspace?
>
> Arnd

There are presently 4 places (that I know of), and 1 that is being
proposed here in the near future

1) Generally: /proc/pid/pagemap can be used to do page table
translations. I think this is only really useful for testing,
since if you have the virtual address and pid you would use
move_pages, but it's certainly useful for testing this.

2) X86: IBS (AMD) and PEBS (Intel) can be configured to return physical
address information instead of virtual address information. In fact
you can configure it to give you both the virtual and physical
address for a process.

3) zoneinfo: /proc/zoneinfo exposes the start PFN of zones

4) /sys/kernel/mm/page_idle: A way to query whether a PFN is idle.
While itself seemingly not useful, if the goal is to migrate
less-used pages to "lower tiers" of memory, then you can query
the bitmap, directly recover idle PFNs, and attempt to migrate
them (which may fail, for a variety of reasons).

https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html


5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
hot/cold information about its memory. If a heatmap is provided,
for example, it would have to be a device address (0-based) or a
physical address (some base defined by the kernel and programmed
into device decoders such that it can convert them to 0-based).

This is presently being proposed but has not been agreed upon yet.

~Gregory