Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

From: Gregory Price
Date: Tue Sep 19 2023 - 12:32:01 EST


On Mon, Sep 18, 2023 at 08:34:16PM -0700, Andy Lutomirski wrote:
>
>
> On Sun, Sep 10, 2023, at 4:49 AM, Gregory Price wrote:
> > On Sun, Sep 10, 2023 at 02:36:40PM -0600, Jonathan Corbet wrote:
> >>
> >> So this is probably a silly question, but just to be sure ... what is
> >> the permission model for this system call? As far as I can tell, the
> >> ability to move pages is entirely unrestricted, with the exception of
> >> pages that would need MPOL_MF_MOVE_ALL. If so, that seems undesirable,
> >> but probably I'm just missing something ... ?
> >>
> >> Thanks,
> >>
> >> jon
> >
> > Not silly, looks like when I dropped the CAP_SYS_NICE check (no task to
> > check against), I neglected to add a CAP_SYS_ADMIN check.
>
> Global, I presume?
>
> I have to admit that I don’t think this patch set makes sense at all.
>
> As I understand it, there are two kinds of physical memory resource in CXL: those that live on a device and those that live in host memory.
>
> Device memory doesn’t migrate as such: if a page is on an accelerator, it’s on that accelerator. (If someone makes an accelerator with *two* PCIe targets and connects each target to a different node, that’s a different story.)
>
> Host memory is host memory. CXL may access it, and the CXL access from a given device may be faster if that device is connected closer to the memory. And the device may or may not know the virtual address and PASID of the memory.
>

The CXL memory description here is a bit inaccurate. Memory on the CXL
bus is not limited to host and accelerator memory; CXL memory devices
may also present memory for use by the system as-if it were just DRAM.
The accessing mechanisms are the same (i.e. you can 'mov rax, [rbx]'
and the result is a cacheline fetch that goes over the cxl bus rather
than the DRAM memory controllers).

Small CXL background for the sake of clarity:

type-2 devices are "accelerators", and the memory relationships you
describe here are roughly accurate. The intent of this interface is not
really for the purpose of managing type-2/accelerator device memory.

type-3 devices are "memory devices", whose intent is to provide the
system additional memory resources that get mapped into one or more numa
nodes. The intent of these devices is to present memory to the kernel
*as-if* it were regular old DRAM just with different latency and
bandwidth attributes. This is a simplification of the overall goal.


So from the perspective of the kernel and a memory-tiering system, we
have numa nodes which abstract physical memory, and that physical memory
may actually live anywhere (DRAM, CXL, wherever). This memory is
fungible with the exception that CXL memory should be placed in
ZONE_MOVABLE to ensure the hot-pluggability of those memory devices.


The intent of this interface is to make page-migration decisions without
the need to track individual processes or virtual address mappings.

One example would be to utilize the idle page tracking mechanism from
userland to make migration decisions.

https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

This mechanism allows a user to determine which PFNs are idle. Combine
this information with a move_phys_pages syscall, and you can implement
demotion/promotion in user-land without having to identify the virtual
address mappings of those PFNs in user-land.
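
As a rough sketch of the first half of that loop: the idle page tracking
documentation describes /sys/kernel/mm/page_idle/bitmap as an array of
8-byte words where PFN #i maps to bit #i%64 of word #i/64, and a set bit
means the page is idle. The helper below (collect_idle_pfns is a
hypothetical name, not part of the patch set) turns a chunk of that
bitmap into a PFN list. It assumes the words were already read from the
file and that first_pfn is a multiple of 64.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the PFNs whose idle bit is set.
 *
 * words:     8-byte entries as read from the page_idle bitmap
 * nwords:    number of entries in words[]
 * first_pfn: PFN corresponding to bit 0 of words[0] (assumed to be a
 *            multiple of 64)
 * out, max:  output array and its capacity
 *
 * Returns the number of PFNs written to out[] (at most max).
 */
size_t collect_idle_pfns(const uint64_t *words, size_t nwords,
                         uint64_t first_pfn, uint64_t *out, size_t max)
{
    size_t n = 0;

    for (size_t w = 0; w < nwords; w++) {
        for (unsigned int bit = 0; bit < 64; bit++) {
            if (!(words[w] & (UINT64_C(1) << bit)))
                continue;       /* page was accessed, not idle */
            if (n == max)
                return n;       /* output array is full */
            out[n++] = first_pfn + (uint64_t)w * 64 + bit;
        }
    }
    return n;
}
```

The resulting list is exactly the kind of input a physical-address
migration call would take, with no pagemap or per-task lookup involved.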


> I fully believe that there’s some use for migrating host memory to a node that's closer to a device. But I don't think this API is the right way. First, something needs to figure out that the host memory should be migrated. Doing this presumably involves identifying which (logical!) memory is being accessed and deciding to move it. Maybe new APIs are needed to enable this.
>

The intent is not to migrate memory to make it "closer to a device",
assuming you mean the intent is to make that data closer to a device
that is using it (i.e. an accelerator).

The intent is to allow migration of memory based on a user-defined
policy via the usage of physical addresses.

Let's consider a bandwidth-expansion focused tiering policy. Each
additional CXL Type-3 Memory device provides additional memory
bandwidth to a processor via its pcie/cxl lanes.

If all you care about is latency, moving/migrating pages closer to the
processor is beneficial. However, if you care about maximizing
bandwidth, distributing memory across all possible devices with some
statistical distribution is a better goal.
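
To make "some statistical distribution" concrete, here is a minimal
sketch of one possible policy: spread pages across nodes in proportion
to a per-node weight. The function name and the idea of using relative
bandwidth as the weight are assumptions for illustration only; nothing
in this patch set defines such a policy.

```c
#include <stddef.h>

/*
 * Pick a target node for the page_idx-th page so that, over many pages,
 * each node receives a share proportional to its weight.
 *
 * weights[i] is a hypothetical relative bandwidth figure for node i.
 * Returns an index into weights[], or -1 if all weights are zero.
 */
int pick_node_weighted(const unsigned int *weights, size_t nr_nodes,
                       unsigned long page_idx)
{
    unsigned long total = 0;

    for (size_t i = 0; i < nr_nodes; i++)
        total += weights[i];
    if (total == 0)
        return -1;

    /* Map the page onto one slot of a repeating cycle of 'total' slots. */
    unsigned long slot = page_idx % total;

    for (size_t i = 0; i < nr_nodes; i++) {
        if (slot < weights[i])
            return (int)i;
        slot -= weights[i];
    }
    return -1; /* not reached: slot < total */
}
```

With weights {3, 1}, three of every four pages land on the first node
and one on the second, regardless of which node is "closer".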

So sometimes you actually want hot data further away because it allows
for more concurrent cacheline fetches to occur.


The question as to whether getting the logical memory address is
required, useful, or performant depends on what sources of information
you can pull physical address information from.

Above I explained idle page tracking, but another example would be the
CXL device directly, which knows 2 pieces of information (generally):

1) The extent of the memory it is hosting (some size)
2) The physical-to-device address mapping for the system contacting it.

The device talks (internally) in 0-based addressing (0x0 up to 0x...),
but the host places the host physical address (HPA) on the bus
(0x123450000). The device receives and converts 0x123450000 (HPA) into
a 0-based address (device-physical-address, DPA).

How does this relate to this interface?

Consider a device which provides a "heat-map" for the memory it is
hosting. If a user or system requests this heat-map, the device can
only provide that information in terms of either HPA or DPA. If DPA,
then the host can recover the HPA by simply looking at the mapping it
programmed the device with. This reverse-translation (DPA-to-HPA) is
relatively inexpensive.
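
A simplified sketch of why this translation is cheap, assuming a single
contiguous decode window with no interleaving (real CXL decoders can
interleave across devices, which this ignores). The struct and function
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one host-programmed decode window. */
struct hpa_window {
    uint64_t hpa_base; /* first host physical address routed to the device */
    uint64_t size;     /* bytes of device memory exposed via this window */
};

/* HPA -> DPA: what the device does when a request arrives on the bus. */
bool hpa_to_dpa(const struct hpa_window *win, uint64_t hpa, uint64_t *dpa)
{
    if (hpa < win->hpa_base || hpa - win->hpa_base >= win->size)
        return false;           /* address is not decoded by this window */
    *dpa = hpa - win->hpa_base;
    return true;
}

/* DPA -> HPA: what the host does with a device-reported address. */
bool dpa_to_hpa(const struct hpa_window *win, uint64_t dpa, uint64_t *hpa)
{
    if (dpa >= win->size)
        return false;           /* beyond the memory the device exposes */
    *hpa = win->hpa_base + dpa;
    return true;
}
```

Under these assumptions both directions are a bounds check plus one
addition or subtraction, with no per-task state involved.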

The idle-page tracking interface is actually a good example of this. It
is functionally a heat-map for the entire system.

However, things get extraordinarily expensive after this. HPA to host
virtual address translation (HPA to HVA) requires inspecting every task
that may map that HPA in its page tables. When the cacheline fetch hits
the bus, you are well below the construct of a "task", and the device
has no way of telling you what task is using memory on that device.

This makes any kind of tiering operation based on this type of heat-map
information somewhat of a non-starter. You steal so much performance
just converting that information into task-specific information, that
you may as well not bother doing it.
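
For a sense of the cost from user-land, here is a sketch of the reverse
lookup using /proc/<pid>/pagemap entries. It relies on the documented
entry layout (bit 63 = page present, bits 0-54 = PFN when present) and
assumes the entries for one virtual range of one task have already been
read with sufficient privilege to see PFNs. The function name is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PM_PRESENT  (UINT64_C(1) << 63)         /* page is in RAM */
#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)   /* bits 0-54: PFN */

/*
 * Linearly scan pagemap entries covering one virtual range of one task,
 * looking for target_pfn.
 *
 * Returns the index of the matching entry (i.e. the page offset within
 * the range), or -1 if the range does not map that PFN.
 */
long find_pfn_in_pagemap(const uint64_t *entries, size_t nentries,
                         uint64_t target_pfn)
{
    for (size_t i = 0; i < nentries; i++) {
        if (!(entries[i] & PM_PRESENT))
            continue;           /* swapped out or not mapped */
        if ((entries[i] & PM_PFN_MASK) == target_pfn)
            return (long)i;
    }
    return -1;
}
```

This scan has to be repeated for every mapped range of every candidate
task, for every hot PFN reported, which is the overhead being described
above.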

Instead, this interface would allow for a tiering policy to operate on
such heat-map information directly, and since all CXL memory is intended
to be placed in ZONE_MOVABLE, that memory should always be migratable.

> But this API is IMO rather silly. Just as a trivial observation, if you migrate a page you identify by physical address, *that physical address changes*. So the only way it possibly works is that whatever heuristic is using the API knows to invalidate itself after calling the API, but of course it also needs to invalidate itself if the kernel becomes intelligent enough to migrate the page on its own or the owner of the logical page triggers migration, etc.
>
> Put differently, the operation "migrate physical page 0xABCD000 to node 3" makes no sense. That physical address belongs to whatever node its on, and without some magic hardware support that does not currently exist, it's not going anywhere at runtime.
>
> I just don't see this code working well, never mind the security issues.

I think this is more of a terminology issue. I'm not married to the
name, but to me move_phys_pages is intuitively easier to understand
because move_pages exists and the only difference between the two
interfaces is virtual vs physical addressing.

move_pages doesn't "migrate a virtual page" either, it "migrates the
data pointed to by this virtual address to another physical page located
on the target numa node".

Likewise this interface "migrates the data located at the physical address,
assuming the physical address is mapped, to another page on the target numa
node".

The virtual-address sister-function doesn't "move the physical page"
either, it moves the data from one physical page to another and updates
the page tables.

~Gregory