Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order

From: Dave Hansen
Date: Tue Jun 22 2021 - 08:07:04 EST


Yan, your reply came through in HTML. It doesn't bother me too much,
but you'll find your replies dropped by LKML and other mailing lists
if you do this.

On 6/21/21 7:50 AM, Zi Yan wrote:
> Is there a plan of allowing user to change where the migration path
> starts? Or maybe one step further providing an interface to allow
> user to specify the demotion path. Something like
> /sys/devices/system/node/node*/node_demotion.

We actually had this in an earlier series. I pulled it out because we
don't really *need* this ABI at the moment. But, I totally agree that
it would be handy for many things, including any non-obvious topology
where the built-in ordering isn't optimal.

>> I don't think that's necessary at least for now. Do you know any
>> real world use case for this?
>
> In our P9+volta system, GPU memory is exposed as a NUMA node. For
> the GPU workloads with data size greater than GPU memory size, it
> will be very helpful to allow pages in GPU memory to be
> migrated/demoted to CPU memory. With your current assumption, GPU
> memory -> CPU memory demotion seems not possible, right? This
> should also apply to any system with a device memory exposed as a
> NUMA node and workloads running on the device and using CPU memory
> as a lower tier memory than the device memory.

Yes, with the current ordering, CPU memory would be demoted to the
GPU, not the other way around. The right way to fix this (on ACPI
platforms at least) is probably to use the HMAT and build the
demotion order from any memory targets rather than just CPUs.

That would be a great future enhancement to all of this. But, because
not all systems have HMATs, we also need something more basic, which
is what is in this series.
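
To make the GPU case concrete, here is a rough userspace sketch of
why the ordering comes out that way. It is not the code from the
patch, only a simplified model of the idea: start from the nodes that
have CPUs and give each one the nearest node not yet used. The
function name, the NO_NODE constant, and the has_cpu and distance
inputs are all made up for illustration.

```python
NO_NODE = -1  # hypothetical stand-in for "no demotion target"


def build_demotion_order(has_cpu, distance):
    """Return a list where entry n is the node that n demotes to.

    has_cpu[n] is True if node n has CPUs, and distance[a][b] is a
    SLIT-style distance. Both are assumed inputs for this sketch.
    """
    nr_nodes = len(has_cpu)
    target = [NO_NODE] * nr_nodes

    # The first pass always starts from the nodes that have CPUs.
    this_pass = [n for n in range(nr_nodes) if has_cpu[n]]
    used = set(this_pass)

    while this_pass:
        next_pass = []
        for node in this_pass:
            # Any node that has not yet been a source or a target.
            candidates = [n for n in range(nr_nodes) if n not in used]
            if not candidates:
                continue
            # Nearest candidate wins; ties go to the lower node id.
            best = min(candidates, key=lambda n: (distance[node][n], n))
            target[node] = best
            used.add(best)
            next_pass.append(best)
        # Nodes that were just picked as targets form the next tier.
        this_pass = next_pass

    return target
```

With node 0 as CPU+DRAM and node 1 as CPU-less GPU memory,
build_demotion_order([True, False], [[10, 80], [80, 10]]) gives node 0
a demotion target of node 1 and leaves node 1 with none. Because the
walk starts from CPU nodes, the GPU node can only ever be a target.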