Re: [PATCH -V6 RESEND 2/3] NOT kernel/man-pages: man2/set_mempolicy.2: Add mode flag MPOL_F_NUMA_BALANCING
From: Mel Gorman
Date: Thu Dec 03 2020 - 04:38:25 EST
On Thu, Dec 03, 2020 at 09:49:02AM +0800, Huang, Ying wrote:
> >> diff --git a/man2/set_mempolicy.2 b/man2/set_mempolicy.2
> >> index 68011eecb..3754b3e12 100644
> >> --- a/man2/set_mempolicy.2
> >> +++ b/man2/set_mempolicy.2
> >> @@ -113,6 +113,12 @@ A nonempty
> >> .I nodemask
> >> specifies node IDs that are relative to the set of
> >> node IDs allowed by the process's current cpuset.
> >> +.TP
> >> +.BR MPOL_F_NUMA_BALANCING " (since Linux 5.11)"
> >> +Enable the Linux kernel NUMA balancing for the task if it is supported
> >> +by kernel.
> >> +If the flag isn't supported by Linux kernel, return -1 and errno is
> >> +set to EINVAL.
> >> .PP
> >> .I nodemask
> >> points to a bit mask of node IDs that contains up to
> >> @@ -293,6 +299,9 @@ argument specified both
> >
> > Should this be expanded more to clarify it applies to MPOL_BIND
> > specifically?
> >
> > Maybe the first patch should be expanded more and explicitly fail if
> > MPOL_F_NUMA_BALANCING is used with anything other than MPOL_BIND?
>
> For MPOL_PREFERRED, why could we not use NUMA balancing to migrate pages
> to the accessing local node if it is same as the preferred node?

You could, but the kernel patch does not do that; that would mean making
preferred_nid stick to the preferred node when hinting faults are trapped
on that VMA. It would have to be a separate patch coupled with a man page
update. If you wanted to go in this direction in the future, then the
patch should explicitly return an error *now* if MPOL_PREFERRED is or'd
with MPOL_F_NUMA_BALANCING so that, once an application becomes aware of
MPOL_F_NUMA_BALANCING, it can detect whether support exists in the
currently running kernel.
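
As a rough illustration of the runtime detection described above, a
userspace probe might look like the sketch below. It is only a sketch
under assumptions: the SK_* constants mirror values from the uapi header
<linux/mempolicy.h>, the helper names are hypothetical, node 0 is assumed
to be allowed by the caller's cpuset, and the task is assumed to start
with the default policy (which is what gets restored). Because EINVAL can
also come from a bad nodemask, plain MPOL_BIND is tried first so that a
later EINVAL can be attributed to the flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed to match <linux/mempolicy.h>; prefixed to avoid header clashes. */
#define SK_MPOL_DEFAULT 0
#define SK_MPOL_BIND 2
#define SK_MPOL_F_NUMA_BALANCING (1 << 13)

enum sk_probe_result { SK_FLAG_SUPPORTED, SK_FLAG_UNSUPPORTED, SK_PROBE_INCONCLUSIVE };

/* Interpret the two set_mempolicy() attempts (hypothetical helper). */
enum sk_probe_result sk_classify(long base_rc, long flag_rc, int flag_errno)
{
    if (base_rc != 0)
        return SK_PROBE_INCONCLUSIVE;  /* plain MPOL_BIND already failed */
    if (flag_rc == 0)
        return SK_FLAG_SUPPORTED;
    if (flag_errno == EINVAL)
        return SK_FLAG_UNSUPPORTED;    /* same nodemask, only the flag differs */
    return SK_PROBE_INCONCLUSIVE;      /* e.g. EPERM from a sandbox */
}

/* Try MPOL_BIND with and without the flag, then reset to MPOL_DEFAULT. */
enum sk_probe_result sk_probe_numa_balancing_flag(void)
{
#ifdef SYS_set_mempolicy
    unsigned long mask = 1UL;                 /* node 0 only (assumption) */
    unsigned long maxnode = 8 * sizeof(mask);
    long flag_rc = -1;
    int flag_errno = 0;

    long base_rc = syscall(SYS_set_mempolicy, SK_MPOL_BIND, &mask, maxnode);
    if (base_rc == 0) {
        flag_rc = syscall(SYS_set_mempolicy,
                          SK_MPOL_BIND | SK_MPOL_F_NUMA_BALANCING,
                          &mask, maxnode);
        flag_errno = errno;
    }
    /* Restore the default policy; assumes that was the policy on entry. */
    syscall(SYS_set_mempolicy, SK_MPOL_DEFAULT, NULL, 0UL);
    return sk_classify(base_rc, flag_rc, flag_errno);
#else
    return SK_PROBE_INCONCLUSIVE;
#endif
}
```

An application would call sk_probe_numa_balancing_flag() once at startup.
The same approach only works for MPOL_PREFERRED or MPOL_INTERLEAVE if the
kernel rejects those combinations today, which is the point being made
above.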
> Even for MPOL_INTERLEAVE, if the target node is the same as the
> accessing local node, can we use NUMA balancing to migrate pages?
>
The intent of MPOL_INTERLEAVE is to average out the cost of memory
accesses so that the cost is roughly similar across the entire range of
the VMA. This may be particularly important if the VMA is shared between
multiple threads that are spread out on multiple nodes. A change in
semantics there should be clearly documented.

Similarly, if you want to go in this direction, MPOL_F_NUMA_BALANCING
should be checked against MPOL_INTERLEAVE and explicitly fail now so
support can be detected at runtime.
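
To make the suggestion concrete, the check could be modelled roughly as
below. This is a standalone userspace sketch, not the code from the patch
under review: sk_check_mode_flags() is a hypothetical name, and the SK_*
constants are assumed to mirror the values in <linux/mempolicy.h>.

```c
#include <errno.h>

/* Assumed to match <linux/mempolicy.h>; prefixed to avoid header clashes. */
#define SK_MPOL_PREFERRED 1
#define SK_MPOL_BIND 2
#define SK_MPOL_INTERLEAVE 3
#define SK_MPOL_F_NUMA_BALANCING (1 << 13)

/*
 * Reject MPOL_F_NUMA_BALANCING for every mode except MPOL_BIND, so that
 * a later kernel accepting more modes can be told apart at runtime.
 * Returns 0 if the combination is accepted, -EINVAL otherwise.
 */
int sk_check_mode_flags(int mode, int flags)
{
    if ((flags & SK_MPOL_F_NUMA_BALANCING) && mode != SK_MPOL_BIND)
        return -EINVAL;
    return 0;
}
```

With a check of this shape, MPOL_INTERLEAVE or MPOL_PREFERRED or'd with
the flag fails with EINVAL today, and modes without the flag are
unaffected.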
> So, I prefer to make MPOL_F_NUMA_BALANCING to be
>
> Optimizing with NUMA balancing if possible, and we may add more
> optimization in the future.
>
Maybe, but I think it's best that the actual behaviour of the kernel is
documented instead of desired behaviour or future planning.
--
Mel Gorman
SUSE Labs