Re: [PATCH v4 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave

From: Huang, Ying
Date: Mon Dec 18 2023 - 22:10:11 EST


Gregory Price <gourry.memverge@xxxxxxxxx> writes:

> Extend set_mempolicy2 and mbind2 to support weighted interleave, and
> demonstrate the extensibility of the mpol_args structure.
>
> To support weighted interleave we add interleave weight fields to the
> following structures:
>
> Kernel Internal: (include/linux/mempolicy.h)
> struct mempolicy {
> /* task-local weights to apply to weighted interleave */
> unsigned char weights[MAX_NUMNODES];
> }
> struct mempolicy_args {
> /* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
> unsigned char *il_weights; /* of size MAX_NUMNODES */
> }
>
> UAPI: (/include/uapi/linux/mempolicy.h)
> struct mpol_args {
> /* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
> unsigned char *il_weights; /* of size pol_max_nodes */
> }
>
> The task-local weights are a single, one-dimensional array of weights
> that apply to all possible nodes on the system. If a node is set in
> the mempolicy nodemask, the weight in `il_weights` must be >= 1,
> otherwise set_mempolicy2() will return -EINVAL. If a node is not
> set in pol_nodemask, the weight will default to `1` in the task policy.
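
To make the rule above concrete, here is a minimal userspace sketch of how
I read it. It is not code from the patch: sketch_apply_weights(),
SKETCH_MAX_NODES and the bool array standing in for the nodemask are all
hypothetical names chosen for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MAX_NUMNODES, kept small for illustration. */
#define SKETCH_MAX_NODES 8

/*
 * Copy user-supplied weights into the task-local table.
 * node_set[i] is true when node i is set in the policy nodemask.
 * Returns 0 on success, -EINVAL if a node in the mask has weight 0.
 */
static int sketch_apply_weights(const bool *node_set,
				const unsigned char *il_weights,
				unsigned char *task_weights)
{
	size_t i;

	/* Validate first so task_weights is never left half-written. */
	for (i = 0; i < SKETCH_MAX_NODES; i++) {
		if (node_set[i] && il_weights[i] == 0)
			return -EINVAL;
	}

	/* Nodes outside the mask fall back to the default weight of 1. */
	for (i = 0; i < SKETCH_MAX_NODES; i++)
		task_weights[i] = node_set[i] ? il_weights[i] : 1;

	return 0;
}
```

With nodes 0 and 1 in the mask and weights 5 and 2, the sketch stores 5
and 2 for those nodes and 1 for every other node.
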
>
> The default value of `1` is required to handle the situation where a
> task migrates to a set of nodes for which weights were not set (up to
> and including the local numa node). For example, a migrated task whose
> nodemask changes entirely will have all its weights defaulted back
> to `1`, or if the nodemask changes to include a mix of nodes that
> were not previously accounted for - the weighted interleave may be
> suboptimal.
>
> If migrations are expected, a task should prefer not to use task-local
> interleave weights, and instead utilize the global settings for natural
> re-weighting on migration.
>
> To support global vs local weighting, we add the kernel-internal flag:
> MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */
>
> This flag is set when il_weights is omitted by set_mempolicy2(), or
> when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy(). This internal
> mode_flag dictates whether global weights or task-local weights are
> utilized by the various weighted interleave functions:
>
> * weighted_interleave_nodes
> * weighted_interleave_nid
> * alloc_pages_bulk_array_weighted_interleave
>
> if (pol->flags & MPOL_F_GWEIGHT)
> pol_weights = iw_table;
> else
> pol_weights = pol->wil.weights;
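
A small userspace sketch of the selection above, combined with the [5,2]
example from the documentation hunk. Everything prefixed with sketch_ or
SKETCH_ is hypothetical, and the distribution loop is a simplified model
of weighted round-robin, not the kernel's allocator path.

```c
#include <stddef.h>

#define SKETCH_F_GWEIGHT (1u << 5)	/* mirrors MPOL_F_GWEIGHT in the patch */
#define SKETCH_MAX_NODES 4		/* hypothetical, stands in for MAX_NUMNODES */

/* Hypothetical global table; in the patch this role is played by iw_table. */
static unsigned char sketch_iw_table[SKETCH_MAX_NODES] = { 1, 1, 1, 1 };

struct sketch_policy {
	unsigned int flags;
	unsigned char weights[SKETCH_MAX_NODES];	/* task-local weights */
};

/* Pick global or task-local weights, as in the quoted snippet. */
static const unsigned char *sketch_pick_weights(const struct sketch_policy *pol)
{
	if (pol->flags & SKETCH_F_GWEIGHT)
		return sketch_iw_table;
	return pol->weights;
}

/*
 * Count how many of npages land on each of the first nnodes nodes when
 * node i receives w[i] consecutive pages per round.
 * Returns -1 if any weight is 0 (the loop would never advance past it).
 */
static int sketch_distribute(const unsigned char *w, size_t nnodes,
			     size_t npages, size_t *out)
{
	size_t node = 0, i;
	unsigned int left;

	if (nnodes == 0)
		return -1;
	for (i = 0; i < nnodes; i++) {
		if (w[i] == 0)
			return -1;
		out[i] = 0;
	}

	left = w[0];
	while (npages > 0) {
		if (left == 0) {
			/* Current node's share is used up, move to the next. */
			node = (node + 1) % nnodes;
			left = w[node];
			continue;
		}
		out[node]++;
		left--;
		npages--;
	}
	return 0;
}
```

With task-local weights [5,2] on nodes [0,1], 14 pages split 10 and 4.
Setting SKETCH_F_GWEIGHT on the same policy makes the sketch use the
global table instead, which here weights every node equally.
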
>
> To simplify creation and duplication of mempolicies, the weights are
> added as a structure directly within mempolicy. This allows the
> existing logic in __mpol_dup to copy the weights without additional
> allocations:
>
> if (old == current->mempolicy) {
> task_lock(current);
> *new = *old;
> task_unlock(current);
> } else
> *new = *old;
>
> Suggested-by: Rakie Kim <rakie.kim@xxxxxx>
> Suggested-by: Hyeongtak Ji <hyeongtak.ji@xxxxxx>
> Suggested-by: Honggyu Kim <honggyu.kim@xxxxxx>
> Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@xxxxxxxxxx>
> Signed-off-by: Gregory Price <gregory.price@xxxxxxxxxxxx>
> Co-developed-by: Rakie Kim <rakie.kim@xxxxxx>
> Signed-off-by: Rakie Kim <rakie.kim@xxxxxx>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@xxxxxx>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@xxxxxx>
> Co-developed-by: Honggyu Kim <honggyu.kim@xxxxxx>
> Signed-off-by: Honggyu Kim <honggyu.kim@xxxxxx>
> Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@xxxxxxxxxx>
> Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@xxxxxxxxxx>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 10 ++
> include/linux/mempolicy.h | 2 +
> include/uapi/linux/mempolicy.h | 2 +
> mm/mempolicy.c | 129 +++++++++++++++++-
> 4 files changed, 139 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 99e1f732cade..0e91efe9e769 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -254,6 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE
> This mode operates the same as MPOL_INTERLEAVE, except that
> interleaving behavior is executed based on weights set in
> /sys/kernel/mm/mempolicy/weighted_interleave/
> + when configured to utilize global weights, or based on task-local
> + weights configured with set_mempolicy2(2) or mbind2(2).
>
> Weighted interleave allocations pages on nodes according to
> their weight. For example if nodes [0,1] are weighted [5,2]
> @@ -261,6 +263,13 @@ MPOL_WEIGHTED_INTERLEAVE
> 2 pages allocated on node1. This can better distribute data
> according to bandwidth on heterogeneous memory systems.
>
> + When utilizing task-local weights, weights are not rebalanced
> + in the event of a task migration. If a weight has not been
> + explicitly set for a node set in the new nodemask, the
> + value of that weight defaults to "1". For this reason, if
> + migrations are expected or possible, users should consider
> + utilizing global interleave weights.
> +
> NUMA memory policy supports the following optional mode flags:
>
> MPOL_F_STATIC_NODES
> @@ -514,6 +523,7 @@ Extended Mempolicy Arguments::
> __u16 mode_flags;
> __s32 home_node; /* mbind2: policy home node */
> __aligned_u64 pol_nodes; /* nodemask pointer */
> + __aligned_u64 il_weights; /* u8 buf of size pol_maxnodes */
> __u64 pol_maxnodes;
> __s32 policy_node; /* get_mempolicy2: policy node information */
> };
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index aeac19dfc2b6..387c5c418a66 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -58,6 +58,7 @@ struct mempolicy {
> /* Weighted interleave settings */
> struct {
> unsigned char cur_weight;
> + unsigned char weights[MAX_NUMNODES];
> } wil;
> };
>
> @@ -70,6 +71,7 @@ struct mempolicy_args {
> unsigned short mode_flags; /* policy mode flags */
> int home_node; /* mbind: use MPOL_MF_HOME_NODE */
> nodemask_t *policy_nodes; /* get/set/mbind */
> + unsigned char *il_weights; /* for mode MPOL_WEIGHTED_INTERLEAVE */
> int policy_node; /* get: policy node information */
> };
>
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index ec1402dae35b..16fedf966166 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -33,6 +33,7 @@ struct mpol_args {
> __u16 mode_flags;
> __s32 home_node; /* mbind2: policy home node */
> __aligned_u64 pol_nodes;
> + __aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(char) */
> __u64 pol_maxnodes;
> __s32 policy_node; /* get_mempolicy: policy node info */
> };

You break the ABI you introduced earlier in the patchset. Although both
changes land within the same patchset, I don't think that is a good idea.
I suggest finalizing the ABI in the patch that first introduces it.
Otherwise, people reading the git log will be confused by the ABI break.
Finalizing it up front also makes the series easier to review.
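
To illustrate the concern, here is a minimal userspace sketch. It assumes
mock-ups of only the fields visible in the hunk above: fields that come
before mode_flags are not shown there and are left out, and fixed-width
stdint types stand in for the kernel's __u16/__s32/__u64. The struct and
macro names are hypothetical. Inserting il_weights in the middle moves
every later field by 8 bytes, so a binary built against the earlier
layout would pass pol_maxnodes at the wrong offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for __aligned_u64: a 64-bit value with forced 8-byte alignment. */
#define SKETCH_ALIGNED_U64 uint64_t __attribute__((aligned(8)))

/* Layout as introduced earlier in the series (visible fields only). */
struct sketch_mpol_args_before {
	uint16_t mode_flags;
	int32_t home_node;
	SKETCH_ALIGNED_U64 pol_nodes;
	uint64_t pol_maxnodes;
	int32_t policy_node;
};

/* Layout after this patch: il_weights inserted before pol_maxnodes. */
struct sketch_mpol_args_after {
	uint16_t mode_flags;
	int32_t home_node;
	SKETCH_ALIGNED_U64 pol_nodes;
	SKETCH_ALIGNED_U64 il_weights;
	uint64_t pol_maxnodes;
	int32_t policy_node;
};

/* The same field name ends up at a different offset in the two layouts. */
static_assert(offsetof(struct sketch_mpol_args_before, pol_maxnodes) !=
	      offsetof(struct sketch_mpol_args_after, pol_maxnodes),
	      "pol_maxnodes moved between the two layouts");

/* Offset helpers so the shift can also be inspected at run time. */
static size_t sketch_maxnodes_offset_before(void)
{
	return offsetof(struct sketch_mpol_args_before, pol_maxnodes);
}

static size_t sketch_maxnodes_offset_after(void)
{
	return offsetof(struct sketch_mpol_args_after, pol_maxnodes);
}
```

Appending new fields at the end of the structure, or defining the final
layout in the patch that first adds mpol_args, would avoid the shift.
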

> @@ -75,6 +76,7 @@ struct mpol_args {
> #define MPOL_F_SHARED (1 << 0) /* identify shared policies */
> #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
> #define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */
> +#define MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */
>
> /*
> * These bit locations are exposed in the vm.zone_reclaim_mode sysctl

--
Best Regards,
Huang, Ying

[snip]