Re: [RFC] mm: add new syscall pidfd_set_mempolicy()

From: Frank van der Linden
Date: Mon Oct 10 2022 - 12:22:34 EST


For consistency with process_madvise(), I would suggest calling it
process_set_mempolicy. Other than that, this makes sense. To complete
the set, perhaps a process_mbind() should be added as well. What do
you think?

- Frank

On Mon, Oct 10, 2022 at 2:49 AM Zhongkun He
<hezhongkun.hzk@xxxxxxxxxxxxx> wrote:
>
> There is usecase that System Management Software(SMS) want to give a
> memory policy to other processes to make better use of memory.
>
> The information about how to use memory is not known to the app.
> Instead, it is known to the userspace daemon(SMS), and that daemon
> will decide the memory usage policy based on different factors.
>
> To solve the issue, this patch introduces a new syscall
> pidfd_set_mempolicy(2). it sets the NUMA memory policy of the thread
> specified in pidfd.
>
> In current process context there is no locking because only the process
> accesses its own memory policy, so task_work is used in
> pidfd_set_mempolicy() to update the mempolicy of the process specified
> in pidfd, avoid using locks and race conditions.
>
> The API is as follows,
>
> long pidfd_set_mempolicy(int pidfd, int mode,
> const unsigned long __user *nmask,
> unsigned long maxnode,
> unsigned int flags);
>
> Set's the [pidfd] task's "task/process memory policy". The pidfd argument
> is a PID file descriptor (see pidfd_open(2) man page) that specifies the
> process to which the mempolicy is to be applied. The flags argument is
> reserved for future use; currently, this argument must be specified as 0.
> Please see the set_mempolicy(2) man page for more details about
> other's arguments.
>
> Suggested-by: Michal Hocko <mhocko@xxxxxxxx>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@xxxxxxxxxxxxx>
> ---
> .../admin-guide/mm/numa_memory_policy.rst | 21 ++++-
> arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> arch/arm/tools/syscall.tbl | 1 +
> arch/arm64/include/asm/unistd.h | 2 +-
> arch/arm64/include/asm/unistd32.h | 3 +-
> arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
> arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> arch/s390/kernel/syscalls/syscall.tbl | 1 +
> arch/sh/kernel/syscalls/syscall.tbl | 1 +
> arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> include/linux/mempolicy.h | 11 +++
> include/linux/syscalls.h | 4 +
> include/uapi/asm-generic/unistd.h | 5 +-
> kernel/sys_ni.c | 1 +
> mm/mempolicy.c | 89 +++++++++++++++++++
> 24 files changed, 146 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 5a6afecbb0d0..b864dd88b2d2 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -408,9 +408,10 @@ follows:
> Memory Policy APIs
> ==================
>
> -Linux supports 4 system calls for controlling memory policy. These APIS
> -always affect only the calling task, the calling task's address space, or
> -some shared object mapped into the calling task's address space.
> +Linux supports 5 system calls for controlling memory policy. The first four
> +APIS affect only the calling task, the calling task's address space, or some
> +shared object mapped into the calling task's address space. The last one can
> +set the mempolicy of task specified in pidfd.
>
> .. note::
> the headers that define these APIs and the parameter data types for
> @@ -473,6 +474,20 @@ closest to which page allocation will come from. Specifying the home node overri
> the default allocation policy to allocate memory close to the local node for an
> executing CPU.
>
> +Set [pidfd Task] Memory Policy::
> +
> + long sys_pidfd_set_mempolicy(int pidfd, int mode,
> + const unsigned long __user *nmask,
> + unsigned long maxnode,
> + unsigned int flags);
> +
> +Set's the [pidfd] task's "task/process memory policy". The pidfd argument is
> +a PID file descriptor (see pidfd_open(2) man page) that specifies the process
> +to which the mempolicy is to be applied. The flags argument is reserved for
> +future use; currently, this argument must be specified as 0. Please see the
> +set_mempolicy(2) man page for more details about other's arguments.
> +
> +
>
> Memory Policy Command Line Interface
> ====================================
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 3515bc4f16a4..d86baa5a6eca 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -490,3 +490,4 @@
> 558 common process_mrelease sys_process_mrelease
> 559 common futex_waitv sys_futex_waitv
> 560 common set_mempolicy_home_node sys_ni_syscall
> +561 common pidfd_set_mempolicy sys_ni_syscall
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index ac964612d8b0..a7ccab8aafef 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -464,3 +464,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 037feba03a51..64a514f90131 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -39,7 +39,7 @@
> #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
> #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
>
> -#define __NR_compat_syscalls 451
> +#define __NR_compat_syscalls 452
> #endif
>
> #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 604a2053d006..2e178e9152e6 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -907,7 +907,8 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
> __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> #define __NR_set_mempolicy_home_node 450
> __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> -
> +#define __NR_pidfd_set_mempolicy 451
> +__SYSCALL(__NR_pidfd_set_mempolicy, sys_pidfd_set_mempolicy)
> /*
> * Please add new compat syscalls above this comment and update
> * __NR_compat_syscalls in asm/unistd.h.
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index 78b1d03e86e1..4e5bca7a985c 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -371,3 +371,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index b1f3940bc298..8e50bf8ec41d 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -450,3 +450,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index 820145e47350..f48e32532c5f 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -456,3 +456,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 253ff994ed2e..58e7c3140f4a 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -389,3 +389,4 @@
> 448 n32 process_mrelease sys_process_mrelease
> 449 n32 futex_waitv sys_futex_waitv
> 450 n32 set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 n32 pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index 3f1886ad9d80..70090c3ada16 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -365,3 +365,4 @@
> 448 n64 process_mrelease sys_process_mrelease
> 449 n64 futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 n64 pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index 8f243e35a7b2..b0b0bad64fa0 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -438,3 +438,4 @@
> 448 o32 process_mrelease sys_process_mrelease
> 449 o32 futex_waitv sys_futex_waitv
> 450 o32 set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 o32 pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index 8a99c998da9b..4dfd328490ad 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -448,3 +448,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index 2600b4237292..20381463b2b5 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -530,3 +530,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 nospu pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 799147658dee..5e170dca81f6 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -453,3 +453,4 @@
> 448 common process_mrelease sys_process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index 2de85c977f54..bd3a24a9b41f 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -453,3 +453,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index 4398cc6fb68d..bea2b5c6314b 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -496,3 +496,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..97bc70f5a8f7 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
> 448 i386 process_mrelease sys_process_mrelease
> 449 i386 futex_waitv sys_futex_waitv
> 450 i386 set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 i386 pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..ba6d19dbd8f8 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 52c94ab5c205..9f7c563da4fb 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -421,3 +421,4 @@
> 448 common process_mrelease sys_process_mrelease
> 449 common futex_waitv sys_futex_waitv
> 450 common set_mempolicy_home_node sys_set_mempolicy_home_node
> +451 common pidfd_set_mempolicy sys_pidfd_set_mempolicy
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 668389b4b53d..f3c522e8eb38 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -54,6 +54,17 @@ struct mempolicy {
> } w;
> };
>
> +/*
> + * The information of mempolicy used by task_work
> + * to update it dynamically.
> + */
> +struct mpol_worker {
> + unsigned short mode;
> + unsigned short flags;
> + nodemask_t nodes;
> + struct rcu_head update_work;
> +};
> +
> /*
> * Support for managing mempolicy data objects (clone, copy, destroy)
> * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..e611c088050b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,10 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
> asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
> unsigned long home_node,
> unsigned long flags);
> +asmlinkage long sys_pidfd_set_mempolicy(int pidfd, int mode,
> + const unsigned long __user *nmask,
> + unsigned long maxnode,
> + unsigned int flags);
>
> /*
> * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..c38013bbf5b0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
> #define __NR_set_mempolicy_home_node 450
> __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>
> +#define __NR_pidfd_set_mempolicy 451
> +__SYSCALL(__NR_pidfd_set_mempolicy, sys_pidfd_set_mempolicy)
> +
> #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..5053deb888f7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -299,6 +299,7 @@ COND_SYSCALL(set_mempolicy);
> COND_SYSCALL(migrate_pages);
> COND_SYSCALL(move_pages);
> COND_SYSCALL(set_mempolicy_home_node);
> +COND_SYSCALL(pidfd_set_mempolicy);
>
> COND_SYSCALL(perf_event_open);
> COND_SYSCALL(accept4);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b73d3248d976..2ce9be1a9dfd 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -102,6 +102,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/printk.h>
> #include <linux/swapops.h>
> +#include <linux/task_work.h>
>
> #include <asm/tlbflush.h>
> #include <asm/tlb.h>
> @@ -1572,6 +1573,94 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
> return kernel_set_mempolicy(mode, nmask, maxnode);
> }
>
> +/*
> + * In process context there is no locking, so task_work is used to update
> + * the mempolicy of process specified in pidfd, avoid using locks.
> + */
> +static void pidfd_update_mpol(struct callback_head *cb)
> +{
> + struct mpol_worker *worker = container_of(cb, struct mpol_worker, update_work);
> +
> + do_set_mempolicy(worker->mode, worker->flags, &worker->nodes);
> + kfree(worker);
> +}
> +
> +/* Set the memory policy of the process specified in pidfd. */
> +static long do_pidfd_set_mempolicy(int pidfd, int mode, const unsigned long __user *nmask,
> + unsigned long maxnode, unsigned int flags)
> +{
> + nodemask_t nodes, task_nodes, tmp;
> + struct mpol_worker *worker;
> + struct task_struct *task;
> + unsigned short mode_flags;
> + int ret, lmode = mode;
> + unsigned int f_flags;
> +
> + if (flags)
> + return -EINVAL;
> +
> + task = pidfd_get_task(pidfd, &f_flags);
> + if (IS_ERR(task))
> + return PTR_ERR(task);
> + /*
> + * Check if this process has the right to modify the specified
> + * process.
> + */
> + if (!ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS)) {
> + ret = -EPERM;
> + goto out;
> + }
> +
> + /*
> + * Require CAP_SYS_NICE for influencing process performance.
> + * User's task is allowed only.
> + */
> + if (!capable(CAP_SYS_NICE) || (task->flags & PF_KTHREAD)) {
> + ret = -EPERM;
> + goto out;
> + }
> +
> + ret = get_nodes(&nodes, nmask, maxnode);
> + if (ret)
> + goto out;
> +
> + ret = sanitize_mpol_flags(&lmode, &mode_flags);
> + if (ret)
> + goto out;
> +
> + task_nodes = cpuset_mems_allowed(task);
> +
> + if (mode_flags & MPOL_F_RELATIVE_NODES)
> + mpol_relative_nodemask(&tmp, &nodes, &task_nodes);
> + else
> + nodes_and(tmp, nodes, task_nodes);
> +
> + if (nodes_empty(tmp)) {
> + ret = -EINVAL;
> + goto out;
> + }
> + worker = kzalloc(sizeof(struct mpol_worker), GFP_KERNEL | __GFP_NOWARN);
> + if (!worker) {
> + ret = -ENOMEM;
> + goto out;
> + }
> + init_task_work(&worker->update_work, pidfd_update_mpol);
> + worker->flags = mode_flags;
> + worker->mode = lmode;
> + worker->nodes = nodes;
> +
> + ret = task_work_add(task, &worker->update_work, TWA_SIGNAL);
> +out:
> + put_task_struct(task);
> + return ret;
> +}
> +
> +SYSCALL_DEFINE5(pidfd_set_mempolicy, int, pidfd, int, mode, const unsigned long __user *, nmask,
> + unsigned long, maxnode, unsigned int, flags)
> +{
> + return do_pidfd_set_mempolicy(pidfd, mode, nmask, maxnode, flags);
> +}
> +
> static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
> const unsigned long __user *old_nodes,
> const unsigned long __user *new_nodes)
> --
> 2.25.1
>
>