Re: [PATCH] mm: use only pidfd for process_madvise syscall

From: Minchan Kim
Date: Mon May 18 2020 - 17:14:12 EST


On Mon, May 18, 2020 at 12:22:56PM -0700, Suren Baghdasaryan wrote:
> On Fri, May 15, 2020 at 6:21 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >
> > Based on discussion[1], people didn't feel we need to support both
> > pid and pidfd for every new coming API[2] so this patch keeps only
> > pidfd. This patch also changes flags's type with "unsigned int".
> > So finally, the API is as follows,
> >
> > ssize_t process_madvise(int pidfd, const struct iovec *iovec,
> > unsigned long vlen, int advice, unsigned int flags);
> >
> > DESCRIPTION
> > The process_madvise() system call is used to give advice or directions
> > to the kernel about the address ranges from external process as well as
> > local process. It provides the advice to address ranges of process
> > described by iovec and vlen. The goal of such advice is to improve system
> > or application performance.
> >
> > The pidfd selects the process referred to by the PID file descriptor
> > specified in pidfd. (See pidofd_open(2) for further information)
> >
> > The pointer iovec points to an array of iovec structures, defined in
> > <sys/uio.h> as:
> >
> > struct iovec {
> > void *iov_base; /* starting address */
> > size_t iov_len; /* number of bytes to be advised */
> > };
> >
> > The iovec describes address ranges beginning at address(iov_base)
> > and with size length of bytes(iov_len).
> >
> > The vlen represents the number of elements in iovec.
> >
> > The advice is indicated in the advice argument, which is one of the
> > following at this moment if the target process specified by idtype and
>
> There is no idtype parameter anymore, so maybe just "if the target
> process is external"?

Fixed.

>
> > id is external.
> >
> > MADV_COLD
> > MADV_PAGEOUT
> > MADV_MERGEABLE
> > MADV_UNMERGEABLE
> >
> > Permission to provide a hint to external process is governed by a
> > ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
> >
> > The process_madvise supports every advice madvise(2) has if target
> > process is in same thread group with calling process so user could
> > use process_madvise(2) to extend existing madvise(2) to support
> > vector address ranges.
> >
> > RETURN VALUE
> > On success, process_madvise() returns the number of bytes advised.
> > This return value may be less than the total number of requested
> > bytes, if an error occurred. The caller should check return value
> > to determine whether a partial advice occurred.
> >
> > [1] https://lore.kernel.org/linux-mm/20200509124817.xmrvsrq3mla6b76k@wittgenstein/
> > [2] https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@xxxxxxxxxxxxx/
> > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> > mm/madvise.c | 42 +++++++++++++-----------------------------
> > 1 file changed, 13 insertions(+), 29 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index d3fbbe52d230..35c9b220146a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1229,8 +1229,8 @@ static int process_madvise_vec(struct task_struct *target_task,
> > return ret;
> > }
> >
> > -static ssize_t do_process_madvise(int which, pid_t upid, struct iov_iter *iter,
> > - int behavior, unsigned long flags)
> > +static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter,
> > + int behavior, unsigned int flags)
> > {
> > ssize_t ret;
> > struct pid *pid;
> > @@ -1241,26 +1241,12 @@ static ssize_t do_process_madvise(int which, pid_t upid, struct iov_iter *iter,
> > if (flags != 0)
> > return -EINVAL;
> >
> > - switch (which) {
> > - case P_PID:
> > - if (upid <= 0)
> > - return -EINVAL;
> > -
> > - pid = find_get_pid(upid);
> > - if (!pid)
> > - return -ESRCH;
> > - break;
> > - case P_PIDFD:
> > - if (upid < 0)
> > - return -EINVAL;
> > -
> > - pid = pidfd_get_pid(upid);
> > - if (IS_ERR(pid))
> > - return PTR_ERR(pid);
> > - break;
> > - default:
> > + if (pidfd < 0)
> > return -EINVAL;
> > - }
> > +
> > + pid = pidfd_get_pid(pidfd);
> > + if (IS_ERR(pid))
> > + return PTR_ERR(pid);
> >
> > task = get_pid_task(pid, PIDTYPE_PID);
> > if (!task) {
> > @@ -1292,9 +1278,8 @@ static ssize_t do_process_madvise(int which, pid_t upid, struct iov_iter *iter,
> > return ret;
> > }
> >
> > -SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
> > - const struct iovec __user *, vec, unsigned long, vlen,
> > - int, behavior, unsigned long, flags)
> > +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> > + unsigned long, vlen, int, behavior, unsigned int, flags)
> > {
> > ssize_t ret;
> > struct iovec iovstack[UIO_FASTIOV];
> > @@ -1303,19 +1288,18 @@ SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid,
> >
> > ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
> > if (ret >= 0) {
> > - ret = do_process_madvise(which, upid, &iter, behavior, flags);
> > + ret = do_process_madvise(pidfd, &iter, behavior, flags);
> > kfree(iov);
> > }
> > return ret;
> > }
> >
> > #ifdef CONFIG_COMPAT
> > -COMPAT_SYSCALL_DEFINE6(process_madvise, compat_int_t, which,
> > - compat_pid_t, upid,
> > +COMPAT_SYSCALL_DEFINE5(process_madvise, compat_int_t, pidfd,
> > const struct compat_iovec __user *, vec,
> > compat_ulong_t, vlen,
> > compat_int_t, behavior,
> > - compat_ulong_t, flags)
> > + compat_int_t, flags)
> >
> > {
> > ssize_t ret;
> > @@ -1326,7 +1310,7 @@ COMPAT_SYSCALL_DEFINE6(process_madvise, compat_int_t, which,
> > ret = compat_import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack),
> > &iov, &iter);
> > if (ret >= 0) {
> > - ret = do_process_madvise(which, upid, &iter, behavior, flags);
> > + ret = do_process_madvise(pidfd, &iter, behavior, flags);
> > kfree(iov);
> > }
> > return ret;
> > --
> > 2.26.2.761.g0e0b3e54be-goog
> >
>
> Reviewed-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>

Thanks for the review, Suren.

Andrew, I sent this patch without folding into previous syscall introducing
patches because it could be arguable. If you want to fold it into each
patchset(i.e., introdcuing process_madvise syscall and introducing
compat_syscall), let me know it. I will send partial diff to each
patchset.