[PATCH RFC POC] pidfd: make ptrace and pidfd work together

From: Nick Alcock
Date: Fri Jun 11 2021 - 08:09:34 EST


These changes are a proof-of-concept to show how pidfd might be usable
to determine when a waitpid on a ptraced process would yield a response
without blocking. This can be useful if you want a 'ptrace helper'
thread (or process) that accepts messages regarding what ptrace work
needs doing and also ptrace-monitors other processes to do that work.
It is obviously best to do the interprocess messaging part via a file
descriptor, but that means you have to poll() it, and if you're stuck in
poll() you can't wake up in time when one of your ptrace-monitored
processes needs attention, since waitpid does not yield an fd. This
situation is tailor-made for pidfds!

Up until now I've handled this with a really rather ugly hack on the old
(long-rejected) waitfd, see
e.g. https://github.com/oracle/dtrace-linux-kernel/commit/81e92377b8e2f89fc0cde898738321dd869f21d2.
This commit is nasty in all sorts of ways and pidfds would be better in
every respect were it not that right now pidfds don't wake up pollers in
response to ptrace-induced waits.[1]

This patch fixes it! But... it's not good enough, because fixing it by
just lifting those restrictions breaks pidfd's interface contract in two
ways: of necessity it causes poll() to wake up if a ptraced child is
ready to respond, which might confuse existing users that use ptrace with
unexpected wakeups, but also ptrace's signalling is thread-directed, and
relates to single threads, so we must lift the restriction that pidfds
can only be used on thread-group leaders.

I don't know why this restriction was imposed, so I don't know what
ramifications lifting it might have. So both are best placed under pidfd
flags to make sure that pidfd users only get this behaviour if they want
it, perhaps named PIDFD_THREADS and PIDFD_PTRACE: to avoid possible
problems with non-ptrace-directed pthread usage, perhaps PIDFD_THREADS
should currently only be allowed in conjunction with PIDFD_PTRACE.
But... PIDFD_PTRACE needs to take effect after pidfd_create, at poll time
(in order to invoke do_notify_pidfd only when the flag is passed), but
I can see nowhere obvious to stash this flag -- and I'm not even sure this
approach makes sense. So this certainly needs fixing and this patch must
for now just stand as evidence that having waitfds wake up ptraced waiters
is *really easy*. I'd be very happy if you could suggest any way to get
flags from waitfd_open down into poll, so this could turn into something
that doesn't break the interface contract and is actually useful to other
people...

A trivial testcase (with a trivial singlethreaded child, and no
messaging layer) is provided so you can see the sort of thing I'm trying
to make work.

[1] It is also true that pidfds integrate with *waitid()*, not
waitpid(), and waitid() cannot be used to interact with ptrace --
but that's OK, because you can use a pidfd awakening from poll as
proof that waitpid() would return with ptraced info if called, and
then call waitpid() on the PID directly, without going via the
pidfd. This is not ideal, since it would be nice to be able to use
waitid with P_PIDFD to get the info directly without also needing
the associated process ID, but as long as waitid() masks out and
throws away information returned by waitpid() for ptraced waits, it
is the best we can do. This is implied by the ptrace manpage, which
never mentions waitid, because only waitpid works.

Signed-off-by: Nick Alcock <nick.alcock@xxxxxxxxxx>
---
kernel/fork.c | 9 +-
kernel/pid.c | 5 +-
kernel/signal.c | 10 ++-
tools/testing/selftests/pidfd/.gitignore | 1 +
tools/testing/selftests/pidfd/Makefile | 2 +-
.../selftests/pidfd/pidfd_ptrace_test.c | 82 +++++++++++++++++++
6 files changed, 93 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd_ptrace_test.c

Christian: I briefly mentioned this after your talk at the 2019 LPC,
and have finally got round to trying it and coming up with a testcase.
With this patch, things work! Only of course they break everyone else
so we need to find a way not to do that :)

The full-sized monster userspace code that inspired this is here:
<https://github.com/oracle/dtrace-utils/blob/dev/libdtrace/dt_proc.c#L939>
and here for the ptrace side
<https://github.com/oracle/dtrace-utils/blob/dev/libproc/Pcontrol.c#L717>.
(The ad-hoc protocol this implements to get ptrace requests from the
rest of the process is here, though you will probably not find this
too interesting except as evidence that if this thread is processing
those messages it can't also be spending all its time blocked inside
a blocking waitpid:
<https://github.com/oracle/dtrace-utils/blob/dev/libdtrace/dt_proc.c#L1124>)
Changes needed to make it work with waitpids instead of waitfds:
<https://github.com/oracle/dtrace-utils/commit/1c15a5c4ac26507f6bdb53cf86a1c4d2baba2390>
(Yes, that's *it*.)

diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd725cb..0ba53b2858a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1787,14 +1787,7 @@ static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)

poll_wait(file, &pid->wait_pidfd, pts);

- /*
- * Inform pollers only when the whole thread group exits.
- * If the thread group leader exits before all other threads in the
- * group, then poll(2) should block, similar to the wait(2) family.
- */
- if (thread_group_exited(pid))
- poll_flags = EPOLLIN | EPOLLRDNORM;
-
+ poll_flags = EPOLLIN | EPOLLRDNORM;
return poll_flags;
}

diff --git a/kernel/pid.c b/kernel/pid.c
index ebdf9c60cd0b..98703b44c29f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -596,10 +596,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
if (!p)
return -ESRCH;

- if (pid_has_task(p, PIDTYPE_TGID))
- fd = pidfd_create(p, flags);
- else
- fd = -EINVAL;
+ fd = pidfd_create(p, flags);

put_pid(p);
return fd;
diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..2b2693e7f934 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1940,11 +1940,12 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
return ret;
}

-static void do_notify_pidfd(struct task_struct *task)
+static void do_notify_pidfd(struct task_struct *task, int warn)
{
struct pid *pid;

- WARN_ON(task->exit_state == 0);
+ if (warn)
+ WARN_ON(task->exit_state == 0);
pid = task_pid(task);
wake_up_all(&pid->wait_pidfd);
}
@@ -1973,7 +1974,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
(tsk->group_leader != tsk || !thread_group_empty(tsk)));

/* Wake up all pidfd waiters */
- do_notify_pidfd(tsk);
+ do_notify_pidfd(tsk, 1);

if (sig != SIGCHLD) {
/*
@@ -2083,6 +2084,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
parent = tsk->real_parent;
}

+ /* Wake up all pidfd waiters */
+ do_notify_pidfd(tsk, 0);
+
clear_siginfo(&info);
info.si_signo = SIGCHLD;
info.si_errno = 0;
diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 973198a3ec3d..ebb5c30ba691 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -6,3 +6,4 @@ pidfd_wait
pidfd_fdinfo_test
pidfd_getfd_test
pidfd_setns_test
+pidfd_ptrace_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index f4a2f28f926b..c2dd13fd9557 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -2,7 +2,7 @@
CFLAGS += -g -I../../../../usr/include/ -pthread

TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test \
- pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test
+ pidfd_poll_test pidfd_wait pidfd_getfd_test pidfd_setns_test pidfd_ptrace_test

include ../lib.mk

diff --git a/tools/testing/selftests/pidfd/pidfd_ptrace_test.c b/tools/testing/selftests/pidfd/pidfd_ptrace_test.c
new file mode 100644
index 000000000000..29c58190a466
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_ptrace_test.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <sys/ptrace.h>
+#include <poll.h>
+#include <signal.h>
+#include <unistd.h>
+
+#include "pidfd.h"
+#include "../kselftest.h"
+
+#ifndef PTRACE_O_EXITKILL
+#define PTRACE_O_EXITKILL (1 << 20)
+#endif
+
+int main(int argc, char **argv)
+{
+ int pidfd, forkblock[2];
+ pid_t pid;
+ int ret = 1;
+ struct pollfd pfd[1];
+ const char *err;
+
+ ksft_set_plan(1);
+
+ if (pipe(forkblock) < 0) {
+ ksft_print_msg("%s - failed to set up pipes\n", strerror(errno));
+ goto on_error;
+ }
+
+ if ((pid = fork()) < 0) {
+ ksft_print_msg("%s - failed to fork\n", strerror(errno));
+ goto on_error;
+ }
+
+ if (pid == 0) { /* child */
+ int dummy;
+ close(forkblock[1]);
+ read(forkblock[0], &dummy, 1);
+ close(forkblock[0]);
+ execlp("true", "true", NULL);
+ _exit(127);
+ }
+ close(forkblock[0]);
+
+ if (ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_EXITKILL |
+ PTRACE_O_TRACEEXEC | PTRACE_O_TRACEEXIT) < 0) {
+ ksft_print_msg("%s - failed to ptrace\n");
+ kill(pid, SIGKILL);
+ goto on_error;
+ }
+ close(forkblock[1]);
+
+ pidfd = sys_pidfd_open(pid, 0);
+ if (pidfd < 0) {
+ ksft_print_msg(
+ "%s - failed to to open pidfd for pid %i\n",
+ strerror(errno), pid);
+ goto on_error;
+ }
+
+ pfd[0].events = POLLIN;
+ pfd[0].fd = pidfd;
+
+ while (errno = EINTR,
+ poll((struct pollfd *) pfd, 1, 10000) <= 0 && errno == EINTR)
+ continue;
+
+ if (pfd[0].revents == 0) {
+ ksft_test_result_fail("timed out: pfd polling insensitive to ptrace awakens: failed\n");
+ } else {
+ ksft_test_result_pass("pfd polling sensitive to ptrace awakens: passed\n");
+ }
+
+ ret = 0;
+
+on_error:
+ if (pidfd >= 0)
+ close(pidfd);
+
+ return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}

base-commit: 231bc539066760aaa44d46818c85b14ca2f56d9f
prerequisite-patch-id: 230338bd1e3bed095efd9b46607113f64b99bd97
--
2.31.0.253.gdec51257f3