Re: [PATCH v4 2/2] fanotify: allow reporting pidfds for reaped tasks

From: Christian Brauner

Date: Wed Jun 10 2026 - 03:45:07 EST

On Sun, Jun 07, 2026 at 07:45:21AM +0800, AnonymeMeow wrote:
> On 2026-06-05 09:54 +0200, Christian Brauner wrote:
> > On Fri, Jun 05, 2026 at 04:29:47AM +0800, AnonymeMeow wrote:
> > > On 2026-06-03 13:25 +0200, Christian Brauner wrote:
> > > > On Wed, Jun 03, 2026 at 12:58:09PM +0200, Amir Goldstein wrote:
> > > > > On Wed, Jun 3, 2026 at 12:28 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Wed, Jun 03, 2026 at 12:00:42PM +0200, Amir Goldstein wrote:
> > > > > > > On Wed, Jun 3, 2026 at 5:11 AM AnonymeMeow <anonymemeow@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Sashiko said that pidfs_register_pid() can fail under memory pressure,
> > > > > > > > since it allocates memory using GFP_KERNEL without a way to pass in
> > > > > > > > __GFP_NOFAIL. This can cause fanotify_alloc_event() to silently drop
> > > > > > > > the event even when the group has an unlimited queue length.
> > > > > > > >
> > > > > > > > Should we make pidfs registration best-effort instead and still return
> > > > > > > > the allocated event if pidfs_register_pid() fails? But IMO this will
> > > > > > > > make the API semantically ambiguous. If the kernel guarantees that the
> > > > > > > > event pid is registered with pidfs when pidfd reporting is requested,
> > > > > > > > then a later FAN_NOPIDFD or -ESRCH can only be caused by the task being
> > > > > > > > invisible within the reader's pid namespace. But if we allow best-effort
> > > > > > > > registration, the event may still carry a pinned struct pid, but pidfs
> > > > > > > > registration may have failed due to memory pressure. If userspace reads
> > > > > > > > the event after the task has been reaped, it still gets a FAN_NOPIDFD
> > > > > > > > or -ESRCH due to failed pidfs registration caused by memory pressure,
> > > > > > > > IMO this should be reported as a -ENOMEM instead of -ESRCH. But
> > > > > > > > currently it seems we don't have a good way to register the pid with
> > > > > > > > __GFP_NOFAIL or __GFP_RETRY_MAYFAIL, so what would be the preferred
> > > > > > > > approach to handle this situation?
> > > > > > >
> > > > > > > Thanks for noticing the oddity and explaining the options.
> > > > > > >
> > > > > > > We could fail the event in case on ENOMEM because the system
> > > > > > > is likely is a bad shape anyway and other events are likely to fail
> > > > > > > allocation themselves, so I am not objecting to your patch as is,
> > > > > > >
> > > > > > > but I am slightly leaning towards the semantically ambiguous API.
> > > > > >
> > > > > > As usual, I think a semantically ambiguous api should be avoided.
> > > > > >
> > > > > > > I don't think that we are going to change the man page to say that
> > > > > > > FAN_REPORT_PIDFD is guaranteed to return a pidfd, maybe we just
> > > > > > > add a NOTE about increased likelihood of getting a pidfd since kernel XXX
> > > > > > > and mention the known cases of pid namespace and ENOMEM.
> > > > > >
> > > > > > The pid namespace shouldn't matter.
> > > > >
> > > > > DO you mean that the event reader inside a pidns always gets a pidfd but
> > > > > it will not be able to query the pid of the process outside the pidns
> > > > > or do you mean something else?
> > > >
> > > > So they will always get a pidfd since fanotify is correctly using
> > > > pidfd_prepare() directly. But when the reader tries to resolve the pidfd
> > > > to say a pid or retrieve info via the pidfd info ioctl they would fail
> > > > to do so (well, the pid will resolve to 0 which is the global indicator
> > > > for "can't be resolved in your pidns").
> > > >
> > >
> > > Before pidfd_prepare(), fanotify checks whether metadata.pid is
> > > non-zero, i.e. whether the event pid is visible in the reader's pidns.
> > > If it's not visible, fanotify will not create a valid pidfd and return a
> > > FAN_NOPIDFD. This is the existing behavior, and my patches do not change
> > > it.
> >
> > We can hand that out. That's another change we should consider. The
> > pidfd works even if the pid namespace is outside the caller's hierarchy.
> >
>
> I may be missing something, but I am concerned that lifting this
> restriction could have security implications... Because it could weaken
> the isolation provided by pidns. In the usual case, a task inside a
> pidns cannot actively obtain a reference to tasks outside of that ns. If
> fanotify is allowed to hand out pidfds across pidns boundaries, a task
> inside the ns may be able to obtain access to resources that were
> supposed to be isolated from it, such as other tasks' namespaces.
>
> One possible attack surface I can think of is container escape. I wrote
> a small toy PoC which escapes from a podman container using this
> mechanism. Admittedly, the PoC requires sudo as well as SYS_ADMIN and
> SYS_PTRACE caps, so it is not practical by itself. However, I am worried
> that in other circumstances this could become a real attack surface.
>
> I'm no expert in this area, so I do not think I am in a good position to
> make that call myself... For now, I'm not planning to include removal of
> this restriction in my patch set.

You would need ptrace_may_access() rights to the process for that to work.
If this is a real concern we should simply scope to the hierarchy
independent of the fanotify changes. Thanks for looking into this
closely. I appreciate it.