[PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression

From: A. Sterling

Date: Fri Jun 12 2026 - 07:05:45 EST


Commit cb12fd8e0dab ("pidfd: add pidfs") moved pidfds from anonymous
inodes to a pseudo filesystem, with the intent that pid->stashed would
cache the dentry+inode across multiple pidfd_open() calls for the same
struct pid. Commit b28ddcc32d8f ("pidfs: convert to path_from_stashed()
helper") realized that intent via the stashed_dentry_get() fast-path.

In practice the stash cache never hits for serial open/close patterns.
pidfs dentries are anonymous (d_alloc_anon) and the superblock raises
DCACHE_DONTCACHE, so dput() at close() drops the refcount to zero and
immediately destroys the dentry. __dentry_kill -> d_prune ->
stashed_dentry_prune cmpxchg()es pid->stashed back to NULL, and the
last iput() runs pidfs_evict_inode(). The next pidfd_open() finds an
empty stash slot and reallocates both the inode (new_inode_pseudo +
simple_inode_init_ts + sops->init_inode) and the dentry (d_alloc_anon
+ d_instantiate + cmpxchg into the stash slot).

Measured with a small benchmark that times a single pidfd_open()
syscall on a live process, best-of-k per run (written for this
investigation as part of an extended fork of the LEBench suite,
Ren et al. SOSP'19; the test is not part of upstream LEBench).
QEMU/KVM x86_64 guest, 5 runs per kernel, RSD <1%, kernels v5.19
to v7.1-rc2:

v5.19 v6.0 v6.9 (pidfs) v7.1-rc2
pidfd_open kbest (us) 0.40 0.39 [step] 0.50

i.e. ~25% slower than the pre-pidfs anon_inode baseline, persistent
through current. The cost is a kmem_cache_alloc on inode_cachep + a
clock read (simple_inode_init_ts) + a kmem_cache_alloc on dentry_cache
per call, all of which can be elided when the stash cache hits.

With this patch applied to v7.1-rc4 (same guest, median kbest
across 5 boots, kernels built with the pidfd selftest config
requirements enabled):

v7.1-rc4 stock v7.1-rc4 + patch
pidfd_open kbest (us) 0.489 0.349 (-28.6%)

All eight buildable pidfd kselftest suites (41 tests, including the
file_handle suite that exercises the stashed-dentry path with
CLONE_NEWUSER children) pass identically on the stock and patched
kernels.

The patched kernel is faster than the pre-pidfs anon_inode baseline
(0.40 us at v5.19/v6.0): a stash hit reuses the cached dentry and
inode outright, where the anon_inode path still performed per-call
file setup against the shared inode.

Fix by holding an extra reference to the stashed dentry for as long as
the task is alive. pidfs_alloc_file() takes a dget() on the first
successful path_from_stashed() per pid, recorded by a new
PIDFS_ATTR_BIT_STASH_PINNED bit in pid->attr->attr_mask. pidfs_exit()
-- which already runs in sleepable context from release_task() --
drops the reference.

pidfs_alloc_file() is reachable after pidfs_exit() has already run:
SO_PEERPIDFD and SCM_PIDFD hold struct pid references on sockets and
can allocate pidfds for already-reaped peers (the PIDFD_STALE path).
A pin taken on that path could never be dropped, since pidfs_exit()
will not run again. Both the pin and the unpin therefore take
pid->wait_pidfd.lock -- the lock pidfs_exit() already uses to
synchronize with pidfs_register_pid() -- and the pin is skipped once
PIDFS_ATTR_BIT_EXIT is set:

- pin before exit's unpin section: the unpin sees the bit, drops
the reference. Balanced.
- exit's unpin section before the pin attempt: PIDFS_ATTR_BIT_EXIT
is set before the unpin section, so the later allocation observes
it under the lock and does not pin. Stale allocations behave as
they do today (per-call inode+dentry), which is acceptable: the
cache targets live processes.

The unpin's dput() runs outside the lock (pidfs_exit() is sleepable;
dput must not run under a spinlock). While pinned, the reference
keeps the dentry alive, so stashed_dentry_prune() cannot run and
pid->stashed still points at the pinned dentry when pidfs_exit()
reads it.

Unpinning at pidfs_free_pid() instead is not possible: the pinned
dentry holds the inode, the inode holds a pid reference
(pidfs_evict_inode() does put_pid()), so the pid refcount cannot
reach zero while the pin is held and pidfs_free_pid() would never
run.

Memory cost is one inode (~800 B) + one dentry (~200 B) per pid that
has ever had a pidfd allocated, held until release_task() instead of
until the last fd close. Bounded by pid_max. Tasks that never have
a pidfd allocated are unaffected.

The existing VFS_WARN_ON_ONCE(pid->stashed) assertion in
pidfs_free_pid() is preserved: the exit-time dput triggers the
existing teardown chain (__dentry_kill -> stashed_dentry_prune ->
pidfs_evict_inode) before pidfs_free_pid runs.

Fixes: cb12fd8e0dab ("pidfd: add pidfs")
Assisted-by: Claude:claude-fable-5
Signed-off-by: A. Sterling <adam.sterling@xxxxxxxxx>
---
fs/pidfs.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -44,8 +44,9 @@
}

enum pidfs_attr_mask_bits {
- PIDFS_ATTR_BIT_EXIT = 0,
- PIDFS_ATTR_BIT_COREDUMP = 1,
+ PIDFS_ATTR_BIT_EXIT = 0,
+ PIDFS_ATTR_BIT_COREDUMP = 1,
+ PIDFS_ATTR_BIT_STASH_PINNED = 2,
};

struct pidfs_anon_attr {
@@ -725,6 +726,7 @@
{
struct pid *pid = task_pid(tsk);
struct pidfs_attr *attr;
+ struct dentry *unpin = NULL;
#ifdef CONFIG_CGROUPS
struct cgroup *cgrp;
#endif
@@ -765,6 +767,23 @@
/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
smp_wmb();
set_bit(PIDFS_ATTR_BIT_EXIT, &attr->attr_mask);
+
+ /*
+ * Drop the per-pid stash reference taken by pidfs_alloc_file()
+ * (see PIDFS_ATTR_BIT_STASH_PINNED). The wait_pidfd.lock guard
+ * orders this against the pin in pidfs_alloc_file(): once
+ * PIDFS_ATTR_BIT_EXIT is visible under the lock, no new pin can
+ * be taken. dput() runs outside the lock; on the final
+ * reference it triggers __dentry_kill -> stashed_dentry_prune
+ * -> pidfs_evict_inode, the existing teardown path.
+ */
+ scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
+ if (test_and_clear_bit(PIDFS_ATTR_BIT_STASH_PINNED,
+ &attr->attr_mask))
+ unpin = READ_ONCE(pid->stashed);
+ }
+ if (unpin)
+ dput(unpin);
}

#ifdef CONFIG_COREDUMP
@@ -1151,6 +1170,32 @@

VFS_WARN_ON_ONCE(!pid->attr);

+ /*
+ * Pin the stashed dentry for the lifetime of the task so that
+ * repeat pidfd_open() on the same pid reuses the cached dentry
+ * + inode instead of reallocating both each time. Anonymous
+ * pidfs dentries don't enter the dcache LRU (DCACHE_DONTCACHE),
+ * so without an explicit reference the dput() at close()
+ * immediately destroys the dentry and, via
+ * stashed_dentry_prune, the inode.
+ *
+ * Skip the pin once PIDFS_ATTR_BIT_EXIT is set: pidfs_exit()
+ * has run (or is running) and will not run again, so a
+ * reference taken now could never be dropped (stale
+ * allocations via SO_PEERPIDFD reach here after
+ * release_task()). The wait_pidfd.lock guard makes the EXIT
+ * check and the pin atomic against the unpin in pidfs_exit().
+ */
+ scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
+ struct pidfs_attr *attr = pid->attr;
+
+ if (!IS_ERR_OR_NULL(attr) &&
+ !test_bit(PIDFS_ATTR_BIT_EXIT, &attr->attr_mask) &&
+ !test_and_set_bit(PIDFS_ATTR_BIT_STASH_PINNED,
+ &attr->attr_mask))
+ dget(path.dentry);
+ }
+
flags &= ~PIDFD_STALE;
flags |= O_RDWR;
pidfd_file = dentry_open(&path, flags, current_cred());
--
2.45.1