Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru

From: Jeff Layton

Date: Wed Apr 08 2026 - 14:38:17 EST


On Wed, 2026-04-08 at 07:42 +0100, Al Viro wrote:
> On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> > We've had a number of panics that seem to occur on hosts with heavy
> > process churn. The symptoms are a panic when invalidating /proc entries
> > as a task is exiting:
> >
> > queued_spin_lock_slowpath+0x153/0x270
> > shrink_dentry_list+0x11d/0x220
> > shrink_dcache_parent+0x68/0x110
> > d_invalidate+0x90/0x170
> > proc_invalidate_siblings_dcache+0xc8/0x140
> > release_task+0x41b/0x510
> > do_exit+0x3d8/0x9d0
> > do_group_exit+0x7d/0xa0
> > get_signal+0x2a9/0x6a0
> > arch_do_signal_or_restart+0x1a/0x1c0
> > syscall_exit_to_user_mode+0xe6/0x1c0
> > do_syscall_64+0x74/0x130
> > entry_SYSCALL_64_after_hwframe+0x4b/0x53
> >
> > The problem appears to be a UAF. It's freeing a shrink list of
> > dentries, but one of the dentries on it has already been freed.
>
> That, or dentry pointer passed to shrink_dcache_parent() is a
> complete garbage - e.g. due to struct pid having already been
> freed. Might make sense to try and get a crash dump and poke
> around...
>

Chris was able to track down a vmcore for me.

No, it actually does seem to be what we thought originally. The parent
is fine, but one of the dentries under it has been freed and
reallocated:

>>> stack
#0 queued_spin_lock_slowpath (kernel/locking/qspinlock.c:471:3)
#1 spin_lock (./include/linux/spinlock.h:351:2)
#2 lock_for_kill (fs/dcache.c:675:3)
#3 shrink_dentry_list (fs/dcache.c:1086:8)
#4 shrink_dcache_parent (fs/dcache.c:0)
#5 d_invalidate (fs/dcache.c:1614:2)
#6 proc_invalidate_siblings_dcache (fs/proc/inode.c:142:5)
#7 proc_flush_pid (fs/proc/base.c:3478:2)
#8 release_task (kernel/exit.c:279:2)
#9 exit_notify (kernel/exit.c:775:3)
#10 do_exit (kernel/exit.c:958:2)
#11 do_group_exit (kernel/exit.c:1087:2)
#12 get_signal (kernel/signal.c:3036:3)
#13 arch_do_signal_or_restart (arch/x86/kernel/signal.c:337:6)
#14 exit_to_user_mode_loop (kernel/entry/common.c:111:4)
#15 exit_to_user_mode_prepare (./include/linux/entry-common.h:329:13)
#16 __syscall_exit_to_user_mode_work (kernel/entry/common.c:207:2)
#17 syscall_exit_to_user_mode (kernel/entry/common.c:218:2)
#18 do_syscall_64 (arch/x86/entry/common.c:89:2)
#19 entry_SYSCALL_64+0x6c/0xaa (arch/x86/entry/entry_64.S:121)
#20 0x7f49ead2c482
>>> identify_address(stack[3]["dentry"])
'slab object: kmalloc-96+0x48'
>>> identify_address(stack[4]["parent"])
'slab object: dentry+0x0'

...it turns out that Gustavo had been chasing this independently of me,
and had Claude do a bit more analysis. I've included it below, but here's
a link that may be more readable. Any thoughts?

https://markdownpastebin.com/?id=7c258413493b4144ab27d5cdcb8ae5b4

-------------8<----------------------

## dcache: `shrink_dcache_parent()` livelock leading to use-after-free

### Summary

A race between concurrent proc dentry invalidation (`proc_flush_pid` →
`d_invalidate` → `shrink_dcache_parent`) and the global dentry shrinker
(`drop_caches` / memory pressure → `prune_dcache_sb`) causes
`shrink_dcache_parent()` to loop indefinitely. This livelock is the
root cause of the use-after-free crash observed in production (see
P2260313060 for the original crash analysis).

### How the bug manifests

**In production** (narrow race window): The livelock occasionally
resolves through specific timing that allows a parent dentry to be
freed and its slab page reused. When a sibling's `__dentry_kill` then
tries `spin_lock(&parent->d_lock)` on the reused memory → page fault in
`queued_spin_lock_slowpath` (Oops).

**With `CONFIG_DCACHE_SHRINK_RACE_DEBUG`** (5ms delay in
`__dentry_kill`): The race is deterministic. `shrink_dcache_parent()`
livelocks on the first iteration and never completes.

### Root cause

In `select_collect()` (the `d_walk` callback used by
`shrink_dcache_parent`), two types of dentries are incorrectly counted
as "found":

1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
`lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
`dentry_unlist()` to remove the dentry from the parent's children list.
With the debug delay, the dentry stays dead-but-visible for 5ms.

2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
own dispose list. These are being processed by that other path but
slowly (5ms per proc dentry with the debug delay).

When `select_collect` counts these as `found++`,
`shrink_dcache_parent()` sees `data.found > 0` and loops again. But
these dentries can never be collected onto `data.dispose` (dead ones
have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
set), so the loop never makes progress → **infinite loop**.

```
shrink_dcache_parent() loop:

    for (;;) {
        d_walk(parent, &data, select_collect);
        if (!list_empty(&data.dispose)) {
            shrink_dentry_list(&data.dispose); // never reached
            continue;
        }
        if (!data.found)
            break; // never reached because found > 0
        // ... loops forever
    }
```

### Reproducer

**Requirements:**
- `CONFIG_DCACHE_SHRINK_RACE_DEBUG=y` (injects 5ms `mdelay()` in
`__dentry_kill` for proc dentries)
- `CONFIG_KASAN=y` (optional, for UAF detection)
- `CONFIG_DEBUG_KERNEL=y`

**Debug patch** (apply to `fs/Kconfig` and `fs/dcache.c`):

```diff
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -9,6 +9,15 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
 	bool
 
+config DCACHE_SHRINK_RACE_DEBUG
+	bool "Debug: inject delay in __dentry_kill to widen race window"
+	depends on DEBUG_KERNEL
+	default n
+	help
+	  Inject a delay in __dentry_kill() between releasing d_lock and
+	  re-acquiring it, to make the shrink_dentry_list race reproducible
+	  in test environments. Only enable for testing.
+
 config VALIDATE_FS_PARSER

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,6 +32,7 @@
 #include <linux/list_lru.h>
+#include <linux/delay.h>
 #include "internal.h"

@@ -630,6 +631,16 @@ static struct dentry *__dentry_kill(...)
 	cond_resched();
+#ifdef CONFIG_DCACHE_SHRINK_RACE_DEBUG
+	/*
+	 * Delay proc dentry kills to keep dead dentries in the tree
+	 * longer. With the bug (count < 0 counted as "found" in
+	 * select_collect), d_walk keeps re-finding dead dentries and
+	 * shrink_dcache_parent() loops forever.
+	 */
+	if (dentry->d_sb->s_magic == 0x9fa0 /* PROC_SUPER_MAGIC */)
+		mdelay(5);
+#endif
 	/* now that it's negative, ->d_parent is stable */
```

**Test program** (`test_dcache_race.sh`):

The reproducer creates multi-threaded processes, populates their
`/proc/<pid>/task/<tid>/...` dcache entries, then SIGKILLs them while
simultaneously running `drop_caches` in tight loops. This creates
concurrent `proc_flush_pid` (from dying threads) and `prune_dcache_sb`
(from `drop_caches`) paths competing on the same proc dentries.

```c
/* Key structure:
 * - Fork child with N threads (creates /proc/<pid>/task/<tid>/... entries)
 * - Parent reads all /proc entries to populate dcache
 * - Background threads continuously do: echo 2 > /proc/sys/vm/drop_caches
 * - SIGKILL child -> all threads exit -> concurrent proc_flush_pid
 * - drop_caches shrinker races with proc_flush_pid on same dentries
 */
```

Parameters used: 50 threads/process, 200 iterations, 4 shrinker
threads, 4 reader threads.

**vmtest.toml:**
```toml
[[target]]
name = "dcache-shrink-race"
kernel = "arch/x86/boot/bzImage"
kernel_args = "hung_task_panic=0 softlockup_panic=0 rcupdate.rcu_cpu_stall_suppress=1"
command = "/mnt/vmtest/test_dcache_race.sh"

[target.vm]
memory = "16G" # KASAN needs extra memory
num_cpus = 8
timeout = 1200
```

### Reproduction results

| Kernel | Result |
|---|---|
| Unfixed + debug delay + KASAN | **FAIL**: livelock on iteration 1, test timed out at 750s |
| Fixed + debug delay + KASAN | **PASS**: all 200 iterations completed, no KASAN/warnings |

### Fix

The fix is in `select_collect()` — stop counting dead dentries and
`DCACHE_SHRINK_LIST` dentries as "found":

```diff
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1448,13 +1459,27 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	if (data->start == dentry)
 		goto out;
 
-	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
-		data->found++;
+	if (dentry->d_lockref.count < 0) {
+		/*
+		 * Dead dentry (lockref_mark_dead sets count negative).
+		 * Another CPU is in the middle of __dentry_kill() and
+		 * will shortly unlink it from the tree. Do not count
+		 * it as "found" --- that causes shrink_dcache_parent()
+		 * to loop indefinitely.
+		 */
+	} else if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/*
+		 * Already on a shrink list, being processed by another
+		 * path (e.g., the global LRU shrinker). Do not count
+		 * it as "found" --- if the other path is slow (e.g.,
+		 * contention on d_lock or filesystem callbacks),
+		 * shrink_dcache_parent() would spin forever waiting for
+		 * them to finish. The other shrinker will handle these
+		 * dentries.
+		 */
 	} else if (!dentry->d_lockref.count) {
 		to_shrink_list(dentry, &data->dispose);
 		data->found++;
-	} else if (dentry->d_lockref.count < 0) {
-		data->found++;
 	}
```

**Why this is correct:**

- **Dead dentries (`count < 0`)**: These are being killed by another
CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
them from the parent's children list. `shrink_dcache_parent()` doesn't
need to wait for them — they'll disappear from the tree on their own.

- **`DCACHE_SHRINK_LIST` dentries**: These are already on another
shrinker's dispose list and will be processed by that path. Counting
them as "found" forces `shrink_dcache_parent()` to wait for the other
shrinker to finish, which can take arbitrarily long (especially with
filesystem callbacks or the debug delay).

- **The `select_collect2` path** (used when `data.found > 0` but
`data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
separately by setting `data->victim` and processing them directly. With
this fix, `select_collect2` is only reached when there are genuinely
unprocessable dentries (count > 0, not dead, not on shrink list), not
when there are merely in-flight kills or concurrent shrinkers.

### Relationship to the production UAF crash

The livelock is the **precursor** to the use-after-free crash seen in
production (P2260313060):

1. Without the debug delay, the `__dentry_kill` race window is
nanoseconds (just `cond_resched()`).
2. Most of the time, the dead dentry is unlinked before
`select_collect` finds it → no issue.
3. Occasionally, `select_collect` finds dead dentries and spins
briefly. During this spinning, the specific timing allows a parent
dentry to be fully freed (via `dentry_free` → `call_rcu` → slab
reclaim) and its slab page reused for `kmalloc-96`.
4. When the spinning `shrink_dcache_parent` or a concurrent
`__dentry_kill` then accesses the freed parent → UAF crash.

The fix prevents the spinning entirely, eliminating both the livelock
and the UAF.