[REPORT] futex: private hash refcount corruption and UAF via CLONE_VM PR_FUTEX_HASH race
From: 钱一铭
Date: Thu Apr 30 2026 - 05:39:28 EST
Hello,
I found a race in the futex private hash implementation that can
corrupt the private-hash reference state and free a live `struct
futex_private_hash`. I can trigger a KASAN slab-use-after-free in
`futex_hash_put()` from an unprivileged process inside QEMU.
## Summary
`futex_hash_allocate()` lazily initializes `current->mm->futex_ref`
without serialization. The code comment assumes the allocation is
always performed by the first thread, but `clone(CLONE_VM | SIGCHLD)`
creates another task that shares the same `mm_struct` while not
setting `CLONE_THREAD`. Such tasks bypass
`need_futex_hash_allocate_default()` and can concurrently call
`prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, ...)` while
`mm->futex_ref` is still `NULL`.
As a result, multiple tasks sharing one `mm` can race the
`mm->futex_ref = alloc_percpu(...)` publication. Private hash objects
may then have their initial/reference activity accounted against a
percpu counter that is no longer reachable through `mm->futex_ref`.
Later `futex_ref_get()` and `futex_ref_put()` use the current
`mm->futex_ref`, not a counter tied to the `struct futex_private_hash`
whose bucket is being used. This lets the RCU pivot/refcount state
machine incorrectly conclude that an old private hash is dead, call
`kvfree_rcu(fph, rcu)`, and leave a futex waiter with a bucket pointer
into the freed bucket array.
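To make the accounting mismatch concrete, here is a minimal userspace sketch of one possible interleaving. It is a sequential model, not kernel code: `struct mm_model`, `publish_counter()`, `ref_get()`, `ref_put()` and `model_lost_reference()` are hypothetical names, and a single plain `long` stands in for the percpu counter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the per-mm state; not kernel code. */
struct mm_model {
    long *futex_ref;    /* models mm->futex_ref as one plain counter */
};

/* Models the unlocked publication: allocate, take the initial ref, publish. */
long *publish_counter(struct mm_model *mm)
{
    long *c = calloc(1, sizeof(*c));

    assert(c != NULL);
    *c = 1;             /* initial reference */
    mm->futex_ref = c;  /* overwrites whatever was published before */
    return c;
}

/* Both helpers act on whichever counter is currently published. */
void ref_get(struct mm_model *mm) { (*mm->futex_ref)++; }

/* Returns 1 when the currently published counter reads as dead. */
int ref_put(struct mm_model *mm) { return --(*mm->futex_ref) == 0; }

/* Returns 1 if the model declares the hash dead while a waiter is counted. */
int model_lost_reference(void)
{
    struct mm_model mm = { NULL };
    long *first = publish_counter(&mm);   /* task A saw NULL and published */
    ref_get(&mm);                         /* waiter's ref lands on first */
    long *second = publish_counter(&mm);  /* task B also saw NULL earlier */
    int dead = ref_put(&mm);              /* initial ref dropped on second */
    int waiter_still_counted = (*first == 2);

    free(first);
    free(second);
    return dead && waiter_still_counted;
}
```

In this model the waiter's reference stays on the orphaned counter, so the published counter reaches zero while the hash is still in use.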
## Affected code
Tested latest upstream head:
- `3b3bea6d4b9c162f9e555905d96b8c1da67ecd5b` (`7.1.0-rc1-g3b3bea6d4b9c`)
The same relevant code is present in the UAF-triggering head:
- `254f49634ee16a731174d2ae34bc50bd5f45e731` (`7.1.0-rc1-g254f49634ee1`)
Important locations from latest upstream:
- `kernel/futex/core.c:172`: `futex_hash_put()` dereferences `hb->priv`.
- `kernel/futex/core.c:257`: `__futex_pivot_hash()` frees the old
private hash via `kvfree_rcu()`.
- `kernel/futex/core.c:302`: `futex_hash()` returns a private-hash
bucket and takes the private hash reference.
- `kernel/futex/core.c:1682`: `futex_ref_get()` accounts references
through `mm->futex_ref`.
- `kernel/futex/core.c:1696`: `futex_ref_put()` drops references
through `mm->futex_ref`.
- `kernel/futex/core.c:1804`: unlocked lazy `if (!mm->futex_ref)` check.
- `kernel/futex/core.c:1809`: unlocked assignment to `mm->futex_ref`.
- `kernel/fork.c:1952`: default private futex hash allocation only
covers `CLONE_THREAD | CLONE_VM`.
- `kernel/fork.c:2387`: default allocation site in `copy_process()`.
- `kernel/futex/waitwake.c:624`: `CLASS(hb, hb)(&q->key)` holds the
bucket reference in `futex_wait_setup()`.
## Root cause
The private futex hash reference state is per-`mm`, but the first
allocation of `mm->futex_ref` is not protected by
`mm->futex_hash_lock` or another publish-once primitive. The invariant
assumed by the comment at `kernel/futex/core.c:1806` does not hold for
`CLONE_VM` tasks that are not `CLONE_THREAD` tasks:
1. A parent process forks a child for each attempt (the "attempt process").
2. The attempt process creates multiple `clone(CLONE_VM | SIGCHLD)`
children. They share one `mm_struct` but are separate thread groups.
3. Because `CLONE_THREAD` is absent,
`need_futex_hash_allocate_default()` has not pre-initialized the
private futex hash/ref state.
4. Several children concurrently invoke `prctl(PR_FUTEX_HASH,
PR_FUTEX_HASH_SET_SLOTS, ...)`.
5. More than one task can observe `mm->futex_ref == NULL`, allocate a
percpu counter, assign `mm->futex_ref`, and increment its own new
counter.
6. The private hash pivot/refcount code later sums and mutates
whichever percpu counter is currently stored in `mm->futex_ref`, which
may not be the counter that received earlier references.
7. `futex_ref_is_dead()` can become true for a private hash that is
still used by a waiter, allowing `__futex_pivot_hash()` to free the
old hash.
8. A waiter blocked in `futex_wait_setup()` on a userfaultfd-delayed
futex word later exits the `CLASS(hb, hb)` scope and calls
`futex_hash_put()` on a bucket inside the freed hash object.
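Step 5 is a check-then-store on a shared pointer. As an illustration of what a publish-once primitive looks like, here is a userspace C11 sketch. It is not a proposed patch: the kernel would use its own primitives (for example taking `mm->futex_hash_lock` around the first allocation), and `struct mm_model` and `futex_ref_publish_once()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the per-mm state; not kernel code. */
struct mm_model {
    _Atomic(long *) futex_ref;
};

/*
 * Publish-once lazy initialization: every caller gets the same counter,
 * no matter how many of them observed NULL at the same time.
 */
long *futex_ref_publish_once(struct mm_model *mm)
{
    long *cur = atomic_load(&mm->futex_ref);
    long *expected = NULL;
    long *fresh;

    if (cur)
        return cur;                 /* already published */

    fresh = calloc(1, sizeof(*fresh));
    if (!fresh)
        return NULL;

    /* Only one caller can swing the pointer from NULL to its allocation. */
    if (atomic_compare_exchange_strong(&mm->futex_ref, &expected, fresh))
        return fresh;

    /* Lost the race: drop our copy and use the winner's counter. */
    free(fresh);
    return expected;
}
```

The losing caller frees its own allocation before anything has been accounted against it, so no reference can end up on an unreachable counter.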
## Trigger conditions
The PoC uses only unprivileged operations and requires neither
`CAP_SYS_PTRACE` nor `vm.unprivileged_userfaultfd=1`:
- `clone(CLONE_VM | SIGCHLD)` children, not `CLONE_THREAD`.
- `prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, ...)` from multiple
CLONE_VM children.
- private `futex(FUTEX_WAIT_PRIVATE)` on a missing
`UFFD_USER_MODE_ONLY` userfaultfd page to hold `futex_wait_setup()`
after it has acquired a private hash bucket reference.
- more `PR_FUTEX_HASH_SET_SLOTS` calls while the waiter is
fault-blocked to force private-hash transitions.
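The first condition can be checked in isolation. The sketch below is not the attached PoC and does not touch futexes or `prctl()`. It only demonstrates, with the glibc `clone()` wrapper, that a `CLONE_VM | SIGCHLD` child writes into the parent's memory while running in its own thread group. The helper name `clone_vm_child_is_separate_group()` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Written by the child; visible to the parent only because of CLONE_VM. */
static volatile pid_t child_tgid;

static int child_fn(void *arg)
{
    (void)arg;
    child_tgid = getpid();  /* getpid() returns the thread group id */
    return 0;
}

/* Returns 1 if the child shared our memory but had its own thread group. */
int clone_vm_child_is_separate_group(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    pid_t pid;
    int status;

    if (!stack)
        return -1;
    child_tgid = 0;

    /* The stack grows down on x86, so pass the top of the buffer. */
    pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
    if (pid < 0) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, &status, 0) != pid) {
        free(stack);
        return -1;
    }
    free(stack);

    return child_tgid == pid && child_tgid != getpid();
}
```

A return value of 1 means the child's store was visible to the parent (shared `mm`) and its `getpid()` differed from the parent's (separate thread group).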
## KASAN evidence
The KASAN UAF was reproduced inside QEMU on upstream `254f49634ee16`,
which has the same relevant futex/fork code as current upstream. The
triggering process runs as uid 1000:
```text
BUG: KASAN: slab-use-after-free in futex_hash_put+0x130/0x160
Read of size 8 at addr ffff888006e4b218 by task test/154
CPU: 1 UID: 1000 PID: 154 Comm: test Not tainted 7.1.0-rc1-g254f49634ee1 #3 PREEMPT(full)
...
futex_hash_put+0x130/0x160
futex_wait_setup+0x2c3/0x450
__futex_wait+0x134/0x270
futex_wait+0xc8/0x390
do_futex+0x1af/0x240
__x64_sys_futex+0x172/0x440
```
Allocation/free attribution from the same report:
```text
Allocated by task 176:
__kvmalloc_node_noprof+0x1cf/0x650
futex_hash_allocate+0x1dd/0xd80
__do_sys_prctl+0xcf5/0x16c0
Freed by task 0:
__kasan_slab_free+0x43/0x70
__rcu_free_sheaf_prepare+0x65/0x240
rcu_free_sheaf+0x1b/0x100
rcu_core+0x511/0x1930
```
The full log is saved as `qemu_run_02.log`; a compact excerpt is
attached as `kasan_uaf_futex_hash_put_excerpt.log`.
## Latest upstream verification
On latest upstream `3b3bea6d4b9c`, the same PoC reached the same
broken reference-state paths. An 80-attempt run triggered both
refcount-state WARNs:
```text
WARNING: kernel/futex/core.c:1607 at futex_ref_rcu+0x218/0x2c0
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 7.1.0-rc1-g3b3bea6d4b9c #4 PREEMPT(full)
```
and:
```text
WARNING: kernel/futex/core.c:1556 at __futex_ref_atomic_begin+0xde/0x120
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Tainted: G W 7.1.0-rc1-g3b3bea6d4b9c #4 PREEMPT(full)
```
I did not get another KASAN UAF on latest within that run, but there
is no relevant code change between the UAF head and latest head in
`kernel/futex/core.c`, `kernel/fork.c`, `include/linux/mm_types.h`, or
`include/uapi/linux/prctl.h`. The latest WARNs demonstrate that the
same refcount corruption is still reachable.
For the latest-upstream serial-console run I used:
```sh
qemu-system-x86_64 \
-m 2G \
-kernel ./bzImage \
-initrd rootfs.cpio \
-append 'console=ttyS0 earlyprintk=serial,ttyS0,115200 loglevel=7 kaslr panic=-1' \
-monitor none \
-serial stdio \
-display none \
-cpu kvm64,+smep,+smap \
-smp cores=4,threads=1 \
-no-reboot
```
## Known-issue check
I checked local `oldvulns.md`, `race-vulns`, and `vuln-desc2`; no
matching `futex_hash_put`/`PR_FUTEX_HASH` UAF entry was found.
I also checked Patchwork with these keywords and got no matching patch results:
- `PR_FUTEX_HASH`
- `futex_hash_put`
- `mm->futex_ref`
- `futex_private_hash`
- `futex_hash_allocate CLONE_VM`
There is an existing syzbot report titled `KASAN: slab-use-after-free
Read in futex_unqueue`
(`syzbot+6c1861115b4253e45969@xxxxxxxxxxxxxxxxxxxxxxxxx`). That one
reports a different crash site (`futex_unqueue`) and its proposed
patch only changes `futex_hash_free()` to use `kvfree_rcu()`. The
issue reported here is a separate `mm->futex_ref`
initialization/refcount mismatch that frees an old private hash
through the pivot path and crashes at `futex_hash_put()` after
`PR_FUTEX_HASH` races.
Any fix needs to cover tasks created with `CLONE_VM` but without
`CLONE_THREAD` as well, because such tasks share the `mm_struct` while
bypassing the current default allocation path.
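The gap can be stated as two flag predicates. The sketch below models the pre-allocation condition as described in this report (both `CLONE_THREAD` and `CLONE_VM` set) next to a plain "shares the mm" check. It is an illustration, not kernel code and not a proposed patch. The `MODEL_*` constants mirror the UAPI values of `CLONE_VM` and `CLONE_THREAD`, and both function names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Same values as CLONE_VM and CLONE_THREAD in the Linux UAPI headers. */
#define MODEL_CLONE_VM     0x00000100UL
#define MODEL_CLONE_THREAD 0x00010000UL

/* Models the default pre-allocation condition described in this report. */
bool model_prealloc_thread_and_vm(unsigned long flags)
{
    unsigned long need = MODEL_CLONE_THREAD | MODEL_CLONE_VM;

    return (flags & need) == need;
}

/* True whenever the new task shares the parent's mm_struct. */
bool model_shares_mm(unsigned long flags)
{
    return (flags & MODEL_CLONE_VM) != 0;
}
```

For `MODEL_CLONE_VM | SIGCHLD` the first predicate is false and the second is true, which is exactly the set of tasks that reach `PR_FUTEX_HASH_SET_SLOTS` with `mm->futex_ref` still `NULL`.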
Relevant files:
- `poc_futex_priv_hash_clonevm.c`
- `kasan_uaf_futex_hash_put_excerpt.log`
- `kernel.config`
## PoC update note
The attached PoC now opens userfaultfd with `UFFD_USER_MODE_ONLY`, so
an unprivileged run no longer depends on
`/proc/sys/vm/unprivileged_userfaultfd` or `CAP_SYS_PTRACE`.