Race condition in bpf_arena fault handler leads to page table / range tree desynchronization
From: Afi0
Date: Sun May 17 2026 - 02:23:51 EST
Hi list,
Apologies for initially sending only to Greg. Resending to the full list as requested.
Component: kernel/bpf/arena.c
Function: arena_vm_fault()
Affected versions: Linux kernel 6.9+
Type: TOCTOU / Race condition
CVSS 3.1: AV:L/AC:H/PR:L/UI:N/S:C/C:H/I:H/A:H - 7.8 (High)
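As a cross-check of the 7.8 figure, the CVSS 3.1 base-score formula for a changed-scope vector can be worked through in a few lines of C. This is only a sketch: the function names are illustrative, and the metric weights are the published CVSS 3.1 constants, hard-coded for the vector above.

```c
/* CVSS 3.1 "Roundup": smallest one-decimal value >= x, computed on
 * integers to avoid floating-point artifacts. */
static double cvss_roundup(double x)
{
	long scaled = (long)(x * 100000.0 + 0.5);

	if (scaled % 10000 == 0)
		return scaled / 100000.0;
	return (scaled / 10000 + 1) / 10.0;
}

/* Base score for Scope:Changed vectors; arguments are metric weights. */
static double cvss31_base_changed(double av, double ac, double pr, double ui,
				  double c, double i, double a)
{
	double iss = 1.0 - (1.0 - c) * (1.0 - i) * (1.0 - a);
	double t = iss - 0.02, t15 = 1.0;
	double impact, expl, sum;

	for (int n = 0; n < 15; n++)	/* (iss - 0.02)^15 without libm */
		t15 *= t;
	impact = 7.52 * (iss - 0.029) - 3.25 * t15;
	expl = 8.22 * av * ac * pr * ui;
	if (impact <= 0.0)
		return 0.0;
	sum = 1.08 * (impact + expl);
	return cvss_roundup(sum < 10.0 ? sum : 10.0);
}

/* AV:L/AC:H/PR:L/UI:N/S:C/C:H/I:H/A:H expressed as CVSS 3.1 weights. */
static double report_vector_score(void)
{
	return cvss31_base_changed(0.55, 0.44, 0.68, 0.85, 0.56, 0.56, 0.56);
}
```

report_vector_score() evaluates to 7.8 for this vector. The PR:L weight is 0.68 rather than 0.62 because the scope is changed.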
SUMMARY
A TOCTOU race condition exists in arena_vm_fault() between the vmalloc_to_page() check and the subsequent range_tree_clear() call. The two operations are intended to be atomic with respect to page allocation state, but they are not protected by a common critical section. This desynchronizes the kernel virtual memory mappings from the arena's internal range tree allocator, so a physical page can remain accessible through a user VMA after it has been freed back to the page allocator.
VULNERABLE CODE
arena_vm_fault() in kernel/bpf/arena.c:
page = vmalloc_to_page((void *)kaddr);
if (page)
goto out;
[race window: concurrent arena_alloc_pages() can map a page at same pgoff here]
ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
IMPACT
The range tree reports the pgoff as available while the PTE remains populated. arena_free_pages() may then free the physical page while the user VMA mapping persists, so the page is returned to the page allocator but stays accessible through the user mapping. Observed as a segfault (error 4) in dmesg.
TRIGGER
Reachable by unprivileged users when kernel.unprivileged_bpf_disabled=0 (default on Ubuntu < 23.04, Debian, Fedora); always reachable with CAP_BPF. The trigger is two concurrent operations on the same pgoff: Thread A faults the page in through the mmap'ed region while Thread B calls bpf_arena_free_pages() from a sleepable BPF prog during the window.
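To check which of the two cases applies on a given machine, the sysctl can be read from procfs. The following is a minimal sketch; the helper names are illustrative, and the value meanings (0 = allowed, 1 = disabled until reboot, 2 = disabled but changeable by an administrator) follow the kernel sysctl documentation.

```c
#include <stdio.h>

/* Read a small integer sysctl from procfs; returns -1 if it cannot be read. */
static int read_sysctl_int(const char *path)
{
	FILE *f = fopen(path, "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* 0 = unprivileged bpf() allowed; 1 and 2 = disabled (1 cannot be reverted
 * without a reboot, 2 can be changed by an administrator). */
static int unpriv_bpf_allowed(int sysctl_val)
{
	return sysctl_val == 0;
}

/* Returns 1 if allowed, 0 if disabled, -1 if the sysctl is unreadable. */
static int check_unpriv_bpf(void)
{
	int v = read_sysctl_int("/proc/sys/kernel/unprivileged_bpf_disabled");

	return v < 0 ? -1 : unpriv_bpf_allowed(v);
}
```

A return value of 1 from check_unpriv_bpf() corresponds to the unprivileged case described above; 0 means only the CAP_BPF path applies.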
SUGGESTED FIX
The vmalloc_to_page() check and range_tree_clear() must occur within the same critical section. arena->lock is already used by arena_vm_open/close and is appropriate here. arena_vm_fault() runs in sleepable context, so taking a mutex is safe.
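The general pattern (check and claim under one lock) can be illustrated with a small userspace model. This is not kernel code: struct slot_model, fault_claim() and run_race() are hypothetical names, the pthread mutex only stands in for arena->lock, and the two booleans stand in for the vmalloc_to_page() result and the range tree state of a single pgoff.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the state of one pgoff (hypothetical model). */
struct slot_model {
	pthread_mutex_t lock;	/* plays the role of arena->lock */
	bool mapped;		/* stands in for vmalloc_to_page() != NULL */
	bool rt_free;		/* stands in for "range tree marks pgoff free" */
	int claims;		/* how many callers claimed the slot */
};

/* Check and claim inside one critical section; true if this caller won. */
static bool fault_claim(struct slot_model *s)
{
	bool claimed = false;

	pthread_mutex_lock(&s->lock);
	if (!s->mapped && s->rt_free) {
		s->rt_free = false;	/* claim in the "range tree" */
		s->mapped = true;	/* install the "mapping" */
		s->claims++;
		claimed = true;
	}
	pthread_mutex_unlock(&s->lock);
	return claimed;
}

static void *worker(void *arg)
{
	fault_claim(arg);
	return NULL;
}

/* Race up to 64 threads on one slot; returns the total number of claims. */
static int run_race(int nthreads)
{
	struct slot_model s = { .mapped = false, .rt_free = true, .claims = 0 };
	pthread_t tids[64];
	int started = 0;

	if (nthreads > 64)
		nthreads = 64;
	pthread_mutex_init(&s.lock, NULL);
	for (int i = 0; i < nthreads; i++)
		if (pthread_create(&tids[started], NULL, worker, &s) == 0)
			started++;
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	pthread_mutex_destroy(&s.lock);
	return s.claims;
}
```

Because the check and the state update happen under the same lock, run_race() yields exactly one claim no matter how the threads interleave (provided at least one thread starts). Moving the check outside the lock would reintroduce the check-then-act window.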
Patch attached as 0001-bpf-arena-fix-TOCTOU-race-in-arena_vm_fault.patch
Fixes: a7d032218a53 ("bpf: Introduce bpf_arena")
Thanks,
Afi0
From: Afi0 <capyenglishlite@xxxxxxxxx>
Date: Sat, 16 May 2026 11:58:00 +0000
Subject: [PATCH] bpf: arena: fix TOCTOU race in arena_vm_fault()
The vmalloc_to_page() check and range_tree_clear() in arena_vm_fault()
are not protected by a common critical section. A concurrent
bpf_arena_free_pages() call on the same pgoff can return the physical
page to the allocator between these two operations. arena_vm_fault()
then inserts a stale or already-freed page into the user PTE, resulting
in a SIGSEGV on next access or a silent use-after-free.
Fix: acquire arena->lock before vmalloc_to_page() and hold it through
range_tree_clear(), making the check-and-claim atomic with respect to
concurrent allocators and free operations.
arena->lock is a mutex already used by arena_vm_open() and
arena_vm_close() for vma_list serialization. Reusing it here is
consistent with the existing locking model and avoids introducing a
new lock. arena_vm_fault() runs in page fault context with
mmap_read_lock held and is sleepable, so taking a mutex is safe.
The pte_none() check inside apply_range_set_cb() is not a sufficient
guard: it prevents double-mapping but does not prevent the range tree
desynchronization that occurs when the race is lost, leaving pgoff
marked free while the PTE remains populated.
Fixes: a7d032218a53 ("bpf: Introduce bpf_arena")
Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
Cc: Andrii Nakryiko <andrii@xxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Afi0 <capyenglishlite@xxxxxxxxx>
---
kernel/bpf/arena.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index a1b2c3d..e4f5c6d 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -XXX,7 +XXX,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
struct bpf_map *map = vmf->vma->vm_file->private_data;
struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
struct page *page;
- long kbase, kaddr;
+ long kbase, kaddr;
int ret;
kbase = bpf_arena_get_kern_vm_start(arena);
@@ -XXX,12 +XXX,24 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
kbase = bpf_arena_get_kern_vm_start(arena);
kaddr = kbase + (u32)(vmf->address);
+ /*
+ * Acquire arena->lock before vmalloc_to_page() and hold it through
+ * range_tree_clear() to close the TOCTOU window.
+ *
+ * Without this lock, a concurrent bpf_arena_free_pages() on the
+ * same pgoff can run between vmalloc_to_page() returning NULL and
+ * range_tree_clear() completing:
+ *
+ * arena_vm_fault() bpf_arena_free_pages()
+ * vmalloc_to_page() = NULL
+ * [window] page freed, PTE zeroed in kern vma
+ * range_tree_clear(pgoff)
+ * alloc_page() + vm_insert_page() -> stale PTE in user vma
+ *
+ * The user VMA then holds a reference to a freed physical page.
+ * Next access produces SIGSEGV or silent use-after-free.
+ */
+ guard(mutex)(&arena->lock);
+
page = vmalloc_to_page((void *)kaddr);
if (page)
goto out;
-
ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
if (ret)
return VM_FAULT_SIGBUS;
--
2.39.0