Re: [Report] Race Condition in text_poke_bp_batch/poke_int3_handler

From: Peter Zijlstra
Date: Tue Dec 03 2024 - 05:08:17 EST


On Tue, Dec 03, 2024 at 04:58:50PM +0900, Seohyeon Maeng wrote:

> [ 24.729808] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014

What QEMU version and setup are you using?

There have been QEMU bugs around there. Can you reproduce on real
hardware? Because I can't seem to trigger this...


> A kernel panic occurs when the following code is executed during live
> patching. In this scenario, an int3 trap can be triggered.
>
> static inline void perf_event_task_sched_out(struct task_struct *prev,
> 					     struct task_struct *next)
> {
> 	[...]
> 	if (static_branch_unlikely(&perf_sched_events))
> 		__perf_event_task_sched_out(prev, next);
> }
>
> noinstr int poke_int3_handler(struct pt_regs *regs)
> {
> 	[...]
> 	desc = try_get_desc(); // [1]
> 	if (!desc)
> 		return 0;
> 	[...]
> 	if (unlikely(desc->nr_entries > 1)) {
> 		tp = __inline_bsearch(ip, desc->vec, desc->nr_entries,
> 				      sizeof(struct text_poke_loc),
> 				      patch_cmp);
> 		if (!tp)
> 			goto out_put;
> 	} else {
> 		tp = desc->vec;
> 		if (text_poke_addr(tp) != ip)
> 			goto out_put;
> 	}
> 	[...]
> out_put:
> 	put_desc();
> 	return ret;
> }
>
> During the Interrupt Handler (poke_int3_handler) processing, the patch
> function may be entered, resulting in an improper reference count
> (refcount). This can cause the reference count to be incorrectly set,
> and the bp_desc.vec and bp_desc.nr_entries are reinitialized, leading
> to a loss of critical information and subsequent failures in handling
> the int3 trap.
>
> static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
> {
> 	[...]
> 	lockdep_assert_held(&text_mutex);
>
> 	bp_desc.vec = tp;
> 	bp_desc.nr_entries = nr_entries;
>
> 	/*
> 	 * Corresponds to the implicit memory barrier in try_get_desc() to
> 	 * ensure reading a non-zero refcount provides up to date bp_desc data.
> 	 */
> 	atomic_set_release(&bp_desc.refs, 1); // [2]
> 	[...]
> 	/*
> 	 * Remove and wait for refs to be zero.
> 	 */
> 	if (!atomic_dec_and_test(&bp_desc.refs)) // [3]
> 		atomic_cond_read_acquire(&bp_desc.refs, !VAL);
> 	[...]
> }
>
> As demonstrated above, bp_desc and its refcount can be modified while
> poke_int3_handler is executing, leading to unexpected behavior.
>
> Consider a scenario where two CPUs concurrently execute the sequence
> [1] -> [2] -> [3] -> [1], with overlapping operations on the reference
> count. When [3] is executed, the refcount may drop to zero. As a
> result, when [1] attempts to retrieve the descriptor, it fails,
> leading to a kernel panic.

I'm failing to see how this can happen. The text_poke_bp() caller should
hold text_mutex, there SHOULD be no concurrency on [2]/[3].

So there is a single CPU doing text_poke_bp():

	mutex_lock(&text_mutex);
	text_poke_bp_batch()
	  lockdep_assert_held(&text_mutex);
	  atomic_set_release(&bp_desc.refs, 1);		[2]
	  smp_wmb();

	  poke-first-byte-to-INT3			[A]

	  text_poke_sync();

	  poke-tail-bytes

	  text_poke_sync();

	  poke-first-byte

	  text_poke_sync();				[B]

	  if (!atomic_dec_and_test(&bp_desc.refs))	[3]
	    atomic_cond_read_acquire(&bp_desc.refs, !VAL);
	mutex_unlock(&text_mutex);


The only concurrency is multiple CPUs hitting the INT3, which exists
between [A] and [B], and notably, in that range the reference count
should be very much >= 1.

And [3] very specifically waits for all pre-existing interrupt handlers
to complete; at point [B] the INT3 is gone and no new handlers can
possibly start.

The INT3 handler (poke_int3_handler()) has the following cases:

 - the boring case, INT3 is observed right after [A], it gets a ref, does
   the emulation and completes before [3].

 - the tail case, INT3 is observed somewhere before [B], it gets a ref,
   does the emulation but completes after [B], in which case [3] will wait
   for it.
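
For illustration only, the reference counting in the two cases above can
be modelled in user space with C11 atomics. This is a sketch, not the
kernel code: all model_* names are hypothetical, and it assumes that
try_get_desc() behaves as an increment-unless-zero on bp_desc.refs and
that put_desc() is a plain decrement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for bp_desc.refs; not the kernel's atomic_t. */
static atomic_int model_refs;

/* Handler side, [1]: take a reference only if the count is non-zero. */
static bool model_try_get_desc(void)
{
	int old = atomic_load(&model_refs);

	while (old != 0) {
		/* On failure 'old' is reloaded and the zero check repeats. */
		if (atomic_compare_exchange_weak(&model_refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Handler side, out_put: drop the reference taken above. */
static void model_put_desc(void)
{
	int old = atomic_fetch_sub(&model_refs, 1);

	assert(old > 0);
}

/* Patcher side, [2]: publish the descriptor with one reference. */
static void model_publish(void)
{
	atomic_store_explicit(&model_refs, 1, memory_order_release);
}

/* Patcher side, [3]: drop own reference; true means no handler is left. */
static bool model_dec_and_test(void)
{
	return atomic_fetch_sub(&model_refs, 1) == 1;
}
```

In the tail case a handler calls model_try_get_desc() before the patcher
reaches model_dec_and_test(), so the latter returns false and the patcher
has to wait until model_put_desc() brings the count to zero.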

Hmm, there *might* be an issue when:

 - INT3 triggers right before [B], poke_int3_handler()'s try_get_desc() is
   delayed until after [3].

But that is not what you were describing, is it? I think that case is
made impossible by text_poke_sync() itself: it sends an IPI to all CPUs,
and completion of that IPI would block on the completion of the INT3
which triggered right before [B].

And after the sync-IPI that CPU must not observe INT3 anymore.
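
To make concrete what the delayed case would look like if the sync-IPI
did not rule it out, here is a small user-space model of the handler's
early-exit path. It is a sketch under assumptions: the model_* names are
hypothetical, only the single-entry (nr_entries == 1) path is modelled,
and 'ip' is taken to be the address of the patched instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the patching descriptor. */
struct model_desc {
	atomic_int refs;
	unsigned long addr;	/* the one patched address */
};

static struct model_desc model_bp_desc;

/* Assumed semantics of try_get_desc(): increment unless zero. */
static struct model_desc *model_try_get(void)
{
	int old = atomic_load(&model_bp_desc.refs);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&model_bp_desc.refs,
						 &old, old + 1))
			return &model_bp_desc;
	}
	return NULL;
}

/* Returns 1 if the trap belongs to the patch site, 0 otherwise. */
static int model_int3_handler(unsigned long ip)
{
	struct model_desc *desc = model_try_get();
	int ret = 0;

	if (!desc)
		return 0;	/* no active descriptor: not our INT3 */

	if (desc->addr == ip)
		ret = 1;	/* the real handler would emulate here */

	atomic_fetch_sub(&desc->refs, 1);	/* put_desc() */
	return ret;
}
```

With refs at 1 the model claims a trap at the patched address; once the
patcher has dropped its reference at [3], the same trap returns 0, which
is the outcome the sync-IPI ordering is there to prevent.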


If you really think there is a problem here, please describe the code
flow in more detail. But given I can't trigger anything, nor actually
see a hole in the code, I'm going to assume you managed to tickle the
QEMU bug.