Re: KCSAN: data-race in desc_read / prb_reserve_in_last

From: Petr Mladek

Date: Thu Mar 12 2026 - 07:18:12 EST

On Wed 2026-03-11 14:27:46, Petr Mladek wrote:
> On Wed 2026-03-11 11:31:08, Jianzhou Zhao wrote:
> >
> >
> > Subject: [BUG] printk: KCSAN: data-race in desc_read / prb_reserve_in_last
> >
> > Dear Maintainers,
> >
> > We are writing to report a KCSAN-detected data-race vulnerability in the Linux kernel. This bug was found by our custom fuzzing tool, RacePilot. The bug occurs in the printk ringbuffer system, dealing with concurrent modifications and unannotated reads of the data block logical limits. We observed this on the Linux kernel version 6.18.0-08691-g2061f18ad76e-dirty.
> >
> > Call Trace & Context
> > ==================================================================
> > BUG: KCSAN: data-race in desc_read / prb_reserve_in_last
> >
> > write to 0xffffffff869276a8 of 8 bytes by task 14248 on cpu 0:
> > data_realloc kernel/printk/printk_ringbuffer.c:1252 [inline]
> > prb_reserve_in_last+0x831/0xb20 kernel/printk/printk_ringbuffer.c:1529
> > vprintk_store+0x603/0x980 kernel/printk/printk.c:2283
> > vprintk_emit+0xfd/0x540 kernel/printk/printk.c:2412
> > vprintk_default+0x26/0x30 kernel/printk/printk.c:2451
> > vprintk+0x1d/0x30 kernel/printk/printk_safe.c:82
> > _printk+0x63/0x90 kernel/printk/printk.c:2461
> > disk_unlock_native_capacity block/partitions/core.c:520 [inline]
> > blk_add_partition block/partitions/core.c:543 [inline]
> > blk_add_partitions block/partitions/core.c:633 [inline]
> > bdev_disk_changed block/partitions/core.c:693 [inline]
> > bdev_disk_changed+0xae3/0xeb0 block/partitions/core.c:642
> > loop_reread_partitions+0x44/0xc0 drivers/block/loop.c:449
> > loop_set_status+0x41c/0x580 drivers/block/loop.c:1278
> > loop_set_status64 drivers/block/loop.c:1374 [inline]
> > lo_ioctl+0xf0/0x1170 drivers/block/loop.c:1560
> > blkdev_ioctl+0x377/0x420 block/ioctl.c:707
> > vfs_ioctl fs/ioctl.c:52 [inline]
> > __do_sys_ioctl fs/ioctl.c:605 [inline]
> > __se_sys_ioctl fs/ioctl.c:591 [inline]
> > __x64_sys_ioctl+0x121/0x170 fs/ioctl.c:591
> > x64_sys_call+0xc3a/0x2030 arch/x86/include/generated/asm/syscalls_64.h:17
> > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> > do_syscall_64+0xae/0x2c0 arch/x86/entry/syscall_64.c:94
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > read to 0xffffffff869276a0 of 16 bytes by task 3004 on cpu 1:
> > desc_read+0x115/0x250 kernel/printk/printk_ringbuffer.c:499
> > desc_read_finalized_seq+0x40/0x140 kernel/printk/printk_ringbuffer.c:1972
> > prb_read kernel/printk/printk_ringbuffer.c:2018 [inline]
> > _prb_read_valid+0xc1/0x550 kernel/printk/printk_ringbuffer.c:2213
> > prb_read_valid_info+0x74/0xa0 kernel/printk/printk_ringbuffer.c:2321
> > devkmsg_poll+0xa1/0x120 kernel/printk/printk.c:906
> > vfs_poll include/linux/poll.h:82 [inline]
> > ep_item_poll.isra.0+0xb0/0x110 fs/eventpoll.c:1059
> > ep_send_events+0x231/0x670 fs/eventpoll.c:1818
> > ep_try_send_events fs/eventpoll.c:1905 [inline]
> > ep_poll fs/eventpoll.c:1970 [inline]
> > do_epoll_wait+0x2a8/0x9c0 fs/eventpoll.c:2461
> > __do_sys_epoll_wait fs/eventpoll.c:2469 [inline]
> > __se_sys_epoll_wait fs/eventpoll.c:2464 [inline]
> > __x64_sys_epoll_wait+0xcb/0x190 fs/eventpoll.c:2464
> > x64_sys_call+0x194e/0x2030 arch/x86/include/generated/asm/syscalls_64.h:233
> > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
> > do_syscall_64+0xae/0x2c0 arch/x86/entry/syscall_64.c:94
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Reported by Kernel Concurrency Sanitizer on:
> > CPU: 1 UID: 0 PID: 3004 Comm: systemd-journal Not tainted 6.18.0-08691-g2061f18ad76e-dirty #50 PREEMPT(voluntary)
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> > ==================================================================
> >
> > Execution Flow & Code Context
> > The writer task executes `prb_reserve_in_last()`, which concurrently reallocates and extends space for the newest record data by calling `data_realloc()`. Inside `data_realloc()`, it updates `blk_lpos->next` (which points to `desc->text_blk_lpos.next`) with a plain write:
> > ```c
> > // kernel/printk/printk_ringbuffer.c
> > static char *data_realloc(struct printk_ringbuffer *rb, unsigned int size,
> > struct prb_data_blk_lpos *blk_lpos, unsigned long id)
> > {
> > ...
> > blk_lpos->next = next_lpos; // <-- Write
> > return &blk->data[0];
> > }
> > ```
> >
> > Concurrently, another task acting as a reader accesses the ringbuffer via `desc_read()` to copy the descriptor structure to a local copy `desc_out`. It uses an unsynchronized `memcpy()` to read the `text_blk_lpos` structure which includes the 16 bytes representing both `begin` and `next` positions:
> > ```c
> > // kernel/printk/printk_ringbuffer.c
> > static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> > unsigned long id, struct prb_desc *desc_out,
> > u64 *seq_out, u32 *caller_id_out)
> > {
> > ...
> > if (desc_out) {
> > memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
> > sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */ // <-- Lockless Read
> > }
>
> The race is actually expected, see the comment above this code:
>
> /*
> * Copy the descriptor data. The data is not valid until the
> * state has been re-checked. A memcpy() for all of @desc
> * cannot be used because of the atomic_t @state_var field.
> */
> if (desc_out) {
> memcpy(&desc_out->text_blk_lpos, &desc->text_blk_lpos,
> sizeof(desc_out->text_blk_lpos)); /* LMM(desc_read:C) */
> }
>
> The data are copied. But the reader is allowed to use
> them only when the state of the block is marked as finalized.
>
> And there is a long way between data_realloc() and
> setting the finalized state:
>
> 1. Writer calls prb_reserve_in_last() -> data_realloc().
> Note that the reserved size can only get increased here.
>
> 2. If prb_reserve_in_last() succeeded that writer copies
> new data into the extended buffer.
>
> 3. The record is finalized in two situations:
>
> a) When the appended text ended with newline '\n'
> and the writer calls prb_final_commit().
>
> b) When the writer calls pr_commit() and next
> record is already reserved.
>
> c) When another writer calls prb_reserve() and the last
> (extended) record has already been committed.
>
> So, the race should be safe.
>
> But I agree that some race is there. The reader might copy
> an inconsistent descriptor, get delayed by long NMI, and
> then see already finalized state.
>
> I mean the following:
>
> CPU0 CPU1
>
> prb_read()
> desc_read()
> prb_prb_reserve_in_last()
> data_realloc()
> memcpy(text_blk_lpos) blk_lpos->next = next_lpos
>
> Warning: memcpy copies inconsistent lpos !!!
>
> NMI enter
>
> # Writer appends data and finalize
> # the record
> prb_final_commit()
>
> NMI handler ends
>
> state_val = atomic_long_read(state_var);
> d_state = get_desc_state(id, state_val);
>
> BANG: The reader gets inconsistent blk_lpos and sees
> finalized state.
>
> On the positive side:
> ---------------------
>
> There are still some sanity checks, for example:
>
> static const char *get_data(struct prb_data_ring *data_ring,
> struct prb_data_blk_lpos *blk_lpos,
> unsigned int *data_size)
> {
> [...]
> *data_size = blk_lpos->next - blk_lpos->begin;
> [...]
> if (WARN_ON_ONCE(!data_check_size(data_ring, *data_size) || !*data_size))
> return NULL;
> ...
>
> They might catch races when the data_size is negative number.
>
> As a result, I am not sure if any race could even produce
> bigger *data_size which might point behind the log buffer.
>
>
> Catch the race:
> ---------------
>
> We still should try to catch the race.
>
> For example, desc_read() reads the state before and after
> reading the data. It might invalidate them when the states
> differ. Something like:
>
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> index 56c8e3d031f4..a5a456904b78 100644
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -454,13 +454,13 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> struct printk_info *info = to_info(desc_ring, id);
> struct prb_desc *desc = to_desc(desc_ring, id);
> atomic_long_t *state_var = &desc->state_var;
> - enum desc_state d_state;
> + enum desc_state d_state_before, d_state_after;
> unsigned long state_val;
>
> /* Check the descriptor state. */
> state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
> - d_state = get_desc_state(id, state_val);
> - if (d_state == desc_miss || d_state == desc_reserved) {
> + d_state_before = get_desc_state(id, state_val);
> + if (d_state_before == desc_miss || d_state_before == desc_reserved) {
> /*
> * The descriptor is in an inconsistent state. Set at least
> * @state_var so that the caller can see the details of
> @@ -546,11 +546,27 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
> * which may have changed since the load above.
> */
> state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
> - d_state = get_desc_state(id, state_val);
> + d_state_after = get_desc_state(id, state_val);
> +
> + /*
> + * The copied output data might be inconsistent when the state
> + * has changed in the meantime. Make the caller aware that
> + * they are not final.
> + *
> + * FIXME: We need to double check all callers ignore the copied
> + * data when desc_reserved is returned!
> + *
> + * A cleaner solution might be introducing another state
> + * for this.
> + *
> + * Also we might want to check/modify @state_val instead.
> + */
> + if (d_state_after != d_state_before)
> + d_state_after = desc_reserved;

As explained in the "FIXME: ..." comment above, the proposed approach
is complicated.

It might be much easier to use "goto try_again" and re-read the data
when "state_val" has changed when reading. It will make sure that
the read data are in sync with the returned state.

> out:
> if (desc_out)
> atomic_long_set(&desc_out->state_var, state_val);
> - return d_state;
> + return d_state_after;
> }

Something like [rfc v2]:

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 56c8e3d031f4..c080edf5636a 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -454,11 +454,12 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
struct printk_info *info = to_info(desc_ring, id);
struct prb_desc *desc = to_desc(desc_ring, id);
atomic_long_t *state_var = &desc->state_var;
+ unsigned long state_val, state_val_check;
enum desc_state d_state;
- unsigned long state_val;

/* Check the descriptor state. */
state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
+try_again:
d_state = get_desc_state(id, state_val);
if (d_state == desc_miss || d_state == desc_reserved) {
/*
@@ -466,7 +467,9 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
* @state_var so that the caller can see the details of
* the inconsistent state.
*/
- goto out;
+ if (desc_out)
+ atomic_long_set(&desc_out->state_var, state_val);
+ return d_state;
}

/*
@@ -542,14 +545,19 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
smp_rmb(); /* LMM(desc_read:D) */

/*
- * The data has been copied. Return the current descriptor state,
- * which may have changed since the load above.
+ * The data has been copied. Make sure that they match the given @id
+ * and @d_state read at the beginning. Re-read them otherwise.
+ *
+ * Note that this simplifies the logic. Especially, the reader API
+ * expects that all values are stable when the record is finalized.
+ * But they are stable only when it stays finalized all the time.
*/
- state_val = atomic_long_read(state_var); /* LMM(desc_read:E) */
- d_state = get_desc_state(id, state_val);
-out:
- if (desc_out)
- atomic_long_set(&desc_out->state_var, state_val);
+ state_val_check = atomic_long_read(state_var); /* LMM(desc_read:E) */
+ if (state_val_check != state_val) {
+ state_val = state_val_check;
+ goto try_again;
+ }
+
return d_state;
}

> /*
>
> Also we might add more consistency checks when reading the data.
> For example:
>
> diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
> index 56c8e3d031f4..e252678337f6 100644
> --- a/kernel/printk/printk_ringbuffer.c
> +++ b/kernel/printk/printk_ringbuffer.c
> @@ -1319,6 +1335,12 @@ static const char *get_data(struct prb_data_ring *data_ring,
> /* Subtract block ID space from size to reflect data size. */
> *data_size -= sizeof(db->id);
>
> + /* Sanity check that we stay inside the printk data buffer */
> + if (WARN_ON_ONCE(&db->data[0] < data_ring->data ||
> + &db->data[0] + *data_size >= data_ring->data + DATA_SIZE(data_ring)))

I added some debug output:

pr_err_once("Printk RB begin: 0x%llx vs. 0x%llx\n",
&db->data[0], data_ring->data);
pr_err_once("Printk RB end: 0x%llx vs. 0x%llx\n",
&db->data[0] + *data_size,
data_ring->data + DATA_SIZE(data_ring));
pr_err_once("Printk RB size: 0x%x vs. 0x%x\n",
*data_size, DATA_SIZE(data_ring));

and WARN_ON_ONCE() triggeres when the data fills the very end of the
buffer, for example:

[ 69.349012] ------------[ cut here ]------------
[ 69.360993] &db->data[0] < data_ring->data || &db->data[0] + *data_size >= data_ring->data + DATA_SIZE(data_ring)
[ 69.360996] WARNING: kernel/printk/printk_ringbuffer.c:1331 at get_data+0x153/0x1a0, CPU#0: kunit_try_catch/1630
[...]
69.424288] ---[ end trace 0000000000000000 ]---
[ 69.424292] Printk RB begin: 0xffffffffc04a8fd8 vs. 0xffffffffc04a7040
[ 69.424294] Printk RB end: 0xffffffffc04a9040 vs. 0xffffffffc04a9040
^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^
[ 69.424297] Printk RB size: 0x68 vs. 0x2000

And obviously, there should be s|>=|>|, aka

+ if (WARN_ON_ONCE(&db->data[0] < data_ring->data ||
+ &db->data[0] + *data_size > data_ring->data + DATA_SIZE(data_ring)))

We are comparing size vs size...

I have never seen the warning anymore with the ">" check.

> + return NULL;
> +
> +
> return &db->data[0];
> }

Best Regards,
Petr