Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure

From: Tal Zussman

Date: Fri May 22 2026 - 19:14:44 EST


On 5/14/26 5:51 PM, Tal Zussman wrote:

Responding to Sashiko review inline:

Link: https://sashiko.dev/#/patchset/20260514-blk-dontcache-v6-0-782e2fa7477b%40columbia.edu

> +static void bio_complete_work_fn(struct work_struct *w)
> +{
> + struct delayed_work *dw = to_delayed_work(w);
> + struct bio_complete_batch *batch =
> + container_of(dw, struct bio_complete_batch, work);
> +
> + while (1) {
> + struct bio_list list;
> + struct bio *bio;
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + list = batch->list;
> + bio_list_init(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> +
> + if (bio_list_empty(&list))
> + break;
> +
> + while ((bio = bio_list_pop(&list)))
> + bio->bi_end_io(bio);

Q: "Does this code handle bios that do not have a completion callback?
If a bio is submitted with the BIO_COMPLETE_IN_TASK flag but has a NULL
bi_end_io pointer (like some flush or discard bios), executing
bio->bi_end_io(bio) directly will cause a NULL pointer dereference.
The synchronous completion path in bio_endio() checks if bio->bi_end_io is
non-NULL before calling it. Should a similar check be added here?"

A: Every bio that reaches this point should have bi_end_io set. There's no
point in deferring a bio without a completion handler.

Q: "Could this loop also monopolize the CPU and cause latency spikes?
Because the work uses a 1-jiffie delay, tens of thousands of bios might
accumulate under heavy I/O load. Processing the entire batch sequentially
without yielding or calling cond_resched() inside the inner loop could
lead to scheduler latency or watchdog warnings on non-preemptible kernels."

A: With PREEMPT_LAZY as the default, the worker is preemptible, as per
Hillf and Christoph, so this is unnecessary.

> + if (need_resched()) {
> + bool is_empty;
> +
> + local_lock_irq(&bio_complete_batch.lock);
> + is_empty = bio_list_empty(&batch->list);
> + local_unlock_irq(&bio_complete_batch.lock);
> + if (!is_empty)
> + mod_delayed_work_on(batch->cpu,
> + bio_complete_wq,
> + &batch->work, 0);
> + break;
> + }
> + }
> +}
> +
> +void __bio_complete_in_task(struct bio *bio)
> +{
> + struct bio_complete_batch *batch;
> + unsigned long flags;
> + bool was_empty;
> +
> + local_lock_irqsave(&bio_complete_batch.lock, flags);

Q: "Is it safe to use local_lock_irqsave() here when called from an atomic
context?
On CONFIG_PREEMPT_RT kernels, local_lock_t maps to a per-CPU spinlock_t,
which is a sleepable rt_mutex. Since __bio_complete_in_task() is specifically
called when bio_in_atomic() is true (which includes hardware interrupts or
execution under a raw_spinlock_t), attempting to acquire a sleepable lock
here would trigger an "Invalid wait context" lockdep warning.
Would a lockless list (llist) be more appropriate here to avoid sleeping
in atomic contexts?"

A: This seems legit, but I'm not super familiar with PREEMPT_RT. I don't want
to switch to raw_spinlock_t, as it seems like that would add unnecessary
overhead on non-PREEMPT_RT kernels. I think switching to use local_irq_save()
(as is done for the per-CPU bio allocation cache) should work.

> + batch = this_cpu_ptr(&bio_complete_batch);
> + was_empty = bio_list_empty(&batch->list);
> + bio_list_add(&batch->list, bio);
> + local_unlock_irqrestore(&bio_complete_batch.lock, flags);
> +
> + if (was_empty)
> + mod_delayed_work_on(batch->cpu, bio_complete_wq,
> + &batch->work, 1);
> +}
> +EXPORT_SYMBOL_GPL(__bio_complete_in_task);
> +
> static inline bool bio_remaining_done(struct bio *bio)
> {
> /*
> @@ -1791,7 +1865,9 @@ void bio_endio(struct bio *bio)
> }
> #endif
>
> - if (bio->bi_end_io)
> + if (bio_flagged(bio, BIO_COMPLETE_IN_TASK) && bio_in_atomic())
> + __bio_complete_in_task(bio);
> + else if (bio->bi_end_io)
> bio->bi_end_io(bio);
> }
> EXPORT_SYMBOL(bio_endio);
> @@ -1977,6 +2053,51 @@ int bioset_init(struct bio_set *bs,
> }
> EXPORT_SYMBOL(bioset_init);
>
> +static int bio_complete_batch_cpu_online(unsigned int cpu)
> +{
> + enable_delayed_work(&per_cpu(bio_complete_batch, cpu).work);
> + return 0;
> +}
> +

Q: "Could a CPU hotplug rollback strand deferred bios indefinitely?
During the early teardown phase (CPUHP_AP_ONLINE_DYN),
bio_complete_batch_cpu_down_prep() disables the delayed work. However, the
CPU remains active and can continue completing bios. When these bios
complete, __bio_complete_in_task() adds them to the list with
was_empty = true, but mod_delayed_work_on() fails to queue the work
because it is disabled.
If the CPU offline operation is aborted, bio_complete_batch_cpu_online()
re-enables the delayed work but does not explicitly queue it. Because the
list is no longer empty, any future bio completions will see
was_empty == false and skip calling mod_delayed_work_on(), permanently
stranding the bios."

A: This is legit. The solution is to requeue the work if the list isn't
empty on CPU online.

> +/*
> + * Disable this CPU's delayed work so that it cannot run on an unbound worker
> + * after the CPU is offlined.
> + */
> +static int bio_complete_batch_cpu_down_prep(unsigned int cpu)
> +{
> + disable_delayed_work_sync(&per_cpu(bio_complete_batch, cpu).work);
> + return 0;
> +}
> +
> +/*
> + * Drain a dead CPU's deferred bio completions. The CPU is dead and the worker
> + * is canceled so no locking is needed.
> + */
> +static int bio_complete_batch_cpu_dead(unsigned int cpu)
> +{
> + struct bio_complete_batch *batch =
> + per_cpu_ptr(&bio_complete_batch, cpu);
> + struct bio *bio;
> +
> + while ((bio = bio_list_pop(&batch->list)))
> + bio->bi_end_io(bio);
> +

Q: "Is it possible to hit the same NULL pointer dereference issue here if the
bio lacks a bi_end_io callback?"

A: Same as above.

> + return 0;
> +}
> +
> +static void __init bio_complete_batch_init(int cpu)
> +{
> + struct bio_complete_batch *batch =
> + per_cpu_ptr(&bio_complete_batch, cpu);
> +
> + bio_list_init(&batch->list);
> + INIT_DELAYED_WORK(&batch->work, bio_complete_work_fn);
> + batch->cpu = cpu;
> +
> + if (!cpu_online(cpu))
> + disable_delayed_work_sync(&batch->work);
> +}
> +
> static int __init init_bio(void)
> {
> int i;
> @@ -1991,6 +2112,30 @@ static int __init init_bio(void)
> SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> }
>
> + for_each_possible_cpu(i)
> + bio_complete_batch_init(i);
> +
> + bio_complete_wq = alloc_workqueue("bio_complete",
> + WQ_MEM_RECLAIM | WQ_PERCPU, 0);

Q: "Does creating this workqueue with WQ_MEM_RECLAIM break the local_lock_irq()
protection in bio_complete_work_fn()?
When a workqueue has WQ_MEM_RECLAIM, it spawns a global rescuer thread that
can execute per-CPU work items during worker pool congestion. This rescuer
thread executes unbound, meaning it could run on CPU B while processing
CPU A's work item.
Since local_lock operates strictly on the currently executing CPU, the
rescuer thread on CPU B would acquire CPU B's lock, while popping elements
from CPU A's list (derived via container_of()).
If an interrupt on CPU A concurrently calls __bio_complete_in_task(),
it will acquire CPU A's lock and modify the same list without mutual
exclusion, potentially causing list corruption."

A: The rescuer should run on the same CPU, not unbound, so this is not an
issue.

> + if (!bio_complete_wq)
> + panic("bio: can't allocate bio_complete workqueue\n");
> +
> + /*
> + * bio task-context completion draining on hot-unplugged CPUs:
> + *
> + * 1. Stop the per-CPU delayed work while the CPU is still online, so
> + * that it cannot run on an unbound worker later.
> + * 2. Drain leftover bios added between worker disabling and CPU
> + * offlining.
> + */
> + cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> + "block/bio:complete:online",
> + bio_complete_batch_cpu_online,
> + bio_complete_batch_cpu_down_prep);
> + cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
> + "block/bio:complete:dead",
> + NULL, bio_complete_batch_cpu_dead);
> +
> cpuhp_setup_state_multi(CPUHP_BIO_DEAD, "block/bio:dead", NULL,
> bio_cpu_dead);
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 97d747320b35..c0214d6c28d6 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -369,6 +369,38 @@ static inline struct bio *bio_alloc(struct block_device *bdev,
>
> void submit_bio(struct bio *bio);
>
> +/**
> + * bio_in_atomic - check if the current context is unsafe for bio completion
> + *
> + * Return: %true in atomic contexts (e.g. hard/soft IRQ, preempt-disabled);
> + * %false when a bio can be safely completed in the current context.
> + */
> +static inline bool bio_in_atomic(void)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPTION) && rcu_preempt_depth())
> + return true;
> + if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
> + return true;

Q: "Will this cause an infinite loop of bio offloading on kernels with
CONFIG_PREEMPT_COUNT disabled?
Because bio_in_atomic() unconditionally returns true without preempt count
support, a dynamic call to bio_complete_in_task() from within a bi_end_io()
callback will always offload the bio to the workqueue.
When the workqueue executes bio->bi_end_io(bio), the callback will evaluate
bio_complete_in_task() again, which will return true again, creating a
permanent offloading loop."

A: Legit issue. This can be solved by changing bio_complete_in_task() to:

static inline bool bio_complete_in_task(struct bio *bio)
{
if (bio_flagged(bio, BIO_COMPLETE_IN_TASK))
return false;
if (!bio_in_atomic())
return false;
bio_set_flag(bio, BIO_COMPLETE_IN_TASK);
__bio_complete_in_task(bio);
return true;
}

We can use the BIO_COMPLETE_IN_TASK flag to indicate that it's already
been deferred to the workqueue as is safe to run.

> + return !preemptible();
> +}
> +
> +void __bio_complete_in_task(struct bio *bio);
> +
> +/**
> + * bio_complete_in_task - ensure a bio is completed in preemptible task context
> + * @bio: bio to complete
> + *
> + * If called from non-task context, offload the bio completion to a worker
> + * thread and return %true. Else return %false and do nothing.
> + */
> +static inline bool bio_complete_in_task(struct bio *bio)
> +{
> + if (!bio_in_atomic())
> + return false;
> + __bio_complete_in_task(bio);
> + return true;
> +}
> +
> extern void bio_endio(struct bio *);
>
> static inline void bio_io_error(struct bio *bio)
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 8808ee76e73c..d49d97a050d0 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -322,6 +322,7 @@ enum {
> BIO_REMAPPED,
> BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
> BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
> + BIO_COMPLETE_IN_TASK, /* complete bi_end_io() in task context */
> BIO_FLAG_LAST
> };
>
>