[PATCH v6 0/4] block: enable RWF_DONTCACHE for block devices

From: Tal Zussman

Date: Thu May 14 2026 - 17:54:30 EST


Add support for using RWF_DONTCACHE with block devices.

Dropbehind pruning needs to be done in non-IRQ context, but block
devices complete writeback in IRQ context. To fix this, we defer
dropbehind invalidation to task context. Add infrastructure that lets
bi_end_io callbacks run from a worker, in two forms:

1. BIO_COMPLETE_IN_TASK, a bio flag the submitter sets when it knows
   upfront that the callback needs task context, as in the dropbehind
   writeback paths.

2. bio_complete_in_task(), a helper that callbacks can invoke from
   bi_end_io() when the decision to defer is dynamic, as in iomap
   fserror reporting.

Both forms queue the bio on a per-CPU batch and schedule a delayed work
item that completes it in task context.
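
The two forms can be illustrated with a small userspace model. This is
only a sketch of the idea, not the code from patch 1: every model_* name
is hypothetical, a single global list stands in for the per-CPU batches,
a plain bool stands in for bio_in_atomic(), and it is an assumption that
the helper returns true when it has queued the bio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct bio; not the kernel type. */
struct model_bio {
	struct model_bio *next;
	bool complete_in_task;	/* models the BIO_COMPLETE_IN_TASK flag */
	bool had_error;		/* models a condition only known at completion */
	bool done;		/* set once the callback's real work has run */
	void (*end_io)(struct model_bio *bio);
};

/* One pending list; the cover letter describes one such batch per CPU. */
static struct model_bio *model_head, *model_tail;
static bool model_work_scheduled;
static bool model_in_atomic;	/* models "running in IRQ context" */

/* Append the bio to the pending list and note that the worker must run. */
static void model_defer(struct model_bio *bio)
{
	bio->next = NULL;
	if (model_tail)
		model_tail->next = bio;
	else
		model_head = bio;
	model_tail = bio;
	model_work_scheduled = true;
}

/* Form 2 (assumed shape): true means "queued, the worker will call back". */
static bool model_complete_in_task(struct model_bio *bio)
{
	if (!model_in_atomic)
		return false;
	model_defer(bio);
	return true;
}

/* Completion entry point. Form 1: honor the flag set by the submitter. */
static void model_bio_endio(struct model_bio *bio)
{
	if (bio->complete_in_task && model_complete_in_task(bio))
		return;
	bio->end_io(bio);
}

/* Static case: a flagged bio must never reach this in atomic context. */
static void model_simple_end_io(struct model_bio *bio)
{
	assert(!model_in_atomic || !bio->complete_in_task);
	bio->done = true;
}

/* Dynamic case: the callback itself decides to defer, e.g. on error. */
static void model_dynamic_end_io(struct model_bio *bio)
{
	if (bio->had_error && model_complete_in_task(bio))
		return;
	bio->done = true;
}

/* Worker: detach the pending list, then run each callback in task context. */
static void model_worker(void)
{
	struct model_bio *bio = model_head;

	model_head = model_tail = NULL;
	model_work_scheduled = false;
	model_in_atomic = false;
	while (bio) {
		struct model_bio *next = bio->next;

		bio->end_io(bio);
		bio = next;
	}
}
```

In this model an unflagged bio without an error completes inline even in
atomic context, while a flagged bio or one that hits the dynamic case is
completed only when model_worker() runs.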

Patch 1 adds the block layer task-context completion infrastructure,
with both the flag and the procedural helper. This builds on top of
suggestions by Matthew and Christoph: the procedural helper and
bio_in_atomic() come from Christoph's "bio completion in task
enhancements / experiments" series [1].

[Christoph, I put you down as Suggested-by for this patch. Let me know
if you'd like it to be Co-authored-by with your sign-off.]

Patch 2 wires BIO_COMPLETE_IN_TASK into iomap writeback for dropbehind
folios, removes IOMAP_IOEND_DONTCACHE, and removes the DONTCACHE
workqueue deferral from XFS.

Patch 3 sets up DONTCACHE support for buffer-head-based I/O by setting
BIO_COMPLETE_IN_TASK in submit_bh_wbc() for the CONFIG_BUFFER_HEAD
path.

Patch 4 enables RWF_DONTCACHE for block devices based on the previous
support. This support is useful for databases that operate on raw block
devices, among other userspace applications.
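
For reference, a minimal userspace sketch of how an application might
request this behavior through pwritev2(). The helper names and the
fallback policy are hypothetical, and the fallback #define assumes the
RWF_DONTCACHE value from the Linux UAPI header for userspace headers
that predate the flag.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Older userspace headers may lack the flag; value assumed from the UAPI. */
#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080
#endif

/*
 * Hypothetical helper: try an uncached buffered write and fall back to a
 * plain pwrite() when the kernel or the file does not support the flag.
 */
static ssize_t write_dontcache(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL && len > 0);
	ret = pwritev2(fd, &iov, 1, off, RWF_DONTCACHE);
	if (ret < 0 &&
	    (errno == EOPNOTSUPP || errno == EINVAL || errno == ENOSYS))
		ret = pwrite(fd, buf, len, off);
	return ret;
}

/* Hypothetical convenience wrapper: open path, write at offset 0, close. */
static ssize_t write_path_dontcache(const char *path, const void *buf,
				    size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT, 0600);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write_dontcache(fd, buf, len, 0);
	close(fd);
	return ret;
}
```

With this series applied, the same call on a file descriptor for a block
device node would be expected to take the RWF_DONTCACHE path rather than
the fallback.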

I tested this (with CONFIG_BUFFER_HEAD=y) for reads and writes on a
single block device on a VM, so results may be noisy.

Reads were tested on the root partition with a 45GB range (~2x RAM).
Writes were tested on a disabled swap partition (~1GB) in a 244MB memcg
to force reclaim pressure.

Results:

===== READS (/dev/nvme0n1p2) =====
 sec   normal MB/s   dontcache MB/s
----   -----------   --------------
   1        1098.6           1609.0
   2        1270.3           1506.6
   3        1093.3           1576.5
   4        1141.8           2393.9
   5        1365.3           2793.8
   6        1324.6           2065.9
   7         879.6           1920.7
   8        1434.1           1662.4
   9        1184.9           1857.9
  10        1166.4           1702.8
  11        1161.4           1653.4
  12        1086.9           1555.4
  13        1198.5           1718.9
  14        1111.9           1752.2
----   -----------   --------------
 avg        1173.7           1828.8  (+56%)

===== WRITES (/dev/nvme0n1p3) =====
 sec   normal MB/s   dontcache MB/s
----   -----------   --------------
   1         692.4           9297.7
   2        4810.8           9342.8
   3        5221.7           2955.2
   4         396.7           8488.3
   5        7249.2           9249.3
   6        6695.4           1376.2
   7         122.9           9125.8
   8        5486.5           9414.7
   9        6921.5           8743.5
  10          27.9           8997.8
----   -----------   --------------
 avg        3762.5           7699.1  (+105%)

[1]: https://lore.kernel.org/all/20260409160243.1008358-1-hch@xxxxxx/

---
Changes in v6:
- Remove RFC tag.
- Rebase on v7.1-rc3.
- 1/4: Revert to using a bio_list, per Jens.
- 1/4: Restructure and simplify work function loop.
- 1/4: Expose both the flag and procedural version, in order to allow
static and dynamic deferral decisions, per conversation with Matthew
and Christoph at LSFMM.
- 1/4: Use bio_in_atomic() predicate, per Christoph.
- 1/4: Use the CPU hot-unplug protocol from mm/vmstat.c, to take into
account use of delayed_work.
- 1/4: Mark the workqueue WQ_PERCPU.
- 1/4: Add comments.
- 3/4 and 4/4: Split into two patches, per Christoph.
- 3/4: Drop the cont_write_begin() change. Block devices don't go
through cont_write_begin(), so it was out of scope and was left over
from v1.
- Link to v5: https://lore.kernel.org/r/20260408-blk-dontcache-v5-0-0f080c20a96f@xxxxxxxxxxxx

Changes in v5:
- 1/3: Replace local_lock + bio_list with struct llist, per Dave.
- 1/3: Use delayed_work with 1-jiffy delay, per Dave.
- 1/3: Add dedicated workqueue to avoid deadlocks, per Christoph.
- 1/3: Restructure work function as do/while loop and schedule work
only when the list was previously empty, per Jens.
- 2/3: Delete IOMAP_IOEND_DONTCACHE and its NOMERGE entry, per Matthew
and Christoph.
- Link to v4: https://lore.kernel.org/r/20260325-blk-dontcache-v4-0-c4b56db43f64@xxxxxxxxxxxx

Changes in v4:
- 1/3: Move dropbehind deferral from folio-level to bio-level using
BIO_COMPLETE_IN_TASK, per Matthew and Jan.
- 1/3: Work function yields on need_resched() to avoid hogging the CPU,
per Jan.
- 2/3: New patch. Set BIO_COMPLETE_IN_TASK on iomap writeback bios for
DONTCACHE folios, removing the need for XFS-specific workqueue
deferral.
- 3/3: Set BIO_COMPLETE_IN_TASK in submit_bh_wbc() for buffer_head
path.
- 3/3: Update commit message to mention CONFIG_BUFFER_HEAD=n path.
- Link to v3: https://lore.kernel.org/r/20260227-blk-dontcache-v3-0-cd309ccd5868@xxxxxxxxxxxx

Changes in v3:
- 1/2: Convert dropbehind deferral to per-CPU folio_batches protected by
local_lock using per-CPU work items, to reduce contention, per Jens.
- 1/2: Call folio_end_dropbehind_irq() directly from
folio_end_writeback(), per Jens.
- 1/2: Add CPU hotplug dead callback to drain the departing CPU's folio
batch.
- 2/2: Introduce block_write_begin_iocb(), per Christoph.
- 2/2: Dropped R-b due to changes.
- Link to v2: https://lore.kernel.org/r/20260225-blk-dontcache-v2-0-70e7ac4f7108@xxxxxxxxxxxx

Changes in v2:
- Add R-b from Jan Kara for 2/2.
- Add patch to defer dropbehind completion from IRQ context via a work
item (1/2).
- Add initial performance numbers to cover letter.
- Link to v1: https://lore.kernel.org/r/20260218-blk-dontcache-v1-1-fad6675ef71f@xxxxxxxxxxxx

---
Tal Zussman (4):
      block: add task-context bio completion infrastructure
      iomap: use BIO_COMPLETE_IN_TASK for dropbehind writeback
      buffer: add dropbehind writeback support
      block: enable RWF_DONTCACHE for block devices

 block/bio.c                 | 147 +++++++++++++++++++++++++++++++++++++++++++-
 block/fops.c                |   5 +-
 fs/buffer.c                 |  19 +++++-
 fs/iomap/ioend.c            |   5 +-
 fs/xfs/xfs_aops.c           |   4 --
 include/linux/bio.h         |  32 ++++++++++
 include/linux/blk_types.h   |   1 +
 include/linux/buffer_head.h |   3 +
 include/linux/iomap.h       |   5 +-
 9 files changed, 206 insertions(+), 15 deletions(-)
---
base-commit: 695fee9be55747935d0a7b58f3d1fb83397a8b4f
change-id: 20260218-blk-dontcache-338133dd045e

Best regards,
--
Tal Zussman <tz2294@xxxxxxxxxxxx>