Re: [PATCH] btrfs: wait for ordered extents before buffered write fallback in direct IO

From: Qu Wenruo

Date: Thu Jun 25 2026 - 01:14:03 EST

On 2026/6/25 11:44, Yun Zhou wrote:

When btrfs_direct_write() falls back to buffered IO after a failed DIO
attempt, it may race with the asynchronous completion of DIO ordered
extents. This leads to a BUG_ON in insert_ordered_extent() due to
overlapping ordered extents in the per-inode rb-tree.

The race sequence is:
1. DIO creates an ordered extent (OE) via btrfs_dio_iomap_begin()
2. A page fault occurs (nofault=true), so no bio is submitted (submitted=0)
3. btrfs_dio_iomap_end() truncates the OE and finishes it asynchronously
   via btrfs_finish_ordered_extent(), which queues work
4. iomap returns 0; the retry logic faults in the pages and retries the DIO
5. The second DIO attempt also fails, and the code reaches the buffered: label
6. btrfs_buffered_write() dirties pages for the same range

btrfs_buffered_write()
|- copy_one_range()
   |- lock_and_cleanup_extent_if_needed()
      |- btrfs_start_ordered_extent()

So your explanation doesn't make sense: if a direct IO OE is still there, we will wait for that OE to complete.
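To illustrate the point above, here is a minimal single-threaded model of that wait, assuming only what the call tree shows: before dirtying a range, the buffered path looks for a pending ordered extent overlapping it and waits for it, retrying until none is left. All names (oe_model, lookup_overlap, wait_oe, model_buffered_write) are hypothetical; this is not btrfs code, and wait_oe() simply marks the extent complete instead of blocking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ordered extent covering [start, start + len). */
struct oe_model {
	uint64_t start, len;
	bool pending;            /* not yet completed and removed */
};

/* Return the first pending extent overlapping [start, start + len), or NULL. */
static struct oe_model *lookup_overlap(struct oe_model *oes, size_t n,
				       uint64_t start, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		struct oe_model *e = &oes[i];

		if (e->pending && e->start < start + len && start < e->start + e->len)
			return e;
	}
	return NULL;
}

/* Stand-in for btrfs_start_ordered_extent(): here "waiting" just completes it. */
static void wait_oe(struct oe_model *e)
{
	e->pending = false;
}

/*
 * Model of the buffered-write check: only proceed once no pending extent
 * overlaps the range. Returns how many times it had to wait.
 */
static int model_buffered_write(struct oe_model *oes, size_t n,
				uint64_t start, uint64_t len)
{
	struct oe_model *e;
	int waits = 0;

	while ((e = lookup_overlap(oes, n, start, len)) != NULL) {
		wait_oe(e);
		waits++;
	}
	/* ... copy user data and dirty the pages here ... */
	return waits;
}
```

A second write to an already-clean range does not wait at all; only overlapping pending extents are waited on.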

There is still something missing.

7. btrfs_fdatawrite_range() triggers writeback
8. run_delalloc_nocow() -> fallback_to_cow() -> cow_file_range()
   tries to insert a new ordered extent for the same file offset
9. The DIO ordered extent has not been removed from the rb-tree yet
   (btrfs_finish_ordered_io() is still running asynchronously in a workqueue) -> BUG_ON
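The conflict in step 9 can be modelled in userspace roughly as follows, as a sketch under stated assumptions: a flat array stands in for the per-inode rb-tree, and an insert is rejected when its start offset falls inside a range that is still present. The names (oe_model, oe_tree_model, oe_tree_insert) are hypothetical, and where the kernel hits the BUG_ON this sketch returns -EEXIST so the conflict is observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ordered extent: [file_offset, file_offset + num_bytes). */
struct oe_model {
	uint64_t file_offset;
	uint64_t num_bytes;
	bool in_tree;            /* still linked in the per-inode tree? */
};

#define MAX_OE 8

/* Hypothetical per-inode tree, modelled as a flat array for clarity. */
struct oe_tree_model {
	struct oe_model oe[MAX_OE];
	size_t count;
};

/* Return true if @offset falls inside a range still present in the tree. */
static bool oe_tree_covers(const struct oe_tree_model *t, uint64_t offset)
{
	for (size_t i = 0; i < t->count; i++) {
		const struct oe_model *e = &t->oe[i];

		if (e->in_tree && offset >= e->file_offset &&
		    offset < e->file_offset + e->num_bytes)
			return true;
	}
	return false;
}

/* Model of the insert step: -EEXIST here is where the kernel would BUG. */
static int oe_tree_insert(struct oe_tree_model *t, uint64_t start, uint64_t len)
{
	if (oe_tree_covers(t, start))
		return -EEXIST;
	if (t->count == MAX_OE)
		return -ENOSPC;
	t->oe[t->count++] = (struct oe_model){ start, len, true };
	return 0;
}
```

The COW insert for offset 0 fails while the DIO extent is still in the tree and succeeds once the asynchronous completion has removed it, which is exactly the window described in steps 7 to 9.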

Fix this by waiting for any pending ordered extents in the target range
before starting the buffered write.
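The proposed fix can be sketched with POSIX threads as a userspace model: a worker thread plays the queued completion and removes the DIO extent "later", while the fallback path loops until no overlapping extent remains, as the patch's btrfs_lookup_ordered_range()/btrfs_start_ordered_extent() loop does. Here start/len stand for the patch's (pos - written) and written + iov_iter_count(from); every other name (oe_model, finish_worker, wait_ordered_range, run_fallback) is hypothetical and not btrfs code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical model: one DIO ordered extent, finished asynchronously. */
struct oe_model {
	uint64_t start, len;
	bool in_tree;            /* still linked in the per-inode tree */
	pthread_mutex_t lock;
	pthread_cond_t done;     /* broadcast when removed from the tree */
};

/* Stand-in for the queued completion work running from a workqueue. */
static void *finish_worker(void *arg)
{
	struct oe_model *oe = arg;
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

	nanosleep(&delay, NULL); /* completion happens some time later */
	pthread_mutex_lock(&oe->lock);
	oe->in_tree = false;
	pthread_cond_broadcast(&oe->done);
	pthread_mutex_unlock(&oe->lock);
	return NULL;
}

static bool overlaps(const struct oe_model *oe, uint64_t start, uint64_t len)
{
	return oe->start < start + len && start < oe->start + oe->len;
}

/* Model of the patch's loop: wait until no overlapping extent is left. */
static void wait_ordered_range(struct oe_model *oe, uint64_t start, uint64_t len)
{
	pthread_mutex_lock(&oe->lock);
	while (oe->in_tree && overlaps(oe, start, len))
		pthread_cond_wait(&oe->done, &oe->lock);
	pthread_mutex_unlock(&oe->lock);
}

/* One fallback scenario; returns true if the range is clear after waiting. */
static bool run_fallback(uint64_t start, uint64_t len)
{
	struct oe_model oe = { .start = 0, .len = 4096, .in_tree = true };
	pthread_t worker;
	bool clear;

	pthread_mutex_init(&oe.lock, NULL);
	pthread_cond_init(&oe.done, NULL);
	pthread_create(&worker, NULL, finish_worker, &oe);

	wait_ordered_range(&oe, start, len);
	/* A new COW ordered extent for this range could be inserted here. */
	pthread_mutex_lock(&oe.lock);
	clear = !oe.in_tree || !overlaps(&oe, start, len);
	pthread_mutex_unlock(&oe.lock);

	pthread_join(worker, NULL);
	pthread_cond_destroy(&oe.done);
	pthread_mutex_destroy(&oe.lock);
	return clear;
}
```

The while loop around pthread_cond_wait() also covers spurious wakeups. A range that does not overlap the DIO extent returns immediately without waiting.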

Reported-by: syzbot+ba2afde329fc27e3f22e@xxxxxxxxxxxxxxxxxxxxxxxxx
Closes: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
Fixes: acf9ed3a6c00 ("btrfs: retry faulting in the pages after a zero sized short direct write")
Signed-off-by: Yun Zhou <yun.zhou@xxxxxxxxxxxxx>
---
fs/btrfs/direct-io.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 460326d34143..e8ac9492844c 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -844,6 +844,7 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	struct btrfs_ordered_extent *ordered;
 	loff_t pos;
 	ssize_t written = 0;
 	ssize_t written_buffered;
@@ -1025,6 +1026,29 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 	pos = iocb->ki_pos;
+
+	/*
+	 * The DIO path may have created ordered extent(s) that are still being
+	 * processed asynchronously in a work queue. We must wait for them to
+	 * be fully completed and removed from the rb-tree before doing a
+	 * buffered write to the same or overlapping range; otherwise the
+	 * buffered writeback path (run_delalloc_nocow -> fallback_to_cow ->
+	 * cow_file_range) may try to insert a new ordered extent that conflicts
+	 * with the still-pending DIO one, triggering a BUG_ON in
+	 * insert_ordered_extent().
+	 *
+	 * This happens when DIO creates an ordered extent but has a short write
+	 * (submitted < length in btrfs_dio_iomap_end()), which truncates and
+	 * finishes the ordered extent asynchronously while we fall back to
+	 * buffered IO for the same range.
+	 */
+	while ((ordered = btrfs_lookup_ordered_range(BTRFS_I(inode),
+						     (u64)(pos - written),
+						     (u64)written + iov_iter_count(from))) != NULL) {
+		btrfs_start_ordered_extent(ordered);
+		btrfs_put_ordered_extent(ordered);
+	}
+
 	written_buffered = btrfs_buffered_write(iocb, from);
 	if (written_buffered < 0) {
 		ret = written_buffered;