Re: [PATCH] btrfs: wait for ordered extents before buffered write fallback in direct IO

From: Qu Wenruo

Date: Thu Jun 25 2026 - 01:17:59 EST




On 2026/6/25 14:43, Qu Wenruo wrote:


On 2026/6/25 11:44, Yun Zhou wrote:
When btrfs_direct_write() falls back to buffered IO after a failed DIO
attempt, it may race with the asynchronous completion of DIO ordered
extents.  This leads to a BUG_ON in insert_ordered_extent() due to
overlapping ordered extents in the per-inode rb-tree.

The race sequence is:
  1. DIO creates an ordered extent via btrfs_dio_iomap_begin()
  2. Page fault occurs (nofault=true), no bio is submitted (submitted=0)
  3. btrfs_dio_iomap_end() truncates and finishes the OE asynchronously
     via btrfs_finish_ordered_extent() which queues work
  4. iomap returns 0, retry logic faults in pages and retries DIO
  5. Second DIO attempt also fails, code reaches buffered: label
  6. btrfs_buffered_write() dirties pages for the same range

btrfs_buffered_write()
|- copy_one_range()
   |- lock_and_cleanup_extent_if_needed()
      |- btrfs_start_ordered_extent()

So your explanation doesn't make sense. If a direct IO OE were still remaining, the buffered write would wait for that OE to complete.

There is still something missing.

  7. btrfs_fdatawrite_range() triggers writeback
  8. run_delalloc_nocow() -> fallback_to_cow() -> cow_file_range()
     tries to insert a new ordered extent for the same file offset
  9. The DIO ordered extent hasn't been removed from the rb-tree yet
     (btrfs_finish_ordered_io running async in workqueue) -> BUG_ON

Fix this by waiting for any pending ordered extents in the target range
before starting the buffered write.
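
The wait-then-insert pattern described above can be sketched in user space as follows. This is only a toy model, not btrfs code: every `toy_*` name is hypothetical, a flat array stands in for the per-inode rb-tree, and "waiting" is simulated by marking the extent as completed. It illustrates the pattern the patch proposes and says nothing about whether the described race can actually occur.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an ordered extent: byte range [start, start + len). */
struct toy_oe {
	uint64_t start;
	uint64_t len;
	bool pending;	/* true until the simulated async completion runs */
};

#define TOY_MAX_OE 16

/* Hypothetical stand-in for the per-inode ordered extent tree. */
struct toy_inode {
	struct toy_oe oes[TOY_MAX_OE];
	size_t nr;
};

/* Return the first pending extent overlapping [start, start + len), or NULL. */
struct toy_oe *toy_lookup_range(struct toy_inode *inode, uint64_t start,
				uint64_t len)
{
	for (size_t i = 0; i < inode->nr; i++) {
		struct toy_oe *oe = &inode->oes[i];

		if (oe->pending && oe->start < start + len &&
		    start < oe->start + oe->len)
			return oe;
	}
	return NULL;
}

/*
 * Insert a new pending extent. Returns false if it would overlap a pending
 * one (the toy equivalent of the conflict the patch describes) or if the
 * array is full.
 */
bool toy_insert(struct toy_inode *inode, uint64_t start, uint64_t len)
{
	if (inode->nr == TOY_MAX_OE || toy_lookup_range(inode, start, len))
		return false;
	inode->oes[inode->nr++] = (struct toy_oe){ start, len, true };
	return true;
}

/* Simulate waiting: complete every pending extent overlapping the range. */
void toy_wait_range(struct toy_inode *inode, uint64_t start, uint64_t len)
{
	struct toy_oe *oe;

	while ((oe = toy_lookup_range(inode, start, len)) != NULL)
		oe->pending = false;
}
```

In this model, a second `toy_insert()` for a range that still has a pending extent fails, while calling `toy_wait_range()` first lets it succeed.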

Reported-by: syzbot+ba2afde329fc27e3f22e@xxxxxxxxxxxxxxxxxxxxxxxxx
Closes: https://syzkaller.appspot.com/bug?extid=ba2afde329fc27e3f22e
Fixes: acf9ed3a6c00 ("btrfs: retry faulting in the pages after a zero sized short direct write")

The Fixes: tag is also incorrect.

Without that commit, we would fall back directly to buffered write, without retrying to fault in the pages.

So by your explanation, the same problem would be triggered with or without that commit.

Signed-off-by: Yun Zhou <yun.zhou@xxxxxxxxxxxxx>
---
  fs/btrfs/direct-io.c | 24 ++++++++++++++++++++++++
  1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 460326d34143..e8ac9492844c 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -844,6 +844,7 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
      struct file *file = iocb->ki_filp;
      struct inode *inode = file_inode(file);
      struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+    struct btrfs_ordered_extent *ordered;
      loff_t pos;
      ssize_t written = 0;
      ssize_t written_buffered;
@@ -1025,6 +1026,29 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
      }
      pos = iocb->ki_pos;
+
+    /*
+     * The DIO path may have created ordered extent(s) that are still being
+     * processed asynchronously in a work queue.  We must wait for them to
+     * be fully completed and removed from the rb-tree before doing a
+     * buffered write to the same or overlapping range; otherwise the
+     * buffered writeback path (run_delalloc_nocow -> fallback_to_cow ->
+     * cow_file_range) may try to insert a new ordered extent that conflicts
+     * with the still-pending DIO one, triggering a BUG_ON in
+     * insert_ordered_extent().
+     *
+     * This happens when DIO creates an ordered extent but has a short write
+     * (submitted < length in btrfs_dio_iomap_end()), which truncates and
+     * finishes the ordered extent asynchronously while we fall back to
+     * buffered IO for the same range.
+     */
+    while ((ordered = btrfs_lookup_ordered_range(BTRFS_I(inode),
+                (u64)(pos - written),
+                (u64)written + iov_iter_count(from))) != NULL) {
+        btrfs_start_ordered_extent(ordered);
+        btrfs_put_ordered_extent(ordered);
+    }
+
      written_buffered = btrfs_buffered_write(iocb, from);
      if (written_buffered < 0) {
          ret = written_buffered;