Re: [PATCH v3] loop: Fix NULL pointer dereference in lo_rw_aio()

From: Tetsuo Handa

Date: Wed May 27 2026 - 07:35:36 EST

On 2026/05/27 12:00, Ming Lei wrote:
> On Wed, May 27, 2026 at 10:35:56AM +0900, Tetsuo Handa wrote:
>> On 2026/05/27 10:20, Ming Lei wrote:
>>>> Of course we should try to figure out the root cause first, but how can we do?
>>>
>>> Definitely unexpected write IO(after umount & loop closed) from btrfs is more serious,
>>> which may cause data loss, so CC btrfs list and maintainer.
>>

I had a conversation with Google AI mode, and received the following response.

--------------------------------------------------------------------------------
Technical Analysis: lo_rw_aio() NULL Pointer Dereference / UAF since v7.1-rc1

1. The Root Cause of the Timing Shift

This regression was introduced during the v7.1-rc1 merge window, primarily exposed by
Commit 65565ca5f99b ("block: unify the synchronous bi_end_io callbacks"), along with
helper refactorings like Commit 92c3737a2473 ("block: add a bio_submit_or_kill helper").

Up to v7.0, the synchronous I/O completion path inherently lagged (due to serialized
completion handling and context switches) before notifying upper layers. This latency
accidentally acted as a safety barrier: by the time a filesystem completed its final
sync_filesystem() and initiated umount, the loop driver's worker (running lo_rw_aio())
had already finished processing every request.

In v7.1, the unification and optimization of bi_end_io significantly reduced this latency.
The filesystem now learns of "I/O completion" much sooner. Consequently, highly concurrent
filesystems such as btrfs or jfs proceed rapidly through kill_sb() and blkdev_put(),
ultimately invoking lo_release() -> __loop_clr_fd() while the loop driver's backend kworker
is still in the middle of executing the last sub-millisecond asynchronous file-backed I/O
request.
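
The timing argument above can be restated as a toy timeline. This is only a sketch of the
claim as worded here, not a measurement and not kernel code: struct toy_timeline and both
helpers are hypothetical names, and the tick values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy timeline in arbitrary ticks; nothing here exists in the kernel. */
struct toy_timeline {
	unsigned int worker_done_at;     /* tick at which the loop worker finishes its last request */
	unsigned int completion_seen_at; /* tick at which the filesystem is told its sync I/O completed */
	unsigned int teardown_cost;      /* ticks from that notification until the backing file is cleared */
};

/* Tick at which the teardown path would clear the backing file. */
static unsigned int toy_clear_tick(const struct toy_timeline *t)
{
	return t->completion_seen_at + t->teardown_cost;
}

/* True if the backing file is cleared while the worker is still running. */
static bool toy_teardown_overtakes_worker(const struct toy_timeline *t)
{
	return toy_clear_tick(t) < t->worker_done_at;
}
```

With a late notification (completion_seen_at = 8, teardown_cost = 5, worker_done_at = 10) the
helper reports no overlap; with an early one (completion_seen_at = 2) it reports that teardown
overtakes the worker. The point is only that safety here depends on timing, not on an
explicit barrier.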

2. Why the Block Layer's Built-in Quiesce/Freeze Fails

There is an implicit assumption that standard block layer freeze mechanisms (blk_mq_freeze_queue())
protect the device lifetime during release. However, the v7.1 BIO helper refactoring introduced
a synchronization gap:

1. The filesystem issues its final metadata or journal updates (e.g., txCommit in jfs or
   delayed refcount updates in btrfs) right at the unmount/close boundary.
2. Due to the optimized execution path, these requests are not yet counted by the block
   layer's in-flight request tracking at the moment blk_mq_freeze_queue() or the state
   validation checks read the count as zero.
3. The block layer assumes the queue is idle, allowing __loop_clr_fd() to proceed and
   set lo->lo_backing_file to NULL (or trigger fput()).
4. The untracked asynchronous kworker runs a fraction of a millisecond too late and accesses
   lo->lo_backing_file or invokes kiocb_end_write() -> file_inode(), leading to either a
   general protection fault (NULL pointer dereference) or a use-after-free (UAF).
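
The four steps above can be modelled in userspace as a small, single-threaded state machine.
This is a hedged sketch of the sequence as described, not the loop driver's code: struct
toy_loop, struct toy_file, and the toy_* helpers are hypothetical stand-ins, and the
"untracked" counter simply encodes the assumption that a request can escape the accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's struct file or struct loop_device. */
struct toy_file {
	int inode_no;
};

struct toy_loop {
	struct toy_file *backing; /* plays the role of lo->lo_backing_file */
	unsigned int tracked;     /* pending requests the idle check can see */
	unsigned int untracked;   /* pending requests that escaped the accounting (assumed possible) */
};

/* Steps 1-2: a late write arrives; 'accounted' says whether the counter sees it. */
static void toy_submit(struct toy_loop *lo, bool accounted)
{
	if (accounted)
		lo->tracked++;
	else
		lo->untracked++;
}

/*
 * Step 3: teardown clears the backing file only if the visible counter is zero.
 * Returns true if the backing file was cleared. A real freeze would sleep until
 * the counter drains; this model just reports "not yet".
 */
static bool toy_clear_fd(struct toy_loop *lo)
{
	if (lo->tracked != 0)
		return false;
	lo->backing = NULL;
	return true;
}

/*
 * Step 4: the worker runs one pending request. Returns the inode number, or -1
 * at the point where the real code would dereference a NULL backing file.
 */
static int toy_worker_run(struct toy_loop *lo)
{
	int ret;

	assert(lo->tracked + lo->untracked > 0); /* caller must have submitted something */
	ret = lo->backing ? lo->backing->inode_no : -1;
	if (lo->tracked)
		lo->tracked--;
	else
		lo->untracked--;
	return ret;
}
```

Submitting with accounted = false, then calling toy_clear_fd() followed by toy_worker_run(),
yields -1, which corresponds to the crash in step 4. Submitting with accounted = true makes
toy_clear_fd() refuse until the worker has run, which is the ordering the freeze is expected
to guarantee.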

3. Why This Isn't Just an "Unexpected FS Bug"

While the post-close write I/O originates from filesystems like btrfs and jfs, blaming the
filesystems alone ignores the underlying infrastructure change. The core issue is that the
block layer altered its synchronization behavior, breaking the barrier contract that
VFS and filesystems historically relied on during the device release path.

Papering over this inside individual file systems would require adding heavy, duplicated
barriers inside every single filesystem's unmount path.