[PATCH -RFC 2/2] ext4: avoid data corruption when extending DIO write race with buffered read

From: Baokun Li
Date: Sat Dec 02 2023 - 04:11:22 EST

Next message: Russell King (Oracle): "Re: [PATCH net] net: phylink: set phy_state interface when attaching SFP"
Previous message: Baokun Li: "[PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read"
In reply to: Baokun Li: "[PATCH -RFC 1/2] mm: avoid data corruption when extending DIO write race with buffered read"
Next in thread: Jan Kara: "Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read"

The following race between an extending DIO write and a buffered read
may result in reading stale data from the page cache:

            cpu1                            cpu2
------------------------------|-----------------------------
// Direct write 1024 from 4096
                               // Buffer read 8192 from 0
...                            ...
ext4_file_write_iter
 ext4_dio_write_iter
  iomap_dio_rw
   ...
                               ext4_file_read_iter
                                generic_file_read_iter
                                 filemap_read
                                  i_size_read(inode) // 4096
                                  filemap_get_pages
                                   ...
                                   ext4_mpage_readpages
                                    ext4_readpage_limit(inode)
                                     i_size_read(inode) // 4096
                                    // read 4096, zero-filled 4096
   ext4_dio_write_end_io
    i_size_write(inode, 5120)
                                  i_size_read(inode) // 5120
                                  copyout 4096

                               // new read 4096 from 4096
                               ext4_file_read_iter
                                generic_file_read_iter
                                 filemap_read
                                  i_size_read(inode) // 5120
                                  filemap_get_pages
                                   // stale page is uptodate
                                  i_size_read(inode) // 5120
                                  copyout 5120
   dio invalidate stale page cache

In the above race, after the DIO write updates the inode size but before
it invalidates the stale page cache, the buffered read sees that the last
page it read is still uptodate in the page cache. It therefore copies that
page directly to user space instead of re-reading it from disk, so the
trailing 1024 bytes returned differ from the data on disk.
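
The interleaving above can be replayed deterministically in a small
user-space model. This is only a sketch: struct toy_file, the toy_*
helpers and the 0xAB fill pattern are hypothetical names invented for
illustration, not kernel code. It tracks just the second page, its
uptodate flag and the inode size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical toy model: a two-page file whose second page may be cached. */
struct toy_file {
	unsigned char disk[2 * TOY_PAGE_SIZE];	/* on-disk contents */
	unsigned char cache[TOY_PAGE_SIZE];	/* cached copy of the page at 4096 */
	bool cache_uptodate;			/* cached page is marked uptodate */
	size_t i_size;				/* inode size */
};

/* Number of bytes of the second page that lie below i_size. */
static size_t toy_valid_len(const struct toy_file *f)
{
	size_t len = f->i_size > TOY_PAGE_SIZE ? f->i_size - TOY_PAGE_SIZE : 0;

	return len > TOY_PAGE_SIZE ? TOY_PAGE_SIZE : len;
}

/* Fill the cached page from disk up to i_size and zero-fill the rest. */
static void toy_fill_page(struct toy_file *f)
{
	size_t valid = toy_valid_len(f);

	memcpy(f->cache, f->disk + TOY_PAGE_SIZE, valid);
	memset(f->cache + valid, 0, TOY_PAGE_SIZE - valid);
	f->cache_uptodate = true;
}

/* Buffered read of the second page: an uptodate page is trusted as-is. */
static size_t toy_read_page(struct toy_file *f, unsigned char *buf)
{
	size_t len = toy_valid_len(f);

	if (!f->cache_uptodate)
		toy_fill_page(f);
	memcpy(buf, f->cache, len);
	return len;
}

/* Replay the race; return how many copied bytes differ from the disk. */
static size_t toy_replay_race(void)
{
	struct toy_file f;
	unsigned char buf[TOY_PAGE_SIZE];
	size_t len, i, stale = 0;

	memset(&f, 0, sizeof(f));
	f.i_size = TOY_PAGE_SIZE;	/* file is 4096 bytes before the write */

	/* cpu2: the first read sees i_size == 4096 and zero-fills the page */
	toy_fill_page(&f);
	assert(f.cache_uptodate);

	/* cpu1: the DIO write puts 1024 bytes on disk at offset 4096 ... */
	memset(f.disk + TOY_PAGE_SIZE, 0xAB, 1024);
	/* ... and publishes the new size before invalidating the cache */
	f.i_size = TOY_PAGE_SIZE + 1024;

	/* cpu2: the new read at offset 4096 finds the stale page uptodate */
	len = toy_read_page(&f, buf);

	/* cpu1: the invalidation comes too late for the read above */
	f.cache_uptodate = false;

	for (i = 0; i < len; i++)
		if (buf[i] != f.disk[TOY_PAGE_SIZE + i])
			stale++;
	return stale;
}
```

In this model toy_replay_race() reports that all 1024 bytes of the
extended tail differ from the disk, because the zero-filled page is
copied out while it is still marked uptodate.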

To get around this, wait for any in-flight DIO write to finish, and thus
to invalidate the stale page cache, before each new buffered read.
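
The idea of waiting can be sketched in user space as a counter of
in-flight direct writes that a reader waits on until it drops to zero.
The sketch assumes the writer drops the count only after it has
invalidated the cache. struct toy_inode and the toy_* helpers are
hypothetical stand-ins, and the polling loop is a simplification rather
than how inode_dio_wait() is implemented in the kernel.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-inode DIO state. */
struct toy_inode {
	atomic_int dio_count;		/* direct writes still in flight */
	atomic_bool cache_stale;	/* a stale page may still be cached */
};

/* Writer side: mark the write in flight before it touches the disk. */
static void toy_dio_begin(struct toy_inode *inode)
{
	atomic_store(&inode->cache_stale, true);
	atomic_fetch_add(&inode->dio_count, 1);
}

/* Writer completion: invalidate first, then drop the in-flight count. */
static void toy_dio_end(struct toy_inode *inode)
{
	assert(atomic_load(&inode->dio_count) > 0);
	atomic_store(&inode->cache_stale, false);	/* "invalidate" */
	atomic_fetch_sub(&inode->dio_count, 1);
}

/* Reader side: poll until no direct write is in flight. */
static void toy_dio_wait(struct toy_inode *inode)
{
	while (atomic_load(&inode->dio_count) != 0)
		sched_yield();
}

/* Buffered read entry: wait, then report whether the cache can be trusted. */
static bool toy_buffered_read_is_safe(struct toy_inode *inode)
{
	toy_dio_wait(inode);
	return !atomic_load(&inode->cache_stale);
}
```

Because the count is dropped only after the invalidation step, a reader
that returns from toy_dio_wait() can no longer observe the stale page
left behind by a write that was in flight when it started waiting.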

Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
---
fs/ext4/file.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0166bb9ca160..99e92ddef97d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -144,6 +144,9 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (iocb->ki_flags & IOCB_DIRECT)
 		return ext4_dio_read_iter(iocb, to);
 
+	/* wait for stale page cache to be invalidated */
+	inode_dio_wait(inode);
+
 	return generic_file_read_iter(iocb, to);
 }

--
2.31.1
