Re: [PATCH v2] ext4: fix race condition between buffer write and page_mkwrite
From: Baokun Li
Date: Mon Apr 15 2024 - 00:28:16 EST
On 2023/6/5 23:08, Jan Kara wrote:
On Mon 05-06-23 15:55:35, Matthew Wilcox wrote:
On Mon, Jun 05, 2023 at 02:21:41PM +0200, Jan Kara wrote:
On Mon 05-06-23 11:16:55, Jan Kara wrote:
Yeah, I agree, that is also the conclusion I have arrived at when thinking
about this problem now. We should be able to just remove the conversion
from ext4_page_mkwrite() and rely on write(2) or truncate(2) doing it when
growing i_size.
OK, thinking more about this and searching through the history, I've
realized why the conversion is originally in ext4_page_mkwrite(). The
problem is described in commit 7b4cc9787fe35b ("ext4: evict inline data
when writing to memory map") but essentially it boils down to the fact that
ext4 writeback code does not expect dirty page for a file with inline data
because ext4_write_inline_data_end() should have copied the data into the
inode and cleared the folio's dirty flag.
Indeed messing with xattrs from the writeback path to copy page contents
into inline data xattr would be ... interesting. Hum, out of good ideas for
now :-|.
Is it so bad? Now that we don't have writepage in ext4, only
writepages, it seems like we have a considerably more benign locking
environment to work in.
Well, yes, without ->writepage() it might be *possible*. But still rather
ugly. The problem is that in ->writepages() i_size is not stable. Thus also
whether the inode data is inline or not is not stable. We'd need inode_lock
for that but that is not easily doable in the writeback path - inode lock
would then become fs_reclaim unsafe...
Honza
Hi Honza!
Hi Ted!
Hi Matthew!
Long time later came back to this, because while discussing another similar
ABBA problem with Hou Tao, he mentioned VM_FAULT_RETRY, and then I
thought that this could be used to solve this problem as well.
The general idea is that if we see a file with inline data in
ext4_page_mkwrite(),
we release the mmap_lock and grab the inode_lock to convert the inline data,
and then return VM_FAULT_RETRY to retry to get the mmap_lock.
The code implementation is as follows, do you have any thoughts?
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 537803250ca9..e044c11c9cf6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6056,15 +6056,36 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
if (unlikely(IS_IMMUTABLE(inode)))
return VM_FAULT_SIGBUS;
+ /*
+ * The ext4 writeback code does not expect dirty page for a file with
+ * inline data, so inline data needs to be converted. But it needs to
+ * hold the inode_lock when converting, and trying to lock the inode
+ * while holding the mmap_lock may result in an ABBA deadlock. So here
+ * we release the mmap_lock for conversion and retry after conversion.
+ * Only one retry is allowed to avoid endless loops.
+ * Acquire xattr_sem to avoid race with inline data conversion.
+ */
+ down_read(&EXT4_I(inode)->xattr_sem);
+ if (ext4_has_inline_data(inode)) {
+ up_read(&EXT4_I(inode)->xattr_sem);
+
+ if (!fault_flag_allow_retry_first(vmf->flags))
+ return VM_FAULT_SIGBUS;
+
+ release_fault_lock(vmf);
+
+ inode_lock(inode);
+ ext4_convert_inline_data(inode);
+ inode_unlock(inode);
+ return VM_FAULT_RETRY;
+ }
+ up_read(&EXT4_I(inode)->xattr_sem);
+
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
filemap_invalidate_lock_shared(mapping);
- err = ext4_convert_inline_data(inode);
- if (err)
- goto out_ret;
-
/*
* On data journalling we skip straight to the transaction handle:
* there's no delalloc; page truncated will be checked later; the
--
With Best Regards,
Baokun Li
.