Re: [RFC PATCH 0/2] Implement concurrent buffered write with folio lock

From: Chi Zhiling
Date: Thu May 01 2025 - 19:58:47 EST


On 2025/5/1 07:45, Dave Chinner wrote:
On Wed, Apr 30, 2025 at 05:03:51PM +0800, Chi Zhiling wrote:
On 2025/4/30 10:05, Dave Chinner wrote:
On Fri, Apr 25, 2025 at 06:38:39PM +0800, Chi Zhiling wrote:
From: Chi Zhiling <chizhiling@xxxxxxxxxx>

This is a patch attempting to implement concurrent buffered writes.
The main idea is to use the folio lock to ensure the atomicity of the
write when writing to a single folio, instead of using the i_rwsem.

I tried the "folio batch" solution, which is a great idea, but during
testing, I encountered an OOM issue because the locked folios couldn't
be reclaimed.

So for now, I can only allow concurrent writes within a single block.
The good news is that since we already support BS > PS, we can use a
larger block size to enable higher granularity concurrency.
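The per-folio granularity idea above can be sketched in userspace: one mutex per fixed-size block stands in for the folio lock, so writers touching different folios run concurrently while writers to the same folio serialize. This is purely illustrative (hypothetical names, pthreads instead of kernel locking), not the patch's actual code:

```c
#include <pthread.h>
#include <string.h>

#define FOLIO_SIZE 4096
#define NR_FOLIOS  4

/*
 * One lock per "folio": writes confined to different folios proceed in
 * parallel, while two writes to the same folio serialize on its lock.
 * This mirrors the idea of using the folio lock, rather than i_rwsem,
 * for write atomicity within a single folio. Illustrative only.
 */
static pthread_mutex_t folio_lock[NR_FOLIOS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};
static char file_data[NR_FOLIOS * FOLIO_SIZE];

/*
 * A buffered write that does not cross a folio boundary (as in the
 * patch, concurrency is only allowed within a single block/folio):
 * only that folio's lock is taken, not a file-wide lock.
 */
void buffered_write(size_t pos, const char *buf, size_t len)
{
	size_t idx = pos / FOLIO_SIZE;

	pthread_mutex_lock(&folio_lock[idx]);
	memcpy(&file_data[pos], buf, len);
	pthread_mutex_unlock(&folio_lock[idx]);
}
```

With a BS > PS filesystem, the same scheme applies with a larger block, giving coarser but still sub-file concurrency granularity.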

I'm not going to say no to this, but I think it's a short term and
niche solution to the general problem of enabling shared buffered
writes. i.e. I expect that it will not exist for long, whilst

Hi, Dave,

Yes, it's a short-term solution, but it's enough for some scenarios.
I would also like to see a better idea.

experience tells me that adding special cases to the IO path locking
has a fairly high risk of unexpected regressions and/or data
corruption....

I can't say there is definitely no data corruption, but I haven't seen
any new errors in xfstests.

Yeah, that's why they are "unexpected regressions" - testing looks
fine, but once it gets out into complex production workloads....

We might need to add some assertions in the code to check for the risk
of data corruption, not specifically for this patch, but for the current
XFS system in general. This would help developers avoid introducing new
bugs, similar to the lockdep tool.

I'm not sure what you envisage here or what problems you think we
might be able to catch - can you describe what you are thinking
about here?

I was just saying that casually.

I mean, is there a way to check for data corruption risks, rather than
waiting for it to happen and then reporting an error? Just like how
lockdep detects deadlock risks in advance.

I guess not.


These ideas come from previous discussions:
https://lore.kernel.org/all/953b0499-5832-49dc-8580-436cf625db8c@xxxxxxx/

In my spare time I've been looking at using the two state lock from
bcachefs for this because it looks to provide a general solution to
the issue of concurrent buffered writes.

In fact, I have tried the two state lock, and it does work quite well.
However, I noticed some performance degradation in single-threaded
scenarios in UnixBench (I'm not sure whether it's caused by the memory
barriers).

Please share the patch - I'd like to see how you implemented it and
how you solved the various lock ordering and wider IO serialisation
issues. It may be that I've overlooked something and your
implementation makes me aware of it. OTOH, I might see something in
your patch that could be improved that mitigates the regression.

I'm afraid I haven't solved those problems yet.

The lock order I envisioned is IOLOCK -> TWO_STATE_LOCK -> MMAP_LOCK ->
ILOCK, which means that before releasing the IOLOCK, the TWO_STATE_LOCK
must be released first, including when upgrading IOLOCK_SHARED to
IOLOCK_EXCL. However, I didn't do this.

I missed this part, and although I didn't encounter any issues in the
xfstests, this could indeed lead to a deadlock.


Besides this, is there anything else I have missed?



The patch is as follows, though it may not be of much help:

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3e7448c2a969..573e31bfef3f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -36,6 +36,17 @@

static const struct vm_operations_struct xfs_file_vm_ops;

+#define TWO_STATE_LOCK(ip, state) \
+ xfs_two_state_lock(&ip->i_write_lock, state)
+
+#define TWO_STATE_UNLOCK(ip, state) \
+ xfs_two_state_unlock(&ip->i_write_lock, state)
+
+#define buffered_lock(inode) TWO_STATE_LOCK(inode, 0)
+#define buffered_unlock(inode) TWO_STATE_UNLOCK(inode, 0)
+#define direct_lock(inode) TWO_STATE_LOCK(inode, 1)
+#define direct_unlock(inode) TWO_STATE_UNLOCK(inode, 1)
+
/*
* Decide if the given file range is aligned to the size of the fundamental
* allocation unit for the file.
@@ -263,7 +274,10 @@ xfs_file_dio_read(
ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
if (ret)
return ret;
+ direct_lock(ip);
ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, NULL, 0);
+ direct_unlock(ip);
+
xfs_iunlock(ip, XFS_IOLOCK_SHARED);

return ret;
@@ -598,9 +612,13 @@ xfs_file_dio_write_aligned(
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
+
+ direct_lock(ip);
trace_xfs_file_direct_write(iocb, from);
ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
&xfs_dio_write_ops, 0, NULL, 0);
+ direct_unlock(ip);
+
out_unlock:
if (iolock)
xfs_iunlock(ip, iolock);
@@ -676,9 +694,11 @@ xfs_file_dio_write_unaligned(
if (flags & IOMAP_DIO_FORCE_WAIT)
inode_dio_wait(VFS_I(ip));

+ direct_lock(ip);
trace_xfs_file_direct_write(iocb, from);
ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
&xfs_dio_write_ops, flags, NULL, 0);
+ direct_unlock(ip);

/*
* Retry unaligned I/O with exclusive blocking semantics if the DIO
@@ -776,9 +796,11 @@ xfs_file_buffered_write(
if (ret)
goto out;

+ buffered_lock(ip);
trace_xfs_file_buffered_write(iocb, from);
ret = iomap_file_buffered_write(iocb, from,
&xfs_buffered_write_iomap_ops);
+ buffered_unlock(ip);

/*
* If we hit a space limit, try to free up some lingering preallocated
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 52210a54fe7e..a8bc8d9737c4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -114,6 +114,7 @@ xfs_inode_alloc(
spin_lock_init(&ip->i_ioend_lock);
ip->i_next_unlinked = NULLAGINO;
ip->i_prev_unlinked = 0;
+ two_state_lock_init(&ip->i_write_lock);

return ip;
}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b91aaa23ea1e..9a8c75feda16 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -8,6 +8,7 @@

#include "xfs_inode_buf.h"
#include "xfs_inode_fork.h"
+#include "xfs_lock.h"

/*
* Kernel only inode definitions
@@ -92,6 +93,8 @@ typedef struct xfs_inode {
spinlock_t i_ioend_lock;
struct work_struct i_ioend_work;
struct list_head i_ioend_list;
+
+ two_state_lock_t i_write_lock;
} xfs_inode_t;

static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
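
The patch references a two_state_lock_t primitive (presumably in a new xfs_lock.h) that is not included above. A minimal userspace sketch of such a lock, loosely modelled on bcachefs's two_state_shared_lock, might look like the following; the names mirror the patch, but the pthread-based implementation is my own assumption, not kernel code:

```c
#include <pthread.h>

/*
 * Two-state shared lock: holders of the same state (e.g. all buffered
 * writers, or all direct writers) share the lock; a holder of the other
 * state must wait until every current holder has dropped it.
 */
typedef struct {
	pthread_mutex_t	mutex;
	pthread_cond_t	wait;
	int		state;	/* state currently held: 0 or 1 */
	int		count;	/* number of current holders */
} two_state_lock_t;

void two_state_lock_init(two_state_lock_t *l)
{
	pthread_mutex_init(&l->mutex, NULL);
	pthread_cond_init(&l->wait, NULL);
	l->state = 0;
	l->count = 0;
}

void xfs_two_state_lock(two_state_lock_t *l, int state)
{
	pthread_mutex_lock(&l->mutex);
	/* Wait while the lock is held in the other state. */
	while (l->count > 0 && l->state != state)
		pthread_cond_wait(&l->wait, &l->mutex);
	l->state = state;
	l->count++;
	pthread_mutex_unlock(&l->mutex);
}

void xfs_two_state_unlock(two_state_lock_t *l, int state)
{
	(void)state;	/* kept for API symmetry with the patch's macros */
	pthread_mutex_lock(&l->mutex);
	if (--l->count == 0)
		pthread_cond_broadcast(&l->wait);
	pthread_mutex_unlock(&l->mutex);
}
```

The single mutex/condvar pair is the likely source of the single-threaded overhead mentioned above: even an uncontended lock/unlock pays two atomic operations, whereas the kernel version could use a lighter fast path.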



Thanks


Since single-threaded buffered IO is still the primary read-write mode,
I don't want to introduce too much impact in single-threaded scenarios.

I mostly don't care that much about small single threaded
performance regressions anywhere in XFS if there is some upside for
scalability or performance. We've always traded off single threaded
performance for better concurrency and/or scalability in XFS (right
from the initial design choices way back in the early 1990s), so I
don't see why we'd treat a significant improvement in buffered IO
concurrency any differently.

-Dave.