[PATCH RFC 0/7] buffered block atomic writes
From: John Garry
Date: Mon Apr 22 2024 - 10:43:17 EST
This series introduces a proof-of-concept for buffered block atomic
writes.
There is a requirement for userspace to be able to issue a write which
will not be torn due to HW or some other failure. A solution is presented
in [0] and [1].
The series mentioned above only support atomic writes for direct IO. The
primary target of atomic (or untorn) writes is DBs like InnoDB/MySQL,
which require direct IO support. However, as mentioned in [2], there is
also a desire to support atomic writes for DBs which use buffered writes,
like Postgres.
The issue raised in [2] was that the API proposed is not suitable for
buffered atomic writes. Specifically, since the API permits a range of
sizes of atomic writes, it is too difficult to track in the pagecache the
geometry of atomic writes which overlap with other atomic writes of
differing sizes and alignment. In addition, tracking and handling
overlapping atomic and non-atomic writes is also difficult.
In this series, buffered atomic writes are supported based upon the
following principles:
- A buffered atomic write requires the RWF_ATOMIC flag to be set, same as
direct IO. The other atomic write rules also apply, such as power-of-2
size and natural alignment.
- For an inode, only a single size of buffered write is allowed. So for
statx, atomic_write_unit_min = atomic_write_unit_max always for
buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. Folios match
atomic writes well, as an atomic write must be a power-of-2 in size and
naturally aligned.
- A folio is tagged as "atomic" when atomically written. If an "atomic"
folio is fully or partially overwritten with a non-atomic write, the
folio loses its atomicity. Indeed, issuing a non-atomic write over an
atomic write would typically be seen as a userspace bug.
- If userspace wants to guarantee a buffered atomic write is written to
media atomically after the write syscall returns, it must use RWF_SYNC
or similar (along with RWF_ATOMIC).
This series only supports buffered atomic writes for XFS. I do have some
patches for buffered atomic writes via the bdev file operations. I did not
include them, as:
a. I don't know of any requirement for this support
b. atomic_write_unit_min and atomic_write_unit_max would be fixed at
PAGE_SIZE there. This is very limiting. However, an API like BLKBSZSET
could be added to allow userspace to program the values for
atomic_write_unit_{min, max}.
c. We may want to support atomic_write_unit_{min, max} < PAGE_SIZE, and
this becomes more complicated to support.
d. I would like to see what happens with bs > ps work there.
This series is just an early proof-of-concept, to prove that the API
proposed for block atomic writes can work for buffered IO. I would like to
unblock that direct IO series and have it merged.
Patches are based on [0], [1], and [3] (the bs > ps series). For the bs >
ps series, I had to borrow an earlier filemap change which allows the
folio min and max order to be selected.
All patches can be found at:
https://github.com/johnpgarry/linux/tree/atomic-writes-v6.9-v6-fs-v2-buffered
[0] https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@xxxxxxxxxx/
[1] https://lore.kernel.org/linux-block/20240304130428.13026-1-john.g.garry@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@xxxxxxx/
[3] https://lore.kernel.org/linux-xfs/20240313170253.2324812-1-kernel@xxxxxxxxxxxxxxxx/
John Garry (7):
  fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO
  filemap: Change mapping_set_folio_min_order() ->
    mapping_set_folio_orders()
  mm: Add PG_atomic
  fs: Add initial buffered atomic write support info to statx
  fs: iomap: buffered atomic write support
  fs: xfs: buffered atomic writes statx support
  fs: xfs: Enable buffered atomic writes
 block/bdev.c                   |  9 +++---
 fs/iomap/buffered-io.c         | 53 +++++++++++++++++++++++++++++-----
 fs/iomap/trace.h               |  3 +-
 fs/stat.c                      | 26 ++++++++++++-----
 fs/xfs/libxfs/xfs_inode_buf.c  |  8 +++++
 fs/xfs/xfs_file.c              | 12 ++++++--
 fs/xfs/xfs_icache.c            | 10 ++++---
 fs/xfs/xfs_ioctl.c             |  3 ++
 fs/xfs/xfs_iops.c              | 11 +++++--
 include/linux/fs.h             |  3 +-
 include/linux/iomap.h          |  1 +
 include/linux/page-flags.h     |  5 ++++
 include/linux/pagemap.h        | 20 ++++++++-----
 include/trace/events/mmflags.h |  3 +-
 include/uapi/linux/stat.h      |  6 ++--
 mm/filemap.c                   |  8 ++++-
 16 files changed, 141 insertions(+), 40 deletions(-)
--
2.31.1