Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback

From: Amir Goldstein
Date: Wed Apr 10 2019 - 06:42:38 EST


On Wed, Apr 10, 2019 at 12:10 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Tue 09-04-19 21:01:54, Amir Goldstein wrote:
> > On Tue, Apr 9, 2019 at 6:43 PM Jan Kara <jack@xxxxxxx> wrote:
> > > On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > > > writeback") claims that sync_file_range(2) syscall was "created for
> > > > userspace to be able to issue background writeout and so waiting for
> > > > in-flight IO is undesirable there" and changes the writeback (back) to
> > > > WB_SYNC_NONE.
> > > >
> > > > This claim is only partially true. Is is true for users that use the flag
> > > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > > > the reason for changing to WB_SYNC_NONE writeback.
> > > >
> > > > However, that claim is not true for users that use that flag combination
> > > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}. Those users explicitly
> > > > requested to wait for in-flight IO as well as to writeback of dirty
> > > > pages. sync_file_range(2) man page describes this flag combintaion as
> > > > "write-for-data-integrity operation", although some may argue against
> > > > this definition.
> > > >
> > > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> > > > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > > > writeback, to perform the range sync request.
> > > >
> > > > One intended use case for this API is creating a dependency between
> > > > a new file's content and its link into the namepsace without requiring
> > > > journal commit and flushing of disk volatile caches:
> > > >
> > > > fd = open(".", O_TMPFILE | O_RDWR);
> > > > write(fd, newdata, count);
> > > > sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
> > > > linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> > > >
> > > > For many local filesystems, ext4 and xfs included, the sequence above
> > > > will guaranty that after crash, either 'newfile' exists with 'newdata'
> > > > content or 'newfile' does not exists. For some applications, this
> > > > guaranty is strong enough and the effects of sync_file_range(2), even
> > > > with WB_SYNC_ALL, are far less intrusive to other writers in the system
> > > > than the effects of fdatasync(2).
> > >
> > > I agree that this paragraph is true but I don't want any userspace program
> > > rely on this. We've been through this when ext3 got converted to ext4 and
> > > it has caused a lot of complaints. Ext3 had provided rather strong data vs
> > > metadata ordering due to the way journalling was implemented. Then ext4
> > > came, implemented delayed allocation and somewhat changed how journalling
> > > works and suddently userspace programmers were caught by surprise their code
> > > working by luck stopped working. And they were complaining that when their
> > > code worked without fsync(2) before, it should work after it as well. So it
> > > took a lot of explaining that their applications are broken and Ted has
> > > also implemented some workarounds to make at least the most common mistakes
> > > silently fixed by the kernel most of the time.
> > >
> > > So I'm against providing 90% data integrity guarantees because there'll be
> > > many people who'll think they can get away with it and then complain when
> > > they won't once our implementation changes.
> > >
> > > Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
> > > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
> > > write-for-data-integrity.
> >
> > OK. I do agree that manpage is misleading and that the language describing
> > this flag combination should be toned down. I do not object to adding more
> > and more disclaimers about this API not being a data integrity API.
>
> I don't think we need more disclaimers, I'd just reword that
> "write-for-data-integrity" bit.
>
> > But the requirement I have is a real life workload and fdatasync(2) is not at
> > all a viable option when application is creating many files per second.
> >
> > I need to export the functionality of data writeback to userspace.
> > Objecting to expose this functionality via the interface that has been
> > documented to expose it for years and used to expose it in the
> > past (until the Fixes commit) is a bit weird IMO, even though I do
> > completely relate to your concern.
> >
> > I don't mind removing the "intended use case" part of commit message
> > if that helps reducing the chance that people misunderstand
> > the guaranties of the API.
> >
> > Another thing I could do is introduce a new flag for sync_file_range()
> > that will really mean what I want it to mean (all data will be written back
> > and all related inode metadata changes will be committed to journal
> > before the next inode metadata change is committed). For the sake of
> > argument let's call it SYNC_FILE_RANGE_DATA_DEPENDENCY.
>
> So there are two parts to your requirements:
>
> 1) Data writeback for all file pages has been submitted (+ completed).
> 2) What happens with related metadata.
>
> sync_file_range() is fine for 1) although I agree that WB_SYNC_NONE mode
> can still end up skipping some pages due to unrelated writeback / lock
> contention so having sync_file_range() mode doing WB_SYNC_ALL writeback
> makes some sense to me.
>

OK. does it make enough sense for you to ACK my patch if the
commit message was toned down to:

Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
writeback") claims that sync_file_range(2) syscall was "created for
userspace to be able to issue background writeout and so waiting for
in-flight IO is undesirable there" and changes the writeback (back) to
WB_SYNC_NONE.

This claim is only partially true. It is true for users that use the flag
SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
the reason for changing to WB_SYNC_NONE writeback.

However, that claim is not true for users that use that flag combination
SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.
Those users explicitly requested to wait for in-flight IO as well as to
writeback of dirty pages.

Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT
and use the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
writeback, to perform the range sync request.

Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE")
Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx>

> What happens with 2) is pretty much up to the filesystem. You are speaking
> about the journal but that is very much a filesystem specific detail.
> What if the filesystem doesn't have journal at all? E.g. ext2? We cannot
> really expose a generic API that silently fails to work based on fs
> internal details.
>
> I guess I fail to see what kind of guarantees from the fs you're looking
> for? When you speak about journal, I guess you're looking for some kind of
> ordering but it's hard to be sure about exact requirements. So can you
> please spell those out? Once we know requirements, we can speak about the
> API.
>

The requirement is quite easy to explain with an example, less easy to
generalize. I illustrated the requirement with the example:
O_TMPFILE;write();DATA_BARRIER;linkat()

The requirement of the DATA_BARRIER operation is that if the effects
of linkat() are observed after crash, then the effects of write() must also
be observed after crash.

At this point one may say that fdatasync(2) meets the requirements
of a DATA_BARRIER operation, so I need to introduce a non-requirement.
The non-requirement is that DATA_BARRIER operation does not need
to be a DATA_INTEGRITY operation, meaning that after crash, there
is no requirement to be able to find or access the written data.

Now it should be clear to filesystem developers why the DATA_BARRIER
requirement is easier to meet with far less performance penalty than
fdatasync(2) on xfs and ext4.

Here is a clumsy attempt to generalize the guaranty of such operation,
without referring to internal fs implementation details:
"Make sure that all file data hits persistent storage before the next
file metadata change hits persistent storage".

Does that make sense to you?
Is the requirement clear?

How does this sound for an API:
sync_file_range(fd, SYNC_FILE_RANGE_WRITE_DATA_BARRIER, 0, 0);

which individual filesystems could support or EOPNOTSUPP
via the ->fsync() file_operation?

Thanks,
Amir.