Re: POSIX violation by writeback error
From: Jeff Layton
Date: Tue Sep 25 2018 - 12:41:25 EST
On Tue, 2018-09-25 at 11:46 -0400, Theodore Y. Ts'o wrote:
> On Tue, Sep 25, 2018 at 07:15:34AM -0400, Jeff Layton wrote:
> > Linux has dozens of filesystems and they all behave differently in this
> > regard. A catastrophic failure (paradoxically) makes things simpler for
> > the fs developer, but even on local filesystems isolated errors can
> > occur. It's also not just NFS -- what mostly started me down this road
> > was working on ENOSPC handling for CephFS.
> > I think it'd be good to at least establish a "gold standard" for what
> > filesystems ought to do in this situation. We might not be able to
> > achieve that in all cases, but we could then document the exceptions.
> I'd argue the standard should be the precedent set by AFS and NFS.
> AFS verifies space available on close(2) and returns ENOSPC from the
> close(2) system call if space is not available. At MIT Project
> Athena, where we used AFS extensively in the late 80's and early 90's,
> we made and contributed back changes to avoid data loss as a result of
> quota errors.
> The best practice that should be documented for userspace is when
> writing precious files, programs should open for writing foo.new, write
> out the data, call fsync() and check the error return, call close()
> and check the error return, and then call rename(foo.new, foo) and
> check the error return. Writing a library function which does this,
> and which also copies the ACL's and xattr's from foo to foo.new before
> the rename() would probably help, but not as much as we might think.
> ("Precious files" here means files such as source files written by
> editors, not object files and other generated files written by
> compilers and similar programs.)
> None of this is really all that new. We had the same discussion back
> during the O_PONIES controversy, and we came out in the same place.
> - Ted
> P.S. One thought: it might be cool if there was some way for
> userspace applications to mark files with "nuke if not closed" flag,
> such that if the system crashes, the file systems would automatically
> unlink the file after a reboot or if the process was killed or exits
> without an explicit close(2). For networked/remote file systems that
> supported this flag, after the client comes back up after a reboot, it
> could notify the server that all files created previously from that
> client should be unlinked.
> Unlike O_TMPFILE, this would require file system changes to support,
> so maybe it's not worth having something which automatically cleans up
> files that were in the middle of being written at the time of a system
> crash. (Especially since you can get most of the functionality by
> using some naming convention for files that are in the process of being
> written, and then teach some program that is regularly scanning the
> entire file system, such as updatedb(8), to nuke the files from a cron
> job. It won't be as efficient, but it would be much easier to
> implement.)
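
For reference, the sequence Ted describes for precious files could be
sketched in C roughly as follows. This is only an illustration:
replace_file() is a made-up name, not an existing library function, and
it leaves out the ACL and xattr copying he mentions.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an existing library call): replace `path`
 * with `len` bytes from `buf` by writing path.new, then fsync, close
 * and rename, checking the error return at every step.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int replace_file(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    const char *p = buf;
    size_t left = len;
    int saved;
    int fd;

    int n = snprintf(tmp, sizeof(tmp), "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof(tmp)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write out the data, handling short writes and EINTR */
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        left -= (size_t)w;
    }

    /* fsync() is where a writeback error is expected to surface */
    if (fsync(fd) < 0)
        goto fail;

    /* close() can also report errors (e.g. ENOSPC on AFS) */
    if (close(fd) < 0) {
        fd = -1;        /* do not close again after a failed close */
        goto fail;
    }
    fd = -1;

    /* only now replace the old file */
    if (rename(tmp, path) < 0)
        goto fail;
    return 0;

fail:
    saved = errno;
    if (fd >= 0)
        close(fd);
    unlink(tmp);        /* leave the original foo untouched */
    errno = saved;
    return -1;
}
```

A caller would use it as replace_file("foo", data, len) and treat any -1
return as "the old foo is still in place".
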
That's all well and good, but still doesn't quite solve the main concern
with all of this. Suppose we have this series of events:

    open file r/w
    write 1024 bytes to offset 0
    <background writeback that fails>
    read 1024 bytes from offset 0

Open, write and read are successful, and there was no fsync or close in
between them. Will that read reflect the result of the previous write or
not?

The answer today is "it depends".
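
To make the sequence concrete, here is a minimal C sketch of it. The
function name write_then_read() is hypothetical, and a program like this
cannot force or observe the failed writeback itself; it only shows where
in the sequence that failure would fall.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define LEN 1024

/*
 * Hypothetical helper: returns 1 if the read returned exactly the bytes
 * just written, 0 if it returned different data, and -1 if any call
 * failed or transferred fewer than LEN bytes.
 */
static int write_then_read(const char *path)
{
    char out[LEN], in[LEN];
    ssize_t n;
    int fd;

    memset(out, 'A', sizeof(out));

    /* open file r/w */
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write 1024 bytes to offset 0 */
    if (pwrite(fd, out, LEN, 0) != LEN) {
        close(fd);
        return -1;
    }

    /*
     * <background writeback that fails> would happen here. There is
     * no fsync() or close() before the read, so nothing has reported
     * an error to the application at this point.
     */

    /* read 1024 bytes from offset 0 */
    n = pread(fd, in, LEN, 0);
    close(fd);
    if (n != LEN)
        return -1;

    return memcmp(in, out, LEN) == 0;
}
```

On a healthy filesystem this returns 1. The open question is what it
returns when the writeback in the middle fails.
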
Jeff Layton <jlayton@xxxxxxxxxx>