Re: [GIT PULL] gfs2 fix

From: Andreas Gruenbacher
Date: Wed Apr 27 2022 - 17:26:46 EST


On Wed, Apr 27, 2022 at 10:26 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote:
> >
> > I wonder if this could be documented in the read and write manual
> > pages. Or would that be asking too much?
>
> I don't think it would be asking too much, since it's basically just
> describing what Linux has always done in all the major filesystems.
>
> Eg look at filemap_read(), which is basically the canonical read
> function, and note how it doesn't take a single lock at that level.
>
> We *do* have synchronization at a page level, though, ie we've always
> had that page-level "uptodate" bit, of course (ok, so "always" isn't
> true - back in the distant past it was the 'struct buffer_head' that
> was the synchronization point).
>
> That said, even that is not synchronizing against "new writes", but
> only against "new creations" (which may, of course, be writers, but is
> equally likely to be just reading the contents from disk).
>
> That said:
>
> (a) different filesystems can and will do different things.
>
> Not all filesystems use filemap_read() at all, and even the ones that
> do often have their own wrappers. Such wrappers *can* do extra
> serialization, and have their own rules. But ext4 does not, for
> example (see ext4_file_read_iter()).
>
> And as mentioned, I *think* XFS honors that old POSIX rule for
> historical reasons.
>
> (b) we do have *different* locking
>
> for example, we these days do actually serialize properly on the
> file->f_pos, which means that a certain *class* of read/write things
> are atomic wrt each other, because we actually hold that f_pos lock
> over the whole operation and so if you do file reads and writes using
> the same file descriptor, they'll be disjoint.
>
> That, btw, hasn't always been true. If you had multiple threads using
> the same file pointer, I think we used to get basically random
> results. So we have actually strengthened our locking in this area,
> and made it much better.
>
> But note how even if you have the same file descriptor open, and then
> do pread/pwrite, those can and will happen concurrently.
>
> And mmap accesses and modifications are obviously *always* concurrent,
> even if the fault itself - but not the accesses - might end up being
> serialized due to some filesystem locking implementation detail.
>
> End result: the exact serialization is complex, depends on the
> filesystem, and is just not really something that should be described
> or even relied on (eg that f_pos serialization is something we do
> properly now, but didn't necessarily do in the past, so ..)
>
> Is it then worth pointing out one odd POSIX rule that basically nobody
> but some very low-level filesystem people have ever heard about, and
> that no version of Linux has ever conformed to in the main default
> filesystems, and that no user has ever cared about?

Well, POSIX explicitly mentions those atomicity expectations, e.g.,
for read [1]:

"I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
means that all the bytes from a single operation that started out together
end up together, without interleaving from other I/O operations."

Users who hear about it from POSIX are led to assume that this atomic
behavior is "real", and the Linux man pages do nothing to rob them of
that illusion. They do document that the file offset aspect has been
fixed though (as of Linux 3.14, in the BUGS sections of read(2) and
write(2)), which only makes things more confusing.
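
For what it's worth, the file offset part is easy to check from user
space. Below is a rough sketch, assuming Python 3 on a Linux kernel
that takes the f_pos lock (3.14 or later, as far as I know). The
record layout and the helper names make_file() and read_shared_fd()
are made up for this illustration. Several threads call read() on the
same file descriptor, and every record should show up exactly once:

```python
import os
import tempfile
import threading

REC = 4096    # record size in bytes (arbitrary choice for this sketch)
NRECS = 2048  # number of records in the test file


def make_file(path):
    """Write NRECS records; each record is its own index repeated."""
    with open(path, "wb") as f:
        for i in range(NRECS):
            f.write(i.to_bytes(4, "little") * (REC // 4))


def read_shared_fd(path, nthreads=4):
    """Let several threads read() from ONE descriptor; return indices seen."""
    fd = os.open(path, os.O_RDONLY)
    seen = []
    lock = threading.Lock()

    def worker():
        local = []
        while True:
            # read() uses and advances the shared file offset.
            buf = os.read(fd, REC)
            if not buf:
                break
            local.append(int.from_bytes(buf[:4], "little"))
        with lock:
            seen.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.close(fd)
    return seen


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "records")
        make_file(path)
        seen = read_shared_fd(path)
        # With f_pos serialization, no record is skipped or read twice.
        print("each record seen once:", sorted(seen) == list(range(NRECS)))
```

This only exercises the offset update. It says nothing about whether
the data returned by a read can interleave with a concurrent write.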

So from that point of view, I think it would be worthwhile to mention
that most if not all filesystems ignore the "non-interleaving" aspect.
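
To make the "non-interleaving" point concrete, here is a minimal probe,
again assuming Python 3 on a POSIX system. The 1 MiB size, the round
count and the name count_torn_reads() are arbitrary choices for this
sketch. One thread keeps overwriting the same range with all 'A's or
all 'B's via pwrite() while the main thread pread()s that range. A read
that is atomic in the POSIX sense would only ever return one of the two
payloads:

```python
import os
import tempfile
import threading

SIZE = 1 << 20  # 1 MiB, so one I/O spans many page cache pages


def count_torn_reads(path, rounds=200):
    """Count pread() results that mix two concurrent pwrite() payloads."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    payloads = (b"A" * SIZE, b"B" * SIZE)
    os.pwrite(fd, payloads[0], 0)
    stop = threading.Event()

    def writer():
        i = 0
        while not stop.is_set():
            # Alternate between the two payloads at offset 0.
            os.pwrite(fd, payloads[i & 1], 0)
            i += 1

    t = threading.Thread(target=writer)
    t.start()
    torn = 0
    try:
        for _ in range(rounds):
            data = os.pread(fd, SIZE, 0)
            # Anything other than a pure payload is an interleaved read.
            if len(data) == SIZE and data not in payloads:
                torn += 1
    finally:
        stop.set()
        t.join()
        os.close(fd)
    return torn


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        n = count_torn_reads(os.path.join(d, "probe"))
        print("torn reads:", n)
```

The count depends on the filesystem, the kernel and timing. A result
of zero does not prove atomicity. A non-zero result on a given
filesystem does show that reads and writes there can interleave.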

[1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/read.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/write.html
[3] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/V2_chap02.html#tag_15_09_07

Thanks,
Andreas