Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()

From: Paul E. McKenney
Date: Thu Dec 05 2024 - 16:17:16 EST


On Thu, Dec 05, 2024 at 09:15:14PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 9:01 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> > > On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > > > > void fd_install(unsigned int fd, struct file *file)
> > > > > > > {
> > > > > > > - struct files_struct *files = current->files;
> > > > > > > + struct files_struct *files;
> > > > > > > struct fdtable *fdt;
> > > > > > >
> > > > > > > if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > > > > return;
> > > > > > >
> > > > > > > + /*
> > > > > > > + * Synchronized with expand_fdtable(), see that routine for an
> > > > > > > + * explanation.
> > > > > > > + */
> > > > > > > rcu_read_lock_sched();
> > > > > > > + files = READ_ONCE(current->files);
> > > > > >
> > > > > > What are you trying to do with that READ_ONCE()? current->files
> > > > > > itself is *not* changed by any of that code; current->files->fdt is.
> > > > >
> > > > > To my understanding this is the idiomatic way of spelling out
> > > > > smp_consume_load (which does not exist in Linux), for the
> > > > > resize_in_progress flag.
> > > >
> > > > In Linux, "smp_consume_load()" is named rcu_dereference().
> > >
> > > ok
> >
> > And rcu_dereference(), and for that matter memory_order_consume, only
> > orders the load of the pointer, and subsequent dereferences of that
> > same pointer, against the accesses preceding the store of that pointer.
> >
> >         T1                              T2
> >         a: p->a = 1;                    d: q = rcu_dereference(gp);
> >         b: r1 = p->b;                   e: r2 = q->a;
> >         c: rcu_assign_pointer(gp, p);   f: q->b = 42;
> >
> > Here, if (and only if!) T2's load into q gets the value stored by
> > T1's statement c, then T2's statements e and f are guaranteed to happen
> > after T1's statements a and b. In your patch, I do not see this pattern
> > for the files->resize_in_progress flag.
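> >
> > If it helps, in litmus form this is the usual publish/subscribe pattern.
> > A minimal sketch (written from memory, roughly along the lines of
> > MP+onceassign+derefonce.litmus in tools/memory-model/litmus-tests/, with
> > illustrative names, so treat the details as approximate):
> >
> > C pub-sub-sketch
> >
> > {
> >         y=z;
> >         z=0;
> > }
> >
> > P0(int *x, int **y)
> > {
> >         WRITE_ONCE(*x, 1);
> >         rcu_assign_pointer(*y, x);
> > }
> >
> > P1(int *x, int **y)
> > {
> >         int *r0;
> >         int r1;
> >
> >         rcu_read_lock();
> >         r0 = rcu_dereference(*y);
> >         r1 = READ_ONCE(*r0);
> >         rcu_read_unlock();
> > }
> >
> > exists (1:r0=x /\ 1:r1=0)
> >
> > The "exists" clause asks whether the reader can see the newly published
> > pointer but the pre-initialization value behind it; herd7 reports that
> > as never satisfied, which is exactly the ordering described above.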
> >
> > > > > Anyway to elaborate I'm gunning for a setup where the code is
> > > > > semantically equivalent to having a lock around the work.
> > > >
> > > > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > > > only with later RCU grace periods, such as those implemented by
> > > > synchronize_rcu().
> > >
> > > To my understanding that precondition is already met: the flag is set
> > > upfront and we wait for everyone to finish (which already happens in
> > > the stock code), plus the flag is checked within the section.
> >
> > I freely confess that I do not understand the purpose of assigning to
> > files->resize_in_progress both before (pre-existing) and within (added)
> > expand_fdtable(). If the assignments before and after the call to
> > expand_fdtable() and the checks were under that lock, that could work,
> > but removing that lockless check might have performance and scalability
> > consequences.
> >
> > > > > Pretend ->resize_lock exists, then:
> > > > >
> > > > > fd_install:
> > > > >         files = current->files;
> > > > >         read_lock(&files->resize_lock);
> > > > >         fdt = rcu_dereference_sched(files->fdt);
> > > > >         rcu_assign_pointer(fdt->fd[fd], file);
> > > > >         read_unlock(&files->resize_lock);
> > > > >
> > > > > expand_fdtable:
> > > > >         write_lock(&files->resize_lock);
> > > > >         [snip]
> > > > >         rcu_assign_pointer(files->fdt, new_fdt);
> > > > >         write_unlock(&files->resize_lock);
> > > > >
> > > > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > > > synchronize_rcu do it.
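> > > > >
> > > > > To spell out what I mean by "appropriately fenced", this is roughly
> > > > > what fs/file.c does today (quoted from memory, details approximate):
> > > > >
> > > > > fd_install:
> > > > >         rcu_read_lock_sched();
> > > > >         if (unlikely(files->resize_in_progress)) {
> > > > >                 /* resize underway, fall back to the spinlock */
> > > > >                 rcu_read_unlock_sched();
> > > > >                 spin_lock(&files->file_lock);
> > > > >                 ...
> > > > >                 return;
> > > > >         }
> > > > >         /* coupled with smp_wmb() in expand_fdtable() */
> > > > >         smp_rmb();
> > > > >         fdt = rcu_dereference_sched(files->fdt);
> > > > >         rcu_assign_pointer(fdt->fd[fd], file);
> > > > >         rcu_read_unlock_sched();
> > > > >
> > > > > expand_files (with files->file_lock held):
> > > > >         files->resize_in_progress = true;
> > > > >         expand_fdtable(files, nr);
> > > > >                 /* drops the lock, allocates the new table */
> > > > >                 synchronize_rcu();  /* wait out lockless fd_install() */
> > > > >                 /* retakes the lock, copies the old table over */
> > > > >                 rcu_assign_pointer(files->fdt, new_fdt);
> > > > >                 /* coupled with smp_rmb() in fd_install() */
> > > > >                 smp_wmb();
> > > > >         files->resize_in_progress = false;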
> > > >
> > > > OK, good, you did get the grace-period part of the puzzle.
> > > >
> > > > However, please keep in mind that synchronize_rcu() has significant
> > > > latency by design. There is a tradeoff between CPU consumption and
> > > > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > > > several milliseconds (not microseconds or nanoseconds). I would be very
> > > > surprised if expand_fdtable() users would be happy with such a long delay.
> > >
> > > The call has been there since 2015 and I only know of one case where
> > > someone took issue with it (and that could have been sorted out with a
> > > dup2 upfront to grow the table to the desired size). Amusingly, I see
> > > you patched it in 2018 from synchronize_sched() to synchronize_rcu().
> > > Bottom line though is that I'm not *adding* any latency here. :)
> >
> > Are you saying that the smp_rmb() is unnecessary? It doesn't seem like
> > you are saying that, because otherwise your patch could simply remove
> > it without additional code changes. On the other hand, if it is a key
> > component of the synchronization, I don't see how that smp_rmb() can be
> > removed while still preserving that synchronization without adding another
> > synchronize_rcu() to that function to compensate.
> >
> > Now, it might be that you are somehow cleverly reusing the pre-existing
> > synchronize_rcu(), but I am not immediately seeing how this would work.
> >
> > And no, I do not recall making that particular change back in the
> > day, only that I did change all the calls to synchronize_sched() to
> > synchronize_rcu(). Please accept my apologies for my having failed
> > to meet your expectations. And do not be too surprised if others have
> > similar expectations of you at some point in the future. ;-)
> >
> > > So assuming the above can be ignored, do you confirm the patch works
> > > (even if it needs some cosmetic changes)?
> > >
> > > The entirety of the patch is about removing the smp_rmb() in fd_install()
> > > with a small code rearrangement, while relying on the machinery which is
> > > already there.
> >
> > The code to be synchronized is fairly small. So why don't you
> > create a litmus test and ask herd7? Please see tools/memory-model for
> > documentation and other example litmus tests. This tool does the moral
> > equivalent of a full state-space search of the litmus tests, telling you
> > whether your "exists" condition is always, sometimes, or never satisfied.
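> >
> > (For the mechanics: put the test in a .litmus file and, from
> > tools/memory-model/, run something like "herd7 -conf linux-kernel.cfg
> > your-test.litmus"; the Never/Sometimes/Always in the output is the
> > verdict on the "exists" clause. See that directory's README for details.)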
> >
>
> I think there is quite a degree of talking past each other in this thread.
>
> I was not aware of herd7. Testing the thing with it sounds like a way
> to settle this, so I'm going to do that and get back to you in a day
> or two. Worst case the patch is a bust, best case the fence turns out
> to already be of no use.

Very good! My grouchiness earlier in this thread notwithstanding, I am
happy to review your litmus tests. (You will likely need more than one.)

And the inevitable unsolicited advice: Make those litmus tests as small
as you possibly can. Full-state-space search is extremely powerful,
but it does not scale very well.

Thanx, Paul