Re: [PATCH v2 2/4] cxl/mem: Fix synchronization mechanism for device removal vs ioctl operations
From: Dan Williams
Date: Tue Mar 30 2021 - 15:44:06 EST
On Tue, Mar 30, 2021 at 12:26 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Tue, Mar 30, 2021 at 12:00:23PM -0700, Dan Williams wrote:
>
> > > > > IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.
> > > >
> > > > This usage of srcu is functionally equivalent to replacing
> > > > srcu_read_lock() with down_read() and the shutdown path with:
> > >
> > > Sort of, but the rules for load/store under RCU are different than for
> > > load/store under a normal barriered lock. All the data is unstable for
> > > instance and minimially needs READ_ONCE.
> >
> > The data is unstable under the srcu_read_lock until the end of the
> > next rcu grace period, synchronize_rcu() ensures all active
> > srcu_read_lock() sections have completed.
>
> No, that isn't how I would phrase it. *any* write side data under RCU
> is always unstable by definition in the read side because the write
> side can always race with any reader. Thus one should use the RCU
> accessors to deal with that data race, and get some acquire/release
> semantics when pointer chasing (though this doesn't matter here)
>
> > Unless Paul and I misunderstood each other, this scheme of
> > synchronizing object state is also used in kill_dax(), and I put
> > that comment there the last time this question was raised. If srcu
> > was being used to synchronize the liveness of an rcu object like
> > @cxlm or a new ops object then I would expect rcu_dereference +
> > rcu_assign_pointer around usages of that object. The liveness of the
> > object in this case is handled by kobject reference, or inode
> > reference in the case of kill_dax() outside of srcu.
>
> It was probably a mis-understanding as I don't think Paul would say
> you should read data in thread A and write to it in B without using
> READ_ONCE/WRITE_ONCE or a stronger atomic to manage the data race.
>
> The LWN articles on the "big bad compiler" are informative here. You
> don't want the compiler to do a transformation where it loads
> state_in_sysfs multiple times and gets different answers. This is what
> READ_ONCE is aiming to prevent.
>
> Here is it just a boolean flag, and the flag is only cleared, so risks
> are low, but it still isn't a technically correct way to use RCU.
>
> (and yes the kernel is full of examples of not following the memory
> model strictly)
Ok, so this is the disagreement. Strict adherence to the model where
it does not matter in practice. In that sense this use is not
idiomatic, and the fact that we've spilled this many bytes in emails
back and forth is good enough reason to change it.
>
> > > > cdev_device_del(...);
> > > > down_write(...):
> > > > up_write(...);
> > >
> > > The lock would have to enclose the store to state_in_sysfs, otherwise
> > > as written it has the same data race problems.
> >
> > There's no race above. The rule is that any possible observation of
> > ->state_in_sysfs == 1, or rcu_dereference() != NULL, must be
> > flushed.
>
> It is not about the flushing.
Ok, it's not about the flushing it's about whether the store to
state_in_sysfs can leak past up_write(). If that store can leak then
neither arrangement of:
cdev_device_del(...);
down_write(...):
up_write(...);
...or:
down_write(...):
cdev_device_del(...);
up_write(...);
...is sufficient.