Re: [PATCH v2 2/4] cxl/mem: Fix synchronization mechanism for device removal vs ioctl operations

From: Dan Williams
Date: Tue Mar 30 2021 - 15:01:18 EST


On Tue, Mar 30, 2021 at 10:54 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> On Tue, Mar 30, 2021 at 10:31:15AM -0700, Dan Williams wrote:
> > On Tue, Mar 30, 2021 at 10:03 AM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Mar 30, 2021 at 09:05:29AM -0700, Dan Williams wrote:
> > >
> > > > > If you can't clearly point to the *data* under RCU protection it is
> > > > > being used wrong.
> > > >
> > > > Agree.
> > > >
> > > > The data being protected is the value of
> > > > dev->kobj.state_in_sysfs. The
> > >
> > > So where is that read under:
> > >
> > > + idx = srcu_read_lock(&cxl_memdev_srcu);
> > > + rc = __cxl_memdev_ioctl(cxlmd, cmd, arg);
> > > + srcu_read_unlock(&cxl_memdev_srcu, idx);
> > >
> > > ?
> >
> > device_is_registered() inside __cxl_memdev_ioctl().
>
> Oh, I see, I missed that
>
> > > It can't read the RCU protected data outside the RCU critical region,
> > > and it can't read/write RCU protected data without using the helper
> > > macros which insert the required barriers.
> >
> > The required barriers are there. srcu_read_lock() +
> > device_is_registered() is paired with cdev_device_del() +
> > synchronize_rcu().
>
> RCU needs barriers on the actual load/store; just a naked
> device_is_registered() alone is not strong enough.
>
> > > IMHO this can't use 'dev->kobj.state_in_sysfs' as the RCU protected data.
> >
> > This usage of srcu is functionally equivalent to replacing
> > srcu_read_lock() with down_read() and the shutdown path with:
>
> Sort of, but the rules for load/store under RCU are different than for
> load/store under a normal barriered lock. All the data is unstable for
> instance and minimally needs READ_ONCE.

The data is unstable under srcu_read_lock() until the end of the
next rcu grace period; synchronize_rcu() ensures all active
srcu_read_lock() sections have completed. Unless Paul and I
misunderstood each other, this scheme of synchronizing object state is
also used in kill_dax(), and I put that comment there the last time
this question was raised. If srcu were being used to synchronize the
liveness of an rcu object like @cxlm or a new ops object, then I would
expect rcu_dereference() + rcu_assign_pointer() around usages of that
object. The liveness of the object in this case is handled outside of
srcu by a kobject reference, or by an inode reference in the case of
kill_dax().

>
> > cdev_device_del(...);
> > down_write(...);
> > up_write(...);
>
> The lock would have to enclose the store to state_in_sysfs, otherwise
> as written it has the same data race problems.

There's no race above. The rule is that any possible observation of
->state_in_sysfs == 1, or rcu_dereference() != NULL, must be flushed.
After that value transitions to zero, or the rcu object is marked for
deletion, an rcu grace period is needed before that memory can be
freed. If an rwsem is used, the only requirement is that any read-side
sections that might have observed ->state_in_sysfs == 1 have ended,
which is why the down_write() / up_write() pair does not need to
surround the cdev_device_del(). It's sufficient to flush the read side
after the state is known to have changed. There are several examples
of rwsem being used as a barrier like this:

drivers/mtd/ubi/wl.c:1432: down_write(&ubi->work_sem);
drivers/mtd/ubi/wl.c-1433- up_write(&ubi->work_sem);

drivers/scsi/cxlflash/main.c:2229: down_write(&cfg->ioctl_rwsem);
drivers/scsi/cxlflash/main.c-2230- up_write(&cfg->ioctl_rwsem);

fs/btrfs/block-group.c:355: down_write(&space_info->groups_sem);
fs/btrfs/block-group.c-356- up_write(&space_info->groups_sem);

fs/btrfs/disk-io.c:4189: down_write(&fs_info->cleanup_work_sem);
fs/btrfs/disk-io.c-4190- up_write(&fs_info->cleanup_work_sem);

net/core/net_namespace.c:629: down_write(&pernet_ops_rwsem);
net/core/net_namespace.c-630- up_write(&pernet_ops_rwsem);