Re: [PATCH v8 11/12] zram: fix crashes with cpu hotplug multistate

From: Luis Chamberlain
Date: Thu Oct 14 2021 - 20:23:05 EST


On Fri, Oct 15, 2021 at 07:52:04AM +0800, Ming Lei wrote:
> On Thu, Oct 14, 2021 at 01:24:32PM -0700, Luis Chamberlain wrote:
> > On Thu, Oct 14, 2021 at 10:11:46AM +0800, Ming Lei wrote:
> > > On Thu, Oct 14, 2021 at 09:55:48AM +0800, Ming Lei wrote:
> > > > On Mon, Sep 27, 2021 at 09:38:04AM -0700, Luis Chamberlain wrote:
> > >
> > > ...
> > >
> > > >
> > > > Hello Luis,
> > > >
> > > > Can you test the following patch and see if the issue can be addressed?
> > > >
> > > > Please see the idea from the inline comment.
> > > >
> > > > Also zram_index_mutex isn't needed in zram disk's store() compared with
> > > > your patch, then the deadlock issue you are addressing in this series can
> > > > be avoided.
> > > >
> > > >
> > > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > > index fcaf2750f68f..3c17927d23a7 100644
> > > > --- a/drivers/block/zram/zram_drv.c
> > > > +++ b/drivers/block/zram/zram_drv.c
> > > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > > >
> > > > /* Make sure all the pending I/O are finished */
> > > > fsync_bdev(bdev);
> > > > - zram_reset_device(zram);
> > > >
> > > > pr_info("Removed device: %s\n", zram->disk->disk_name);
> > > >
> > > > del_gendisk(zram->disk);
> > > > +
> > > > + /*
> > > > + * reset device after gendisk is removed, so any change from sysfs
> > > > + * store won't come in, then we can really reset device here
> > > > + */
> > > > + zram_reset_device(zram);
> > > > +
> > > > blk_cleanup_disk(zram->disk);
> > > > kfree(zram);
> > > > return 0;
> > > > @@ -2073,7 +2079,12 @@ static int zram_remove_cb(int id, void *ptr, void *data)
> > > > static void destroy_devices(void)
> > > > {
> > > > class_unregister(&zram_control_class);
> > > > +
> > > > + /* hold the global lock so new device can't be added */
> > > > + mutex_lock(&zram_index_mutex);
> > > > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > > + mutex_unlock(&zram_index_mutex);
> > > > +
> > >
> > > Actually zram_index_mutex isn't needed when calling zram_remove_cb()
> > > since the zram-control sysfs interface has been removed, so userspace
> > > can't add new device any more, then the issue is supposed to be fixed
> > > by the following one line change, please test it:
> > >
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index fcaf2750f68f..96dd641de233 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1985,11 +1985,17 @@ static int zram_remove(struct zram *zram)
> > >
> > > /* Make sure all the pending I/O are finished */
> > > fsync_bdev(bdev);
> > > - zram_reset_device(zram);
> > >
> > > pr_info("Removed device: %s\n", zram->disk->disk_name);
> > >
> > > del_gendisk(zram->disk);
> > > +
> > > + /*
> > > + * reset device after gendisk is removed, so any change from sysfs
> > > + * store won't come in, then we can really reset device here
> > > + */
> > > + zram_reset_device(zram);
> > > +
> > > blk_cleanup_disk(zram->disk);
> > > kfree(zram);
> > > return 0;
> >
> > Sorry but nope, the cpu multistate issue is still present and we end up
> > eventually with page faults. I tried with both patches.
>
> In theory disksize_store() can't come in after del_gendisk() returns,
> then zram_reset_device() should cleanup everything, that is the issue
> you described in commit log.
>
> We need to understand the exact reason why there is still cpuhp node
> left, can you share us the exact steps for reproducing the issue?
> Otherwise we may have to trace and narrow down the reason.

See my commit log for my own fix for this issue.

Luis