Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate

From: Minchan Kim
Date: Mon Mar 22 2021 - 18:12:49 EST


On Mon, Mar 22, 2021 at 08:41:56PM +0000, Luis Chamberlain wrote:
> On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote:
> > On Fri, Mar 19, 2021 at 07:09:24PM +0000, Luis Chamberlain wrote:
> > > Indeed one issue is a consequence of the other but a bit better
> > > description can be put together for both separately.
> > >
> > > The warning comes up when cpu hotplug detects that the callback
> > > is being removed, but yet "instances" for multistate are left behind,
> > > ie, not removed. CPU hotplug multistate allows us to dedicate a callback
> > > for zram so that it gets called every time a CPU hot plugs or unplugs.
> > > In the zram driver's case we want to trigger the callback per each
> > > struct zcomp, one is used per supported and used supported compression
> >
> > you meant "each one is used per supported compression"?
>
> Yup.
>
> > > algorithm. The zram driver allocates a struct zcomp for an compression
> > > algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP
> > > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may
> > > have different zram devices, zram devices which use the same compression
> > > algorithm share the same struct zcomp. The "multi" in CPU hotplug multstate
> >
> > It allocates each own zcomp, not sharing even though zram devices
> > use the same compression algorithm.
>
> Right.
>
> > > refers to these different struct zcomp instances, each of these will
> > > have the callback routine registered run for it. The kernel's CPU
> > > hotplug multistate keeps a linked list of these different structures
> > > so that it will iterate over them on CPU transitions. By default at
> > > driver initialization we will create just one zram device (num_devices=1)
> > > and a zcomp structure then set for the lzo-rle comrpession algorithm.
> > > At driver removal we first remove each zram device, and so we destroy
> > > the struct zcomp. But since we expose sysfs attributes to create new
> > > devices or reset / initialize existing ones, we can easily end up
> > > re-initializing a struct zcomp before the exit routine of the module
> > > removes the cpu hotplug callback. When this happens the kernel's
> > > CPU hotplug will detect that at least one instance (struct zcomp for
> > > us) exists.
> >
> > It's very helpful to understand. Thanks a lot!S
> >
> > So IIUC, it's fundamentally race between destroy_devices(i.e., module
> > unload) and every sysfs knobs in zram. Isn't it?
>
> I would not call it *every* syfs knob as not all deal with things which
> are related to CPU hotplug multistate, right? Note that using just
> try_module_get() alone (that is the second patch only, does not fix the
> race I am describing above).

It wouldn't be CPU hotplug multistate issue but I'd like to call it
as more "zram instance race" bug.
What happens in this case?

CPU 1 CPU 2

destroy_devices
..
compact_store()
zram = dev_to_zram(dev);
idr_for_each(zram_remove_cb
zram_remove
..
kfree(zram)
down_read(&zram->init_lock);


CPU 1 CPU 2

hot_remove_store
compact_store()
zram = dev_to_zram(dev);
zram_remove
kfree(zram)
down_read(&zram->init_lock);

So, for me we need to close the zram instance create/removal
with sysfs rather than focusing on CPU hotplug issue.
Maybe, we could reuse zram_index_mutex with modifying it with
rw_semaphore. What do you think?

>
> There are two issues.
>
> > Below specific example is one of them and every sysfs code in zram
> > could access freed object(e.g., zram->something).
> > And you are claiming there isn't good way to fix it in kernfs(please
> > add it in the description, too) even though it's general problem.
>
> Correct, the easier route would have been through the block layer as
> its the one adding our syfs attributes for us but even it canno deal
> with this race on behalf of drivers given the currently exposed
> semantics on kernfs.
>
> > (If we had, we may detect the race and make zram_remove_cb fails
> > so unloading modulies fails, finally)
> >
> > So, shouldn't we introcude a global rw_semaphore?
> >
> > destroy_devices
> > class_unregister
> > down_write(&zram_unload);
> > idr_for_each(zram_remove_cb);
> > up_write(&zram_unload);
> >
> > And every sysfs code hold the lock with down_read in the first place
> > and release the lock right before return. So, introduce a zram sysfs
> > wrapper to centralize all of locking logic.
>
> Actually that does work but only if we use write lock attempts so to
> be able to knock two birds with one stone, so to address the deadlock
> with sysfs attribute removal. We're not asking politely to read some
> value off of a zram devices with these locks, we're ensuring to not
> run if our driver is going to be removed, so replacing the
> try_module_get() with something similar.
>
> > Does it make sense?
> >
> > >
> > > And so we can have:
> > >
> > > static void destroy_devices(void)
> > > {
> > > class_unregister(&zram_control_class);
> > > idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> > > zram_debugfs_destroy();
> > > idr_destroy(&zram_index_idr);
> > > unregister_blkdev(zram_major, "zram");
> > > cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> > > }
> > >
> > > While destroy_devices() runs we can have this race:
> > >
> > >
> > > CPU 1 CPU 2
> > >
> > > class_unregister(...);
> > > idr_for_each(...);
> > I think sysfs was already remove here.
> > > zram_debugfs_destroy();
> > > disksize_store(...)
> > > idr_destroy(...);
> > > unregister_blkdev(...);
> > > cpuhp_remove_multi_state(...);
> >
> > So,
> >
> > CPU 1 CPU 2
> >
> > destroy_devices
> > ..
> > disksize_store()
> > zcomp_create
> > zcomp_init
> > idr_for_each(zram_remove_cb
> > zram_remove
> > zram_reset
> > zcomp_destroy
> > cpuhp_state_remove_instance
> > cpuhp_state_add_instance
> > ..
> >
> > cpuhp_remove_multi_state(...)
> > Detect!
> > kernel: Error: Removing state 63 which has instances left.
>
> Yup true.
>
> > > The warning comes up on cpuhp_remove_multi_state() when it sees
> > > that the state for CPUHP_ZCOMP_PREPARE does not have an empty
> > > instance linked list.
> > >
> > > After the idr_destory() its also easy to run into a crash. The
> > > above predicament also leads to memory leaks.
> > >
> > > And so we need to have a state machine for when we're up and when
> > > we're going down generically.
> > >
> > > Let me know if it makes sense now, if so I can amend the commit log
> > > accordingly.
> >
> > Yub, It's much better. Let's have it in the commit log.
>
> Sure OK, well the following is what I end up with. With this approach
> only one patch is needed.
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a711a2e2a794..3b86ee94309e 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -41,6 +41,15 @@ static DEFINE_IDR(zram_index_idr);
> /* idr index must be protected */
> static DEFINE_MUTEX(zram_index_mutex);
>
> +/*
> + * Protects against:
> + * a) Hotplug cpu multistate races against compression algorithm removals
> + * and additions prior to its removal of the zram CPU hotplug callback
> + * b) Deadlock which is possible when sysfs attributes are used and we
> + * share a lock on removal.
> + */
> +DECLARE_RWSEM(zram_unload);
> +
> static int zram_major;
> static const char *default_compressor = CONFIG_ZRAM_DEF_COMP;
>
> @@ -1723,6 +1732,9 @@ static ssize_t disksize_store(struct device *dev,
> if (!disksize)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> down_write(&zram->init_lock);
> if (init_done(zram)) {
> pr_info("Cannot change disksize for initialized device\n");
> @@ -1748,6 +1760,7 @@ static ssize_t disksize_store(struct device *dev,
> zram->disksize = disksize;
> set_capacity_and_notify(zram->disk, zram->disksize >> SECTOR_SHIFT);
> up_write(&zram->init_lock);
> + up_write(&zram_unload);
>
> return len;
>
> @@ -1755,6 +1768,7 @@ static ssize_t disksize_store(struct device *dev,
> zram_meta_free(zram, disksize);
> out_unlock:
> up_write(&zram->init_lock);
> + up_write(&zram_unload);
> return err;
> }
>
> @@ -1773,14 +1787,17 @@ static ssize_t reset_store(struct device *dev,
> if (!do_reset)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> zram = dev_to_zram(dev);
> bdev = zram->disk->part0;
>
> mutex_lock(&bdev->bd_mutex);
> /* Do not reset an active device or claimed device */
> if (bdev->bd_openers || zram->claim) {
> - mutex_unlock(&bdev->bd_mutex);
> - return -EBUSY;
> + len = -EBUSY;
> + goto out;
> }
>
> /* From now on, anyone can't open /dev/zram[0-9] */
> @@ -1793,7 +1810,10 @@ static ssize_t reset_store(struct device *dev,
>
> mutex_lock(&bdev->bd_mutex);
> zram->claim = false;
> +
> +out:
> mutex_unlock(&bdev->bd_mutex);
> + up_write(&zram_unload);
>
> return len;
> }
> @@ -2015,10 +2035,15 @@ static ssize_t hot_add_show(struct class *class,
> {
> int ret;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> mutex_lock(&zram_index_mutex);
> ret = zram_add();
> mutex_unlock(&zram_index_mutex);
>
> + up_write(&zram_unload);
> +
> if (ret < 0)
> return ret;
> return scnprintf(buf, PAGE_SIZE, "%d\n", ret);
> @@ -2041,6 +2066,9 @@ static ssize_t hot_remove_store(struct class *class,
> if (dev_id < 0)
> return -EINVAL;
>
> + if (!down_write_trylock(&zram_unload))
> + return -ENODEV;
> +
> mutex_lock(&zram_index_mutex);
>
> zram = idr_find(&zram_index_idr, dev_id);
> @@ -2053,6 +2081,7 @@ static ssize_t hot_remove_store(struct class *class,
> }
>
> mutex_unlock(&zram_index_mutex);
> + up_write(&zram_unload);
> return ret ? ret : count;
> }
> static CLASS_ATTR_WO(hot_remove);
> @@ -2078,12 +2107,14 @@ static int zram_remove_cb(int id, void *ptr, void *data)
>
> static void destroy_devices(void)
> {
> + down_write(&zram_unload);
> class_unregister(&zram_control_class);
> idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
> zram_debugfs_destroy();
> idr_destroy(&zram_index_idr);
> unregister_blkdev(zram_major, "zram");
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> }
>
> static int __init zram_init(void)
> @@ -2095,10 +2126,13 @@ static int __init zram_init(void)
> if (ret < 0)
> return ret;
>
> + down_write(&zram_unload);
> +
> ret = class_register(&zram_control_class);
> if (ret) {
> pr_err("Unable to register zram-control class\n");
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> return ret;
> }
>
> @@ -2108,6 +2142,7 @@ static int __init zram_init(void)
> pr_err("Unable to get major number\n");
> class_unregister(&zram_control_class);
> cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
> + up_write(&zram_unload);
> return -EBUSY;
> }
>
> @@ -2119,10 +2154,12 @@ static int __init zram_init(void)
> goto out_error;
> num_devices--;
> }
> + up_write(&zram_unload);
>
> return 0;
>
> out_error:
> + up_write(&zram_unload);
> destroy_devices();
> return ret;
> }

As I mentioned above, it didn't close the race I gave as examples
unless I miss something. Let's discuss further.

Thank you!