Re: [PATCH v3] drivers/base/core: refcount kobject and bus on device attribute read / store

From: Luis Chamberlain
Date: Wed Jun 23 2021 - 12:14:41 EST


On Wed, Jun 23, 2021 at 10:32:53AM +0200, Greg KH wrote:
> On Tue, Jun 22, 2021 at 05:36:30PM -0700, Luis Chamberlain wrote:
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 4a8bf8cda52b..f69aa040b56d 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -2042,28 +2042,56 @@ EXPORT_SYMBOL(dev_driver_string);
> > static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
> > char *buf)
> > {
> > - struct device_attribute *dev_attr = to_dev_attr(attr);
> > - struct device *dev = kobj_to_dev(kobj);
> > + struct device_attribute *dev_attr;
> > + struct device *dev;
> > + struct bus_type *bus = NULL;
> > ssize_t ret = -EIO;
> >
> > + dev = get_device(kobj_to_dev(kobj));
> > + if (dev->bus) {
>
> No need to test for this, right?

dev_uevent() checks for dev->bus, so I thought that was a clear
indication this isn't always set.

>
> > + bus = bus_get(dev->bus);
> > + if (!bus)
> > + goto out;
>
> How can this happen?

device_add() calls bus_add_device(), and the bus_add_device()
implementation seems to have a check for the bus returned from bus_get() as
well?

> > + }
> > +
> > + dev_attr = to_dev_attr(attr);
>
> Why did you move this call to way down here? It's fine in the first
> line of the function above as-is.

I had done that when I had the device kobject reference incremented.
I'll move it back up.

> > if (dev_attr->show)
> > ret = dev_attr->show(dev, dev_attr, buf);
> > if (ret >= (ssize_t)PAGE_SIZE) {
> > printk("dev_attr_show: %pS returned bad count\n",
> > dev_attr->show);
> > }
> > +
> > + bus_put(bus);
>
> You are incrementing the bus, which is nice, but I do not understand why
> it is needed. What is causing the bus to go away _before_ the devices
> are going away? Busses almost never are removed from the system, and if
> they are, all devices associated with them are removed first. So I do
> not think you need to increment anything with that here.

You tell me. It was your suggestion as a replacement for the type
specific lock, in the zram case, its a block device so I was using
bdgrab().

> But step back, what prevented the kobject from decrementing between the
> call to dev_attr_show() and the first line of the function?
>
> The kobject needs to be incremented before show is called, right? Or
> does kernfs handle this properly already?

kernfs / sysfs is in not struct device specific, it has no semantics for
modules or even devices. The aspect of kernfs which deals with locking
a struct kernfs_node is kernfs_get_active(). The refcount there of
importance is the call to atomic_inc_unless_negative(&kn->active).

struct kernfs_node *kernfs_get_active(struct kernfs_node *kn)
{
if (unlikely(!kn))
return NULL;

if (!atomic_inc_unless_negative(&kn->active))
return NULL;

if (kernfs_lockdep(kn))
rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP_);
return kn;
}

BTW this was also precisely where I had suggested to extend the
kernfs_node with an optional kn->owner and if set we try_module_get() to
prevent the deadlock case if the module exit routine also happens to use
a lock also used by a sysfs attribute store/read.

The flow would be (from a real live gdb crash backtrace from an original
bug report from a customer):

write system call -->
ksys_write ()
vfs_write() -->
__vfs_write() -->
kernfs_fop_write() (note now this is kernfs_fop_write_iter()) -->
sysfs_kf_write() -->
dev_attr_store() -->
null reference

The dereference is because the dev_attr_store() call which we are
modifying

LINE-001 static ssize_t dev_attr_store(struct kobject *kobj, struct attribute *attr,
LINE-002 const char *buf, size_t count)
LINE-003 {
LINE-004 struct device_attribute *dev_attr = to_dev_attr(attr);
LINE-005 struct device *dev = kobj_to_dev(kobj);
LINE-006 ssize_t ret = -EIO;
LINE-007
LINE-008 if (dev_attr->store)
LINE-009 ret = dev_attr->store(dev, dev_attr, buf, count);
...
}

The race happens because a sysfs store / read can be triggered, the CPU
could be preempted after LINE-008, and the ->store is gone by LINE-009.
This begs the question if kernfs_fop_write_iter() or sysfs protects this
somehow? Let's see.

For kernfs we have:

static ssize_t kernfs_fop_write_iter(struct kiocb *iocb, struct iov_iter *iter)
{
struct kernfs_open_file *of = kernfs_of(iocb->ki_filp);
...
mutex_lock(&of->mutex);
if (!kernfs_get_active(of->kn)) {
mutex_unlock(&of->mutex);
len = -ENODEV;
goto out_free;
}

ops = kernfs_ops(of->kn);
if (ops->write)
len = ops->write(of, buf, len, iocb->ki_pos);
else
len = -EINVAL;

kernfs_put_active(of->kn);
mutex_unlock(&of->mutex);
...
}

And the write call here is a syfs calls:

static ssize_t sysfs_kf_write(struct kernfs_open_file *of, char *buf,
size_t count, loff_t pos)
{
const struct sysfs_ops *ops = sysfs_file_ops(of->kn);
struct kobject *kobj = of->kn->parent->priv;

if (!count)
return 0;

return ops->store(kobj, of->kn->priv, buf, count);
}

As we have observed already, the active reference obtained through
kernfs_get_active() was for the struct kernfs_node. Sure, the
ops->write() is valid, in this case it sysfs_kf_write().

sysfs isn't doing any active reference check for the kobject device
attribute as it doesn't care for them. So syfs calls
dev_attr_store(), but the dev_attr_store() is not preventing the device
attribute ops to go fishing, and we destroy them while we're destroying
the device on module removal.

In the zram case, the abstraction is worse given the device attributes
are are created on behalf of the driver as group attributes. The zram
disksize_store() for instance will use:

static inline struct zram *dev_to_zram(struct device *dev)
{
return (struct zram *)dev_to_disk(dev)->private_data;
}

static ssize_t disksize_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t len)
{
...
struct zram *zram = dev_to_zram(dev);
...
}

Nothing is preventing an active block group sysfs attribute from going
fishing, in the dev_to_disk() is the struct gendisk, and that can get
free'd on module exit while the sysfs attribute is about to be poked.

> I don't have the time at the
> moment to dig into this, but maybe this isn't an issue? Look at how
> sysfs handles the kobject lookups for kernfs nodes to see if all is sane
> there, maybe there isn't an issue?

The deadlock issue I noted and am fixing with the zram driver was one
real live situation a customer reported, however a null dereference
illustrated above turned out to also be in the logs of the same bug
report, just that it happened in other situations, so what Minchan
claimed as theoretical actually indeed was a real issue and I hope
that the above illustrates this better.

Luis