Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

From: Benjamin Herrenschmidt
Date: Mon Jul 02 2018 - 06:24:06 EST


On Mon, 2018-07-02 at 09:36 +1000, Benjamin Herrenschmidt wrote:
> > No. See above. The reason I think your patch 2/2 is wrong is that is
> > actually *breaks* the above model, exactly because of that thing that
> > you hatre.
> >
> > The explicit removal is actively wrong for the "I want to reuse the
> > object" model, exactly because it happens before the refcount has gone
> > to zero.
>
> That's indeed where we disagree.

Sooo...

I had some quality time with my good friend Oban 14 last night,
thinking through this and during the resulting hangover today. I've
tried very hard to "get" your perspective, but after another dive
through the code, I'm afraid I still think patch 2/2 is the right thing
to do.

(Let's ignore patch 1/1 for now and assume we get that part right)

Let me try one last ditch attempt to convince you using maybe a
different perspective: this is how sysfs is intended to work and how
the device model already does everywhere else except the gluedirs.

Sooo... let's start pulling the string bit by bit...

I think you agree since (I'm not going to try to quote you here) you
did mention adding/removing need to generally be protected by higher
level locks in whatever subsystem owns the kobjects. I don't think
there's an argument here.

This is hinted by some comments (among others) in sysfs/dir.c about the
kobject owner being responsible to ensure that removal doesn't race
with any operation, which is hard to do when such removal doesn't
happen under subsystem internal locks.

Now let's look at how things are handled for a normal struct device
kobject.

The device core provides device_add and device_del functions. Those are
completely orthogonal to the object lifetime controlled by
device_initialize (creation) and device_put. And for good reasons (yes,
there are helpers, device_register/unregister that combine them, but
those are just that... helpers).

Why ? For the exact same reason mentioned above: Along with all the
driver binding/unbinding business etc... those functions ensure that
the kobject of the device is added/removed from sysfs under all the
necessary locks in the device model, so that they cannot race with each
other.

This guarantees among others that it is possible to do a new device_add
for the same device immediately after device_del, without clashing with
a duplicate sysfs name, *even* if something still holds a reference to
the kobject for any reason (and there are plenty of these, see the cdev
example below).

The actual lifetime of the struct device is handled *separately* by
device_get/put which are just wrappers on kobject_get/put.

This is the same with a cdev. cdev_add and cdev_del or their friends
cdev_device_add/del do the same thing. The chardev case is interesting
specifically because it is much more common for these to persist
(lifetime) beyond cdev_del. You just need userspace to have an open
file descriptor to the chardev. Drivers have to deal with it of course,
they need to handle things like read/write/ioctl being called on
already open files after remove/del/unbind, and hopefully do so.

*But* it's also possible to create a new identical cdev with the same
sysfs name immediately after cdev_device_del even if the old one still
has open instanced precisely because we separate the sysfs presence,
handled through the device model internal locking, from the object
lifetime.

Now, going back to our gluedirs, in essence those are just more objects
owned by the device core. More specifically, they need to be created
synchronously with the first child device of a class added under a
given parent, and removed synchronously with the last child device of
that class under that same parent.

Delaying that removal from sysfs simply means that we now are unable to
re-add the same device after removing it for some arbitrary amount of
time, because something might still hold a kobject reference to the
gluedir for any reason, it doesn't matter.

It also means that the deletion from sysfs is no longer synchronized
with addition since it no longer happens from within the context of
decice_del which is the "owner" of the glue dir, under whatever locks
it uses to ensure there is no races between multiple devices addition
removal etc...

In fact, if you look at the code and the way the gdp_mutex is used, I
think this is *exactly* the intention of the code. That the
kobject_put() done inside device_del() is the last one, done under that
mutex, thus removing the gluedir synchronously and we have no race.

However, that code just fails to take into account the delayed removal
by kobject debugging and the possibility of another reference existing.

Now it's possible there there can be no other references of a gluedir
today, to be frank I haven't audited the entire code base to figure out
if anything can possibly hold one that isn't a child of that gluedir,
so if indeed there can't be any and the only references to the kref
possible *ever* on a gluedir are its children, then it's possibe that
the assumption holds, except with kobject debugging.

However, even if that is the case, I call it fragile. I think we should
ensure, which is what patch 2/2 does, that regardless of "when" the
object is actually freed, and regardless of what other references might
exist, the gluedir behaves just like other struct device or cdev, and
gets removed from sysfs synchronously, along with the last child, with
all the necessary device core mutexes held, as to never race with an
addition and as to never trigger the duplicate name problem.

Now, I know why you dislike patch 2/2, because it adds another counter,
and there's something "itchy" about having a refcount and that
childcount.

We don't absolutely need that childcount. It could be that we have ways
to ask sysfs whether there are any children left in that directory,
which would work as well, I just haven't really figured out how, and a
child count felt like a trivial way to solve the problem. It's
protected by the gdp_mutex so it's safe.

There are precedents to such dual counters, such as mm_users vs
mm_count. There are precedents to separating file system visibility
from lifetime, every fs does it with unlink() when there are still open
files, so I think overall, what I'm trying to achieve is just pretty
much standard practice: unlink the gluedir exactly when it needs to be
unlinked so a new one can be created when needed, and deal with
lifetime of the actual structure the usual way which can be delayed.

Now if this doesn't convince you, I'm afraid nothing will ;-)

I'm curious to hear from Greg though.

Cheers,
Ben.