Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

From: Linus Torvalds
Date: Sun Jul 01 2018 - 13:05:25 EST


On Sun, Jul 1, 2018 at 12:16 AM Benjamin Herrenschmidt
<benh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> I suspect you didn't read it my entire argument or I wasn't clear
> enough :-) This is actually the crux of the problem:
>
> Yes the object continues to exist. However, the *last* kobject_put to
> it will no longer be done under whatever higher level locking the
> subsystem provides (whatever prevents for example concurrent add and
> removes).

Well, yes and no.

Why "no"?

The last dropping is actually not necessarily that interesting.
Especially with the sysfs interface, we basically know that you can
look up the object using RCU (since that's what the filesystem lookup
does), and that basically means that the refcount is always the final
serialization mechanism.There is nothing else that can possibly lock
it.

So this is where we disagree:

> Thus in that scenario the "last minute" kobject_release() done by the
> last kobject_put() will be effectively unprotected from for example the
> gdp_mutex (in the case of the gluedirs) or whatever other locking the
> subsystem owning the kobject is using to avoid making that "refount 0"
> object "discoverable".

No. *Fundamentally*, there is only one thing that protects that
object: the refcount.

And my argument is that anything that has this model (which means
anything that has any sysfs linkage, which pretty much means any
kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
had an existing stable pointer and just wants to increase the refcount
(ie "already got a reference throuigh one of my data structures that
use the lock")

But if you have that model, that means that the "last drop" is
actually almost totally irrelevant. Because it doesn't matter if it's
done inside the lock or not - you know that once it has been done,
that object is entirely gone. It's not discoverable any more -
regardless of locking. The discoverability is basically controlled
entirely by the refcount.

So what happens then?

The locking isn't important for the last release, but it *is*
important for new object *creation*.

Why?

The refcount means that once an object is gone, it's gone for
*everyone*. It's a one-way thing, and it's thread-safe. So the code
that does *creation* can do this:

- get subsystem lock
- look up object (using the "unless_zero()" model)
- if you got the object, re-use it, and you're done: drop lock and return
- otherwise, you know that nobody else can get it either
- create new object and instantiate it
- drop lock

and this means that you create a new object IFF the old object had its
refcount drop to zero. So you always have exactly one copy (or no copy
at all, in between the last drop and the creation of the new one).

See? The lack of locking at drop time didn't matter. The refcount
itself serialized things.

So the above is what I *think* the "glue_dir" logic should be. No need
for any other count. Just re-use the old glue dir if you can find it,
and create a new one if you can't.

The one important thing is that the object lookup above needs to use
the lookup needs to find the object all the way until the refcount has
become zero. And that actually means that the object MUST NOT be
removed from the object lists until *after* the refcount has been
decremented to zero. Which is actually why that "automatic cleanup"
that you hate is actually an integral and important part of the
process: removing the object *before* the refcount went to zero is
broken, because that means that the "look up object" phase can now
miss an object that still has a non-zero refcount.

> So my patch 1/2 prevents us from finding the old dying object (and thus
> from crashing) but replaces this with the duplicate name problem.

So I absolutely agree with your patch 1/2. My argument against it is
actually that I think the "unless_zero" thing needs to be more
universal.

> My patch 2/2 removes that second problem by ensuring we remove the
> object from sysfs synchronously in device_del when it no longer
> contains any children, explicitely rather than implicitely by the
> virtue of doing the "last" kobject_put.

No. See above. The reason I think your patch 2/2 is wrong is that is
actually *breaks* the above model, exactly because of that thing that
you hatre.

The explicit removal is actively wrong for the "I want to reuse the
object" model, exactly because it happens before the refcount has gone
to zero.

> > No. That the zero kobject_get() will not result in a warning. It just
> > does a kref_get(), no warnings anywhere.
>
> It's there but it's in refcount:
>
> void refcount_inc(refcount_t *r)
> {
> WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
> }
> EXPORT_SYMBOL(refcount_inc);
>
> In fact that's how I started digging into that problem in the first place :-)

Hey, you are again hitting this because of extra config options.

Because the refcount_inc() that I found looks like this:

static inline void refcount_inc(refcount_t *r)
{
atomic_inc(&r->refs);
}

and has no warning.

I wonder how many people actually run with REFCOUNT_FULL that warns -
because it's way too expensive. It's not set in the default Fedora
config, for example, and that's despite how Fedora tends to set all
the other debug options.

So no, it really doesn't warn normally.

Linus