Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

From: Benjamin Herrenschmidt
Date: Sun Jul 01 2018 - 19:37:02 EST


On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
>
> So this is where we disagree:
>
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
>
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
>
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
>
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant. Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.

This is almost true... the issue there is that gap of time between the
refcount going to 0 and kobject_release() actually doing the
kobject_del. Thus the object will be present in the kset for example,
and thus discoverable, while its refcount is 0.

Yes, you are absolutely right, anything that can find such a dead
object must be using kobject_get_unless_zero(). This is what patch 1/2
does and it fixes the basic crashing race.

But we are then exposed to the duplicate name issue on fast bind/unbind
or add/remove.

> So what happens then?
>
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
>
> Why?
>
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
>
> - get subsystem lock
> - look up object (using the "unless_zero()" model)
> - if you got the object, re-use it, and you're done: drop lock and return
> - otherwise, you know that nobody else can get it either
> - create new object and instantiate it
> - drop lock
>
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).

Except that create will occasionally fail due to duplicate names. So we
need to fix that one way or another (see below)

> See? The lack of locking at drop time didn't matter. The refcount
> itself serialized things.

Oh I absolutely agree with all that when it comes to object lifetime, I
don't completely agree when it comes to visibility to the outside
world, see below.

> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
>
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.

In the case of gluedir that is perfectly fine though. It's a bit like
having an open unlinked file. You want to "unlink" the gluedir
synchronously with the last of its children going away, in order to be
able to create a new one as soon as a new children comes in. However
you don't want to "free" the underlying kobject until after all users
have dropped their reference.

What I don't like is that idea of conflating discoverability and
lifetime. We may just have to agree to disagree on that one, but unless
we can make kernfs&sysfs prevent "finding" of a 0-ref object and thus
allow creating of a duplicate in that case (which isn't that simple a
problem to solve), we still have a problem.

> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
>
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.

I agree with that, absolutely. kobject_get() should probably just be an
unless_zero variant always rather than a special case.

> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when it no longer
> > contains any children, explicitely rather than implicitely by the
> > virtue of doing the "last" kobject_put.
>
> No. See above. The reason I think your patch 2/2 is wrong is that is
> actually *breaks* the above model, exactly because of that thing that
> you hatre.
>
> The explicit removal is actively wrong for the "I want to reuse the
> object" model, exactly because it happens before the refcount has gone
> to zero.

That's indeed where we disagree.

In fact that's the entire point of the gdb_mutex today, ie the code was
clearly written with the *intention* that kobject_put() in device_del
is the last thing happening to the glue dir so that it gets
synchronously removed from the kset and sysfs along with the last
child.

This is the only reason why it makes any sense to be taking that mutex
around kobject_put.

It does so in order to avoid the race alltogether.

And that assumption in the code today is wrong because there could be
another reference being held, delaying the removal from the kset and
sysfs to outside of that mutex.

So kobject debugging's delayed freeing doesn't create a new race here,
it just exposes more widely an existing one.

So to get back to "how do we fix things", I think we agree with patch 1
and we agree that we should probably make kobject_get unconditionally
be a "unless zero" version accross the board, but let's Greg chime in
here before we do anything drastic.

This fixes the use-after-free and crash. However it does open us to a
duplicate name window where we'll fail to re-bind or re-add a device
due to a late un-reference. Kobject debugging makes this very likely to
happen by delaying the freeing by seconds, but without it, there' still
a small window.

There's two ways we've proposed to fix this. One is my patch 2 which
you don't like, which basically is equivalent to saying let's unlink
the gluedir synchronously with the last children going away, and keep
the kobeject refcount strictly as a mean to deal with the liftime of
the structure (freeing it). It uses the existing gdp_mutex and just
makes the existing assumptions in the code work.

The other way is what you prefer, which I agree is a good idea, but it
will take me at least some time to figure out a way to actually
implement it, and it is to essentially change sysfs so that objects
that have a refount of 0 aren't discoverable anymore, so that we don't
name-clash on trying to create a new one.

It's tricky because all the name business is handled by kernfs which
has no idea about kobjects of krefs. My initial idea would be to add an
optional pointer to a kref to kernfs entries, so it can check the
refcount on lookups and name collision checks. I'm not entirely certain
how to work out the locking/ordering for this though, we might need to
add memory barriers in kobject_put to match.

> > > No. That the zero kobject_get() will not result in a warning. It just
> > > does a kref_get(), no warnings anywhere.
> >
> > It's there but it's in refcount:
> >
> > void refcount_inc(refcount_t *r)
> > {
> > WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
> > }
> > EXPORT_SYMBOL(refcount_inc);
> >
> > In fact that's how I started digging into that problem in the first place :-)
>
> Hey, you are again hitting this because of extra config options.

Ah possibly. I enabled every debug option I could find :-)
>
> Because the refcount_inc() that I found looks like this:
>
> static inline void refcount_inc(refcount_t *r)
> {
> atomic_inc(&r->refs);
> }
>
> and has no warning.
>
> I wonder how many people actually run with REFCOUNT_FULL that warns -
> because it's way too expensive. It's not set in the default Fedora
> config, for example, and that's despite how Fedora tends to set all
> the other debug options.
>
> So no, it really doesn't warn normally.

Yup, you're right.

Cheers,
Ben.