Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

From: Benjamin Herrenschmidt
Date: Mon Jul 02 2018 - 06:22:43 EST

On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
> On Sun, Jul 1, 2018 at 12:16 AM Benjamin Herrenschmidt
> <benh@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > I suspect you didn't read it my entire argument or I wasn't clear
> > enough :-) This is actually the crux of the problem:
> >
> > Yes the object continues to exist. However, the *last* kobject_put to
> > it will no longer be done under whatever higher level locking the
> > subsystem provides (whatever prevents for example concurrent add and
> > removes).
> Well, yes and no.
> Why "no"?
> The last dropping is actually not necessarily that interesting.
> Especially with the sysfs interface, we basically know that you can
> look up the object using RCU (since that's what the filesystem lookup
> does), and that basically means that the refcount is always the final
> serialization mechanism.There is nothing else that can possibly lock
> it.
> So this is where we disagree:
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant. Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.
> So what happens then?
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
> Why?
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
> - get subsystem lock
> - look up object (using the "unless_zero()" model)
> - if you got the object, re-use it, and you're done: drop lock and return
> - otherwise, you know that nobody else can get it either
> - create new object and instantiate it
> - drop lock
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).
> See? The lack of locking at drop time didn't matter. The refcount
> itself serialized things.
> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.
> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.
> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when it no longer
> > contains any children, explicitely rather than implicitely by the
> > virtue of doing the "last" kobject_put.
> No. See above. The reason I think your patch 2/2 is wrong is that is
> actually *breaks* the above model, exactly because of that thing that
> you hatre.
> The explicit removal is actively wrong for the "I want to reuse the
> object" model, exactly because it happens before the refcount has gone
> to zero.
> > > No. That the zero kobject_get() will not result in a warning. It just
> > > does a kref_get(), no warnings anywhere.
> >
> > It's there but it's in refcount:
> >
> > void refcount_inc(refcount_t *r)
> > {
> > WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
> > }
> > EXPORT_SYMBOL(refcount_inc);
> >
> > In fact that's how I started digging into that problem in the first place :-)
> Hey, you are again hitting this because of extra config options.
> Because the refcount_inc() that I found looks like this:
> static inline void refcount_inc(refcount_t *r)
> {
> atomic_inc(&r->refs);
> }
> and has no warning.
> I wonder how many people actually run with REFCOUNT_FULL that warns -
> because it's way too expensive. It's not set in the default Fedora
> config, for example, and that's despite how Fedora tends to set all
> the other debug options.
> So no, it really doesn't warn normally.
> Linus