Re: [PATCH 2.6.28-rc5 01/11] kmemleak: Add the base support

From: Paul E. McKenney
Date: Sun Dec 07 2008 - 18:20:05 EST


On Sat, Dec 06, 2008 at 11:07:30PM +0000, Catalin Marinas wrote:
> On Thu, 2008-12-04 at 08:55 -0800, Paul E. McKenney wrote:
> > On Thu, Dec 04, 2008 at 12:14:26PM +0000, Catalin Marinas wrote:
> > > On Wed, 2008-12-03 at 10:12 -0800, Paul E. McKenney wrote:
> > > > On Thu, Nov 20, 2008 at 11:30:34AM +0000, Catalin Marinas wrote:
> > > > > This patch adds the base support for the kernel memory leak
> > > > > detector. It traces the memory allocation/freeing in a way similar to
> > > > > the Boehm's conservative garbage collector, the difference being that
> > > > > the unreferenced objects are not freed but only shown in
> > > > > /sys/kernel/debug/memleak. Enabling this feature introduces an
> > > > > overhead to memory allocations.
> > > >
> > > > I have some concerns about your locking/RCU design, please see
> > > > interspersed questions and comments.
> [...]
> > > I'll comment below as well but I'll first show my logic which I added as
> > > comment at the top of the memleak.c file:
> >
> > This is very helpful, thank you! (And please accept my apologies if I
> > missed seeing it in my earlier review.)
>
> You haven't missed it, that's the first time I posted this text.

Whew! ;-)

> > > * The following locks are used by kmemleak:
> > > *
> > > * - object_list_lock (spinlock): protects the object_list. This is the main
> > > * list holding the metadata (struct memleak_object) for the allocated
> > > * objects. The memleak_object structures are added to the list in the
> > > * create_object() function called from the memleak_alloc() callback. They
> > > * are removed from the list in put_object() if the object->use_count is 0
> >
> > This part sounds good. I would also add that once object->use_count is
> > zero, no one is allowed to increment it. Also, attempts to increment
> > object->use_count must be under the protection of rcu_read_lock(),
> > correct?
>
> So, to make sure I understand it correctly, the rcu_read_lock() is
> needed to protect between the point where the object pointer was
> obtained to the get_object() point.

More generally, it ensures that an RCU-protected object stays around
until the matching rcu_read_unlock() is reached. The guarantee call_rcu()
provides is to invoke the specified function only after all pre-existing
RCU read-side critical sections have completed, in other words, only
after any task that previously executed rcu_read_lock() has executed
the matching rcu_read_unlock().

> Would it also work if
> spin_lock_irqsave(object_list_lock) is used instead of rcu_read_lock()?
> The call_rcu() in put_object is bracketed with object_list_lock.

For some RCU implementations, but only by accident.

Just please don't do this!

Instead, consider wrapping rcu_read_lock() and rcu_read_unlock() around
the region of interest, if need be. The RCU read-side primitives are
quite cheap, so you are losing almost nothing (and in some cases exactly
nothing) by using them.

> BTW, I'll have a look if I could remove an object from the object_list
> in delete_object() rather than waiting until put_object().

If by doing this, you exclude all readers while removing, you are set.

> > > * - object_tree_lock (rwlock): protects the object_tree_root. When the
> > > * metadata is created in create_object(), it is added to the object prio
> > > * search tree and removed in delete_object() with this lock held
> > > * (write_lock). This lock is also acquired (read_lock) in
> > > * find_and_get_object() when an object needs to be looked up by a pointer
> > > * value (either during scanning or when changing its properties like
> > > * marking as false positive)
> >
> > Looks OK. I must confess that I am a bit fuzzy on the purpose of
> > object_tree_root vs. object_list.
>
> object_list holds all the memleak_objects in the system and it is
> traversed when preparing the scanning and also when reporting the leaks.
>
> object_tree_root is used to look-up memleak_objects by the allocated
> memory block. In the past, this used to be a radix tree (with some
> lockdep problems) and later a hash. I now use a prio tree because it
> allows pointer ranges.
>
> Kmemleak could probably iterate over the object_tree_root when reporting
> but it is more convenient to report the leaks in the order they were
> allocated (preserved by object_list) since one leak may trigger many
> subsequent reports but they disappear once the first one is solved.

Very helpful, thank you!

> > > * - memleak_object.lock (spinlock): protects a memleak_object. Modifications
> > > * of the metadata (e.g. count) are protected by this lock. Note that some
> > > * members of this structure may be protected by other means (atomic or
> > > * object_list lock). This lock is also held when scanning the corresponding
> > > * object to avoid the kernel freeing it via the memleak_free() callback.
> > > * This is less heavyweight than holding a global lock like object_list_lock
> > > * during scanning
> >
> > OK, holding an object's lock can protect that object from deletion,
> > but only after you actually acquire the lock. There must be some other
> > mechanism preventing the object from being freed during the actual
> > acquisition of the lock.
> >
> > Now this might be the object_list_lock, object_tree_lock, RCU, or some
> > combination of the three, for example it might depend on how that object
> > is looked up.
>
> Correct. I'll have to review this again.

Fair enough!

> > > * Freeing a memleak_object is done via an RCU callback invoked from
> > > * put_object() when its use_count is 0 and after removing it from the
> > > * object_list. One of the reasons for the RCU is to delay the freeing and
> > > * avoid a recursive call into the allocator via kmem_cache_free(). Another
> > > * reason is to allow lock-less object_list traversal during memleak_scan().
> >
> > I did figure out the lock-less object_list traversal, but totally missed
> > the fact that you were using RCU to prevent infinite recursion. Cute!
>
> It wasn't documented, so pretty hard to guess.

I guess I don't feel quite so bad, then. ;-)

> > Also, the memleak_object must have been removed from object_tree before
> > its use_count can possibly go to 0, correct?
>
> Yes.

Good!

> > > > > +static void free_object_rcu(struct rcu_head *rcu)
> > > > > +{
> > > > > + struct hlist_node *elem, *tmp;
> > > > > + struct memleak_scan_area *area;
> > > > > + struct memleak_object *object =
> > > > > + container_of(rcu, struct memleak_object, rcu);
> > > > > +
> > > > > + /* once use_count is 0, there is no code accessing the object */
> > > >
> > > > OK, so we won't pass free_object_rcu() to call_rcu() until use_count
> > > > is equal to zero. And once use_count is zero, it is never incremented.
> > > > So the point of the RCU grace period is to ensure that all tasks
> > > > who didn't quite call get_object() soon enough get done failing
> > > > before we free up the object, correct?
> > > >
> > > > Which means that get_object() needs to be under rcu_read_lock()...
> > >
> > > My view here is that if use_count is 0, no other thread would be able to
> > > use this object. It will also be removed from the object_list and hence
> > > no other way to get this this object.
> >
> > What if some other CPU picked up a pointer to the object just before it
> > was removed from the list? If that CPU was not under the protection of
> > rcu_read_lock(), and if that CPU was delayed, then the object could be
> > freed (and possibly re-allocated as something else) before the CPU got
> > around to doing the atomic_inc_not_zero().
>
> OK, I got it now.

It can indeed be a bit subtle, to be sure.

> > It looks to me that the code currently does the right thing here, just
> > want to make sure I understand the locking and that we don't end up
> > tempting someone later to break it. ;-)
>
> I'll document it better and make sure it's clear for me as well.

Sounds good, look forward to seeing the next version!

> > > > > +static void put_object(struct memleak_object *object)
> > > > > +{
> > > > > + unsigned long flags;
> > > > > +
> > > > > + if (!atomic_dec_and_test(&object->use_count))
> > > > > + return;
> > > > > +
> > > > > + /* should only get here after delete_object was called */
> > > > > + BUG_ON(object->flags & OBJECT_ALLOCATED);
> > > > > +
> > > > > + spin_lock_irqsave(&object_list_lock, flags);
> > > >
> > > > We also need to write-hold the object_tree_lock, not?
> > >
> > > Not here, the memleak_object is removed from the object_tree in the
> > > delete_object() function (via from the memleak_free callback). If it is
> > > in the object_tree, it should have a use_count >= 1.
> >
> > So the code never calls the last put_object() without first having
> > called delete_object() to remove it from the object_tree? The "last"
> > put_object() being the one that decrements object->use_count to zero.
>
> Yes.

Good!

> > > > > +static void *memleak_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > > > > +{
> > > > > + struct list_head *n;
> > > > > + struct memleak_object *next = NULL;
> > > > > + unsigned long flags;
> > > > > +
> > > > > + ++(*pos);
> > > > > + if (reported_leaks >= REPORTS_NR)
> > > > > + goto out;
> > > > > +
> > > > > + spin_lock_irqsave(&object_list_lock, flags);
> > > >
> > > > Using a spinlock instead of rcu_read_lock() is OK, but only if all
> > > > updates are also protected by this same spinlock. Which means that,
> > > > given that find_and_get_object read-acquires object_tree_lock, deletions
> > > > must be projected both by object_list_lock and by write-acquiring
> > > > object_tree_lock. Or all calls to memleak_seq_next need to be covered
> > > > by rcu_read_lock().
> > >
> > > The spin_lock here is only to retrieve the next object in the list but I
> > > agree that even if the object_list modifications are protected by
> > > object_list_lock, put_object could actually set the use_count to 0 and
> > > get_object return in this function isn't checked. If get_object returns
> > > successfully, I don't think an rcu_read_lock() is needed since
> > > put_object() can no longer invoke free_object_rcu().
> >
> > OK, so let me see if I understand:
> >
> > The memleak_object passed in via the "v" argument to
> > memleak_seq_next() was get_object()-ed by some prior
> > call, either an earlier memleak_seq_next() or presumably
> > by memleak_seq_start().
>
> Yes.

OK, good! I might be (slowly) catching on here.

> > memleak_seq_start() -does- do its scan under RCU protection,
> > so looks OK.
> >
> > I believe you also need RCU protection in memleak_seq_next()
> > to prevent the next memleak_object from disappearing
> > during the traversal. Yes, you do greatly decrease the
> > odds of this happening by having irqs disabled, but the
> > fact is that RCU is within its rights to end a grace
> > period during this time.
>
> I'll try to make the memleak_seq_next() function use rcu_dereference()
> and rcu_read_lock(). ATM, the "v" object has use_count >= 1 (from a
> previous get_object) and the next pointer is accessed under
> object_list_lock, so no modifications to the list (even put_object
> acquires this lock when invoking call_rcu). There is still the bug with
> not checking the get_object() return.

Fair enough!

> > Assuming that I do understand, as you say, if the get_object() in
> > memleak_seq_next() fails, we could end up accessing freed-up memory on
> > the next call to memleak_seq_next(), or even during the current one,
> > assuming an aggressive RCU or an extended NMI, SMI, burst of ECC errors
> > or some other delay. So I agree that it is necessary to check the return
> > value of get_object().
>
> Yes

Whew!

> > Also, I do see the put_object() call in memleak_seq_stop(), but it looks
> > to me that this only does a put_object() on the last memleak_object.
> > What does the put_object() on the earlier memleak_object structures that
> > were scanned by memleak_seq_next? Or is there never more than one
> > such object in a given list?
>
> The previous objects' use_count are decremented in memleak_seq_next()
> just before returning "next", so between seq_start-seq_next and
> seq_next-seq_stop, there is only one object with an incremented
> use_count. The memleak_seq_next() function may have two such objects for
> a small period of time.
>
> I actually added a test for this in memleak_scan() (if DEBUG is defined)
> and I've never got any reports. There may be some situations when for
> very short periods of time the use_count is > 1 at the beginning of a
> scan, usually when one of the memleak_scan_area or memleak_ignore
> callbacks are invoked.
>
> I'll revise the locking a bit and re-post the patches this week.

I look forward to seeing them!

Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/