Re: [PATCH 4/4] [RFC!] mm: 'struct mm_struct' reference counting debugging

From: Michal Hocko
Date: Fri Dec 16 2016 - 04:02:09 EST


On Fri 16-12-16 09:22:02, Vegard Nossum wrote:
> Reference counting bugs are hard to debug by their nature since the actual
> manifestation of one can occur very far from where the error is introduced
> (e.g. a missing get() only manifest as a use-after-free when the reference
> count prematurely drops to 0, which could be arbitrarily long after where
> the get() should have happened if there are other users). I wrote this patch
> to try to track down a suspected 'mm_struct' reference counting bug.

I definitely agree that hunting these bugs is a royal PITA, no question
about that. I am just wondering whether this has been motivated by any
particular bug recently. I do not seem to remember any such an issue for
quite some time.

> The basic idea is to keep track of all references, not just with a reference
> counter, but with an actual reference _list_. Whenever you get() or put() a
> reference, you also add or remove yourself, respectively, from the reference
> list. This really helps debugging because (for example) you always put a
> specific reference, meaning that if that reference was not yours to put, you
> will notice it immediately (rather than when the reference counter goes to 0
> and you still have an active reference).

But who is the owner of the reference? A function/task? It is not all
that uncommon to take an mm reference from one context and release it
from a different one. But I might be missing your point here.

> The main interface is in <linux/mm_ref_types.h> and <linux/mm_ref.h>, while
> the implementation lives in mm/mm_ref.c. Since 'struct mm_struct' has both
> ->mm_users and ->mm_count, we introduce helpers for both of them, but use
> the same data structure for each (struct mm_ref). The low-level rules (i.e.
> the ones we have to follow, but which nobody else should really have to
> care about since they use the higher-level interface) are:
>
> - after incrementing ->mm_count you also have to call get_mm_ref()
>
> - before decrementing ->mm_count you also have to call put_mm_ref()
>
> - after incrementing ->mm_users you also have to call get_mm_users_ref()
>
> - before decrementing ->mm_users you also have to call put_mm_users_ref()
>
> The rules that most of the rest of the kernel will care about are:
>
> - functions that acquire and return a mm_struct should take a
> 'struct mm_ref *' which it can pass on to mmget()/mmgrab()/etc.
>
> - functions that release an mm_struct passed as a parameter should also
> take a 'struct mm_ref *' which it can pass on to mmput()/mmdrop()/etc.
>
> - any function that temporarily acquires a mm_struct reference should
> use MM_REF() to define an on-stack reference and pass it on to
> mmget()/mmput()/mmgrab()/mmdrop()/etc.
>
> - any structure that holds an mm_struct pointer must also include a
> 'struct mm_ref' member; when the mm_struct pointer is modified you
> would typically also call mmget()/mmgrab()/mmput()/mmdrop() and they
> should be called with this mm_ref
>
> - you can convert (for example) an on-stack reference to an in-struct
> reference using move_mm_ref(). This is semantically equivalent to
> (atomically) taking the new reference and dropping the old one, but
> doesn't actually need to modify the reference count

This all sounds way too intrusive to me so I am not really sure this is
something we really want. A nice thing for debugging for sure but I am
somehow skeptical whether it is really worth it considering how many
those ref. count bugs we've had.

[...]
--
Michal Hocko
SUSE Labs