Re: [PATCH v2 0/4] Reference count checker and related fixes

From: Ian Rogers
Date: Fri Jan 28 2022 - 01:25:17 EST

Next message: John Hubbard: "Re: [PATCH v2] mm: Fix invalid page pointer returned with FOLL_PIN gups"
Previous message: Herbert Xu: "Re: [PATCH] crypto: sun8i-ss: really disable hash on A80"
In reply to: Masami Hiramatsu: "Re: [PATCH v2 0/4] Reference count checker and related fixes"
Next in thread: Masami Hiramatsu: "Re: [PATCH v2 0/4] Reference count checker and related fixes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Jan 27, 2022 at 9:24 PM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Thu, 27 Jan 2022 13:33:23 -0800
> Ian Rogers <irogers@xxxxxxxxxx> wrote:
>
> > On Tue, Jan 25, 2022 at 12:46 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> > >
> > > This v2 patch set has the main reference count patch for cpu map from
> > > the first set and then adds reference count checking to nsinfo. The
> > > reference count checking on nsinfo helped diagnose a data race bug
> > > which is fixed in the independent patches 2 and 3.
> > >
> > > The perf tool has a class of memory problems where reference counts
> > > are used incorrectly. Memory/address sanitizers and valgrind don't
> > > provide useful ways to debug these problems, you see a memory leak
> > > where the only pertinent information is the original allocation
> > > site. What would be more useful is knowing where a get fails to have a
> > > corresponding put, where there are double puts, etc.
> > >
> > > This work was motivated by the roll-back of:
> > > https://lore.kernel.org/linux-perf-users/20211118193714.2293728-1-irogers@xxxxxxxxxx/
> > > where fixing a missed put resulted in a use-after-free in a different
> > > context. There was a sense in fixing the issue that a game of
> > > wac-a-mole had been embarked upon in adding missed gets and puts.
> > >
> > > The basic approach of the change is to add a level of indirection at
> > > the get and put calls. Get allocates a level of indirection that, if
> > > no corresponding put is called, becomes a memory leak (and associated
> > > stack trace) that leak sanitizer can report. Similarly if two puts are
> > > called for the same get, then a double free can be detected by address
> > > sanitizer. This can also detect the use after put, which should also
> > > yield a segv without a sanitizer.
> > >
> > > Adding reference count checking to cpu map was done as a proof of
> > > concept, it yielded little other than a location where the use of get
> > > could be cleaner by using its result. Reference count checking on
> > > nsinfo identified a double free of the indirection layer and the
> > > related threads, thereby identifying a data race as discussed here:
> > > https://lore.kernel.org/linux-perf-users/CAP-5=fWZH20L4kv-BwVtGLwR=Em3AOOT+Q4QGivvQuYn5AsPRg@xxxxxxxxxxxxxx/
> > > Accordingly the dso->lock was extended and use to cover the race.
> > >
> > > An alternative that was considered was ref_tracker:
> > > https://lwn.net/Articles/877603/
> > > ref_tracker requires use of a reference counted struct to also use a
> > > cookie/tracker. The cookie is combined with data in a ref_tracker_dir
> > > to spot double puts. When an object is finished with leaks can be
> > > detected, as with this approach when leak analysis happens. This
> > > approach was preferred as it doesn't introduce cookies, spots use
> > > after put and appears moderately more neutral to the API. Weaknesses
> > > of the implemented approcah are not being able to do adhoc leak
> > > detection and a preference for adding an accessor API to structs. I
> > > believe there are other issues and welcome suggestions.
> >
> > And so we've been here before (Dec 2015 to be exact). Namhyung pointed me to:
> > https://lore.kernel.org/all/20151209021047.10245.8918.stgit@localhost.localdomain/
> > by Masami Hiramatsu. In this work he adds a leak sanitizer style
> > reference count checker that will describe locations of puts and gets
> > for diagnosis. Firstly that's an awesome achievement! This work is
> > different in that it isn't trying to invent a leak sanitizer, it is
> > just using the existing one. By adding a level of indirection this
> > work can catch use after put and pairs gets with puts to make lifetime
> > analysis more automatic. An advantage of Masami's work is that it
> > doesn't change data-structures and after the initial patch-set is
> > somewhat transparent. Overall I prefer the approach in these patches,
> > future patches can look to clean up the API as Masami has.
>
> Thanks for referring my series :-D The series aimed to solve the refcount
> usage issue in the perf which lead the object leaks. At that moment,
> I found that there were 2 patterns, refcount start from 0 and start from 1.
> That made me confused what I should do for using a object.
> But the perf uses linux/refcount.h now, I hope such issue has already gone.
> (but the object leakage seems not fixed fully yet, as you found.)
>
> BTW, I think the introducing UNWRAP_* macro may put a burden on future
> development. If it is inevitable, we should consider it as carefully as
> possible. Or, it may cause another issue (it is easily missed that the new
> patch does not use UNWRAP_* for object reference, because it is natual.)
>
> So I agree with you that you to clean up the API. :-)
> I think we can make yet another refcount.h for user space debugging and
> replace it with the linux/refcount.h.

Thanks Masami,

Agreed on the UNWRAP_ macros, hence wanting to hide them behind
accessors. Making accessors could be automated with macros, for
example, have a list of variables, have a macro declare the struct
using the list, another macro can use the list to declare accessors. I
didn't find adding the UNWRAP_ macros in this change particularly
burdensome as any use of the wrapping pointer as the original type
caused a compile time error telling you what and where to fix. The
macro is extra stuff in the way of using just the raw object, but
that's fairly typical in C++ with shared_ptr, scoped_lock, etc. The
question is, is it worth it to make sure use of the reference counted
object is correct and misuse is easier to diagnose? I think it is near
as least offensive as possible while providing the best information -
hence being able to solve the dso->nsinfo put data race, that has been
a problem to solve up to this point. I'm open to better suggestions
though :-)

Thanks again,
Ian

> Thank you,
>
> --
> Masami Hiramatsu <mhiramat@xxxxxxxxxx>

Next message: John Hubbard: "Re: [PATCH v2] mm: Fix invalid page pointer returned with FOLL_PIN gups"
Previous message: Herbert Xu: "Re: [PATCH] crypto: sun8i-ss: really disable hash on A80"
In reply to: Masami Hiramatsu: "Re: [PATCH v2 0/4] Reference count checker and related fixes"
Next in thread: Masami Hiramatsu: "Re: [PATCH v2 0/4] Reference count checker and related fixes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]