Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints

From: Shakeel Butt
Date: Tue Dec 01 2020 - 19:37:49 EST


On Tue, Dec 1, 2020 at 4:16 PM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
>
> On Tue, Dec 1, 2020 at 12:53 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> >
> > +Tejun Heo
> >
> > On Tue, Dec 1, 2020 at 11:14 AM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Dec 1, 2020 at 10:42 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Dec 1, 2020 at 9:56 AM Greg Thelen <gthelen@xxxxxxxxxx> wrote:
> > > > >
> > > > > Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
> > > > >
> > > > > > On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> > > > > >>
> > > > > >> On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
> > > > > >> >
> > > > > >> > syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
> > > > > >> > is that an ongoing trace event might race with the tracepoint being
> > > > > >> > disabled (and therefore the _unreg() callback being called). Consider
> > > > > >> > this ordering:
> > > > > >> >
> > > > > >> > T1: trace event fires, get_mm_memcg_path() is called
> > > > > >> > T1: get_memcg_path_buf() returns a buffer pointer
> > > > > >> > T2: trace_mmap_lock_unreg() is called, buffers are freed
> > > > > >> > T1: cgroup_path() is called with the now-freed buffer
> > > > > >>
> > > > > >> Any reason to use the cgroup_path instead of the cgroup_ino? There are
> > > > > >> other examples of trace points using cgroup_ino and no need to
> > > > > >> allocate buffers. Also cgroup namespace might complicate the path
> > > > > >> usage.
> > > > > >
> > > > > > Hmm, so in general I would love to use a numeric identifier instead of a string.
> > > > > >
> > > > > > I did some reading, and it looks like the cgroup_ino() mainly has to
> > > > > > do with writeback, instead of being just a general identifier?
> > > > > > https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > >
> > > > I think you are confusing cgroup inodes with real filesystem inodes in that doc.
> > > >
> > > > > >
> > > > > > There is cgroup_id() which I think is almost what I'd want, but there
> > > > > > are a couple problems with it:
> > > > > >
> > > > > > - I don't know of a way for userspace to translate IDs -> paths, to
> > > > > > make them human readable?
> > > > >
> > > > > The id => name map can be built from user space with a tree walk.
> > > > > Example:
> > > > >
> > > > > $ find /sys/fs/cgroup/memory -type d -printf '%i %P\n' # ~ [main]
> > > > > 20387 init.scope
> > > > > 31 system.slice
> > > > >
> > > > > > - Also I think the ID implementation we use for this is "dense",
> > > > > > meaning if a cgroup is removed, its ID is likely to be quickly reused.
> > > > > >
> > > >
> > > > IDs for cgroup nodes (backed by kernfs) are allocated with
> > > > idr_alloc_cyclic(), which hands out the next ID after the last one
> > > > allocated and wraps only after around INT_MAX IDs. So the likelihood
> > > > of reuse is very low. Also, the file_handle returned by
> > > > name_to_handle_at() for cgroupfs contains the inode ID, which
> > > > supports the claim that ID reuse is unlikely.
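
To illustrate the difference between the two allocation policies, here
is a toy Python model. It is not the kernel's IDR implementation: the
class names DenseIds and CyclicIds are made up for this sketch, and the
only behavior modeled is "lowest free ID" (in the spirit of idr_alloc())
versus "next ID after the last one allocated, with wraparound" (in the
spirit of idr_alloc_cyclic()).

```python
class DenseIds:
    """Toy model: always hand out the lowest free ID."""

    def __init__(self):
        self.used = set()

    def alloc(self):
        candidate = 1
        while candidate in self.used:
            candidate += 1
        self.used.add(candidate)
        return candidate

    def free(self, ident):
        self.used.discard(ident)


class CyclicIds:
    """Toy model: hand out the next ID after the last one allocated,
    wrapping back to 1 only after max_id has been passed."""

    def __init__(self, max_id):
        self.used = set()
        self.max_id = max_id
        self.next_id = 1

    def alloc(self):
        # Try each ID at most once, starting from the cursor.
        for _ in range(self.max_id):
            candidate = self.next_id
            self.next_id = candidate + 1 if candidate < self.max_id else 1
            if candidate not in self.used:
                self.used.add(candidate)
                return candidate
        raise RuntimeError("no free IDs")

    def free(self, ident):
        self.used.discard(ident)
```

With the dense policy, freeing ID 1 and allocating again returns 1
immediately. With the cyclic policy, a freed ID is not handed out again
until the cursor has wrapped around the whole ID space.
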
> > >
> > > Ah, for some reason I remembered it using idr_alloc(), but you're
> > > right, it does use cyclical IDs. Even so, tracepoints which expose
> > > these IDs would still be difficult to use I think.
> >
> > The writeback tracepoint in include/trace/events/writeback.h is
> > already using the cgroup IDs. Actually, it used to use cgroup_path but
> > was converted to cgroup_ino.
> >
> > Tejun, how do you use these tracepoints?
> >
> > > Say we're trying to
> > > collect a histogram of lock latencies over the course of some test
> > > we're running. At the end, we want to produce some kind of
> > > human-readable report.
> > >
> >
> > I am assuming the test infra and the tracing infra are decoupled
> > entities and test infra is orchestrating the cgroups as well.
> >
> > > cgroups may come and go throughout the test. Even if we never re-use
> > > IDs, in order to be able to map all of them to human-readable paths,
> > > it seems like we'd need some background process to poll the
> > > /sys/fs/cgroup/memory directory tree as Greg described, keeping track
> > > of the ID<->path mapping. This seems expensive, and even if we poll
> > > relatively frequently we might still miss short-lived cgroups.
> > >
> > > Trying to aggregate such statistics across physical machines, or
> > > reboots of the same machine, is further complicated. The machine(s)
> > > may be running the same application, which runs in a container with
> > > the same path, but it'll end up with different IDs. So we'd have to
> > > collect the ID<->path mapping from each, and then try to match up the
> > > names for aggregation.
> >
> > How about adding another tracepoint in cgroup_create which will output
> > the ID along with the name or path? With a little post processing you
> > can get the same information. Also note that if the test is
> > deleting/creating the cgroup with the same name, you will miss that
> > information if filtering with just path.
> >
> > IMHO cgroup IDs will make the kernel code much simpler with the
> > tradeoff of a bit more work in user space.
>
> I like this idea! I think userspace can use the synthetic trace event
> API to construct an event which includes the strings, like the one
> I've added, if we had this separate ID<->path mapping tracepoint. If
> so, it would be just as easy for userspace to use, but it would let us
> deal with integer IDs everywhere else in the kernel, and keep the
> complexity related to dealing with buffers limited to just one place.
>
> That said, I'd prefer to pursue this as a follow-up thing, rather than
> as part of fixing this bug. Seem reasonable?

SGTM, but note that Andrew usually squashes all the patches into one
before sending them to Linus. If you plan to replace the path buffer with
integer IDs, then there is no need to spend time fixing the buffer-related
bug.
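
For reference, the user-space post-processing discussed above could look
roughly like the Python sketch below. It is only an illustration under
stated assumptions: the tuple layouts for the "cgroup created" events
(timestamp, ID, path) and the lock events (timestamp, ID, latency) are
hypothetical, as is the function name latencies_by_path. No such
cgroup-creation tracepoint exists in this thread's patch.

```python
from collections import defaultdict


def latencies_by_path(create_events, lock_events):
    """Group lock latencies by cgroup path.

    create_events: iterable of (timestamp, cgroup_id, path)   [hypothetical]
    lock_events:   iterable of (timestamp, cgroup_id, latency) [hypothetical]
    """
    # Merge both streams into one timeline. Kind 0 (create) sorts before
    # kind 1 (lock) when timestamps are equal, so a mapping is known
    # before any lock event at the same instant is resolved.
    timeline = [(ts, 0, cid, path) for ts, cid, path in create_events]
    timeline += [(ts, 1, cid, lat) for ts, cid, lat in lock_events]
    timeline.sort(key=lambda event: (event[0], event[1]))

    id_to_path = {}
    result = defaultdict(list)
    for _ts, kind, cid, payload in timeline:
        if kind == 0:
            id_to_path[cid] = payload
        else:
            # IDs seen before any create event keep a placeholder name.
            key = id_to_path.get(cid, f"<unknown id {cid}>")
            result[key].append(payload)
    return dict(result)
```

A cgroup that is deleted and recreated under the same name shows up with
two different IDs, and both still aggregate under the one path. Keying
the result on (path, ID) instead would keep the two instances separate.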