Re: KASAN: use-after-free Read in get_mem_cgroup_from_mm

From: Andrea Arcangeli
Date: Tue Mar 05 2019 - 21:05:47 EST


Hello everyone,

[ CC'ed Mike and Peter ]

On Tue, Mar 05, 2019 at 02:42:00PM +0800, zhong jiang wrote:
> On 2019/3/5 14:26, Dmitry Vyukov wrote:
> > On Mon, Mar 4, 2019 at 4:32 PM zhong jiang <zhongjiang@xxxxxxxxxx> wrote:
> >> On 2019/3/4 22:11, Dmitry Vyukov wrote:
> >>> On Mon, Mar 4, 2019 at 3:00 PM zhong jiang <zhongjiang@xxxxxxxxxx> wrote:
> >>>> On 2019/3/4 15:40, Dmitry Vyukov wrote:
> >>>>> On Sun, Mar 3, 2019 at 5:19 PM zhong jiang <zhongjiang@xxxxxxxxxx> wrote:
> >>>>>> Hi, guys
> >>>>>>
> >>>>>> I also hit the following issue. but it fails to reproduce the issue by the log.
> >>>>>>
> >>>>>> it seems to the case that we access the mm->owner and deference it will result in the UAF.
> >>>>>> But it should not be possible that we specify the incomplete process to be the mm->owner.
> >>>>>>
> >>>>>> Any thoughts?
> >>>>> FWIW syzbot was able to reproduce this with this reproducer.
> >>>>> This looks like a very subtle race (threaded reproducer that runs
> >>>>> repeatedly in multiple processes), so most likely we are looking for
> >>>>> something like few instructions inconsistency window.
> >>>>>
> >>>> I has a little doubtful about the instrustions inconsistency window.
> >>>>
> >>>> I guess that you mean some smb barriers should be taken into account.:-)
> >>>>
> >>>> Because IMO, It should not be the lock case to result in the issue.
> >>> Since the crash was triggered on x86 _most likley_ this is not a
> >>> missed barrier. What I meant is that one thread needs to executed some
> >>> code, while another thread is stopped within few instructions.
> >>>
> >>>
> >> It is weird and I can not find any relationship you had said with the issue.:-(
> >>
> >> Because It is the cause that mm->owner has been freed, whereas we still deference it.
> >>
> >> From the lastest freed task call trace, It fails to create process.
> >>
> >> Am I miss something or I misunderstand your meaning. Please correct me.
> > Your analysis looks correct. I am just saying that the root cause of
> > this use-after-free seems to be a race condition.
> >
> >
> >
> Yep, Indeed, I can not figure out how the race works. I will dig up further.

Yes it's a race condition.

We were aware about the non-cooperative fork userfaultfd feature
creating userfaultfd file descriptor that gets reported to the parent
uffd, despite they belong to mm created by failed forks.

https://www.spinics.net/lists/linux-mm/msg136357.html

The fork failure in my testcase happened because of signal pending
that interrupted fork after the failed-fork uffd context, was already
pushed to the userfaultfd reader/monitor. CRIU then takes care of
filtering the failed fork cases so we didn't want to make the fork
code more complicated just for userfaultfd.

In reality if MEMCG is enabled at build time, mm->owner maintainance
code now creates a race condition in the above case, with any fork
failure.

I pinged Mike yesterday to ask if my theory could be true for this bug
and one solution he suggested is to do the userfaultfd_dup at a point
where fork cannot fail anymore. That's precisely what we were
wondering to do back then to avoid the failed fork reports to the
non cooperative uffd monitor.

That will solve the false positive deliveries that CRIU manager
currently filters out too. From a theoretical standpoint it's also
quite strange to even allow any uffd ioctl to run on a otherwise long
gone mm created for a process that in the end wasn't even created (the
mm got temporarily fully created, but no child task really ever used
such mm). However that mm is on its way to exit_mmap as soon as the
ioclt returns and this only ever happens during race conditions, so
the way CRIU monitor works there wasn't anything fundamentally
concerning about this detail, despite it's remarkably "strange". Our
priority was to keep the fork code as simple as possible and keep
userfaultfd as non intrusive as possible.

One alternative solution I'm wondering about for this memcg issue is
to free the task struct with RCU also when fork has failed and to add
the mm_update_next_owner before mmput. That will still report failed
forks to the uffd monitor, so it's not the ideal fix, but since it's
probably simpler I'm posting it below. Also I couldn't reproduce the
problem with the testcase here yet.