Re: [PATCH RFC v1] pidfd: fix a race in setting exit_state for pidfd polling

From: Joel Fernandes
Date: Wed Jul 17 2019 - 16:48:04 EST


On Wed, Jul 17, 2019 at 11:09:59AM -0700, Suren Baghdasaryan wrote:
> On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner <christian@xxxxxxxxxx> wrote:
> >
> > On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote:
> > > From: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > >
> > > There is a race between reading task->exit_state in pidfd_poll and writing
> > > it after do_notify_parent calls do_notify_pidfd. Expected sequence of
> > > events is:
> > >
> > > CPU 0 CPU 1
> > > ------------------------------------------------
> > > exit_notify
> > > do_notify_parent
> > > do_notify_pidfd
> > > tsk->exit_state = EXIT_DEAD
> > > pidfd_poll
> > > if (tsk->exit_state)
> > >
> > > However nothing prevents the following sequence:
> > >
> > > CPU 0 CPU 1
> > > ------------------------------------------------
> > > exit_notify
> > > do_notify_parent
> > > do_notify_pidfd
> > > pidfd_poll
> > > if (tsk->exit_state)
> > > tsk->exit_state = EXIT_DEAD
> > >
> > > This causes a polling task to wait forever, since poll blocks because
> > > exit_state is 0 and the waiting task is not notified again. A stress
> > > test continuously doing pidfd poll and process exits uncovered this bug,
> > > and the below patch fixes it.
> > >
> > > To fix this, we set tsk->exit_state before calling do_notify_pidfd.
> > >
> > > Cc: kernel-team@xxxxxxxxxxx
> > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> >
> > That means in such a situation other users will see EXIT_ZOMBIE where
> > they didn't see that before until after the parent failed to get
> > notified.
>
> I'm a little nervous about that myself even though in my stress
> testing this worked fine. I think the safest change would be to move
> do_notify_pidfd() out of do_notify_parent() and call it after
> tsk->exit_state is finalized. The downside of that is that there are 4

My initial approach to pidfd polling did it this way, and I remember there
was a break in semantics where this does not work well. We want the
notification to happen in do_notify_parent() so that it is in sync with the
wait APIs..

I don't see a risk with this patch though. But let us see what Oleg's eyes
find.

> places we call do_notify_parent(), so instead of calling
> do_notify_pidfd() one time from do_notify_parent() we will be calling
> it 4 times now.
>
> Also my original patch had memory barriers to ensure correct ordering
> of tsk->exit_state writes before reads. In this final version Joel
> removed them, so I suppose he found out they are not needed. Joel,
> please clarify.

The barriers were initially add by me to your patch, but then I felt it may
not be needed so I removed them before sending the patch. My initial concern
was something like the following:

CPU 0 CPU 1
------------------------------------------------
store tsk->exit_state = 1
/* smp_wmb() ? */
do_notify_parent
wake up
poll_wait()
/* smp_rmb(); ? */
read tsk->exit_state = 0
block...


I was initially concerned if tsk->exit_state write would be missed by the
polling thread and we would block forever (similar to this bug).

I don't think this is possible anymore since wakeup implies release-barrier
and waiting implies acquire barrier AFAIU. I am still not fully sure though,
so yeah if you guys think it is an issue, let us add the memory barriers. As
such I know memory barrier additions to the kernel requires justification,
otherwise Linus calls it "Voodoo programming". So let us convince ourself
first if memory barriers are needed before adding them anyway.

thanks,

- Joel




> Thanks!
>
> > That's a rather subtle internal change. I was worried about
> > __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it
> > seems to me that this is fine since we hold write_lock_irq(&tasklist_lock);
> > at the point when we do set p->exit_signal.
> >
> > Acked-by: Christian Brauner <christian@xxxxxxxxxx>
> >
> > Once Oleg confirms that I'm right not to worty I'll pick this up.
> >
> > Thanks!
> > Christian
> >
> > >
> > > ---
> > > kernel/exit.c | 8 +++++---
> > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > index a75b6a7f458a..740ceacb4b76 100644
> > > --- a/kernel/exit.c
> > > +++ b/kernel/exit.c
> > > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> > > if (group_dead)
> > > kill_orphaned_pgrp(tsk->group_leader, NULL);
> > >
> > > + tsk->exit_state = EXIT_ZOMBIE;
> > > if (unlikely(tsk->ptrace)) {
> > > int sig = thread_group_leader(tsk) &&
> > > thread_group_empty(tsk) &&
> > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
> > > ptrace_unlink(p);
> > >
> > > /* If parent wants a zombie, don't release it now */
> > > - state = EXIT_ZOMBIE;
> > > + p->exit_state = EXIT_ZOMBIE;
> > > if (do_notify_parent(p, p->exit_signal))
> > > - state = EXIT_DEAD;
> > > - p->exit_state = state;
> > > + p->exit_state = EXIT_DEAD;
> > > +
> > > + state = p->exit_state;
> > > write_unlock_irq(&tasklist_lock);
> > > }
> > > if (state == EXIT_DEAD)
> > > --
> > > 2.22.0.657.g960e92d24f-goog
> > >