Re: [PATCH v2 1/2] fork: extend clone3() to support CLONE_SET_TID

From: Adrian Reber
Date: Fri Aug 02 2019 - 11:10:33 EST


On Fri, Aug 02, 2019 at 03:50:54PM +0200, Christian Brauner wrote:
> On Fri, Aug 02, 2019 at 03:30:01PM +0200, Oleg Nesterov wrote:
> > On 08/02, Christian Brauner wrote:
> > >
> > > On Wed, Jul 31, 2019 at 06:12:22PM +0200, Adrian Reber wrote:
> > > > The main motivation to add CLONE_SET_TID to clone3() is CRIU.
> > > >
> > > > To restore a process with the same PID/TID CRIU currently uses
> > > > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> > > > ns_last_pid and then (quickly) does a clone(). This works most of the
> > > > time, but it is racy. It is also slow as it requires multiple syscalls.
> > >
> > > Can you elaborate how this is racy, please. Afaict, CRIU will always
> > > usually restore in a new pid namespace that it controls, right?
> >
> > Why? No. For example you can checkpoint (not sure this is correct word)
> > a single process in your namespace, then (try to restore) it.
> >
> > > What is
> > > the exact race?
> >
> > something else in the same namespace can fork() right after criu writes
> > the pid-for-restore into ns_last_pid.
>
> Ok, that makes sense. :)
> My CRIU userspace knowledge is sporadic, so I'm not sure how exactly it
> restores process trees in pid namespaces and what workloads this would
> especially help with.

Just what Oleg said. CRIU can restore processes in a new PID namespace
or in an existing one. Restoring a process into an existing PID
namespace carries the risk of a PID collision, but if the PID is not yet
in use there is no limitation from CRIU's side.

When restoring into an existing PID namespace that is also used by other
processes, it is always possible that something else fork()s between the
write to /proc/sys/kernel/ns_last_pid and the clone(), and so takes the
PID that was meant for the restored process. That is why it is racy.
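
To make the window concrete, here is a rough sketch of the
ns_last_pid sequence. It is not CRIU's actual code: the helper names
format_last_pid() and fork_with_pid_hint() are made up for this mail,
it uses fork() instead of clone() for brevity, and it assumes the caller
is allowed to write to ns_last_pid.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: format the value to write to ns_last_pid.
 * The next PID handed out is normally last_pid + 1, so to get
 * 'want' we write (want - 1). Returns the string length, or -1.
 */
int format_last_pid(char *buf, size_t len, pid_t want)
{
	int n;

	if (want < 2)
		return -1;
	n = snprintf(buf, len, "%d", (int)(want - 1));
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Hypothetical helper: try to create a child with PID 'want'.
 * Returns the PID the child actually got, or -1 on error.
 * The caller has to compare the result with 'want' and reap
 * the child with waitpid().
 */
pid_t fork_with_pid_hint(pid_t want)
{
	char buf[32];
	int len = format_last_pid(buf, sizeof(buf), want);
	ssize_t written;
	pid_t pid;
	int fd;

	if (len < 0)
		return -1;

	/* Syscall 1: open the sysctl (needs sufficient privileges). */
	fd = open("/proc/sys/kernel/ns_last_pid", O_WRONLY);
	if (fd < 0)
		return -1;

	/* Syscall 2: store (want - 1) as the last used PID. */
	written = write(fd, buf, (size_t)len);
	close(fd);
	if (written != len)
		return -1;

	/*
	 * Race window: any other task in this PID namespace that
	 * forks right here gets 'want', and our child gets a later PID.
	 */
	pid = fork();
	if (pid == 0)
		_exit(0);	/* child: nothing to do in this sketch */

	return pid;
}
```

The only way for the caller to notice that it lost the race is to check
whether the returned PID equals the requested one, and at that point the
child already exists with the wrong PID.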

Adrian