Re: For review: documentation of clone3() system call

From: Jann Horn
Date: Mon Oct 28 2019 - 11:12:39 EST

On Fri, Oct 25, 2019 at 6:59 PM Michael Kerrisk (man-pages)
<mtk.manpages@xxxxxxxxx> wrote:
> I've made a first shot at adding documentation for clone3(). You can
> see the diff here:
> clone3()
> The clone3() system call provides a superset of the functionality
> of the older clone() interface. It also provides a number of API
> improvements, including: space for additional flags bits; cleaner
> separation in the use of various arguments; and the ability to
> specify the size of the child's stack area.

You might want to note somewhere that its flags can't be
seccomp-filtered because they're stored in memory, making it
inappropriate to use in heavily sandboxed processes.
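
To make the seccomp point concrete, here is a rough sketch (not taken from any existing sandbox) of what a filter can do with clone3(). The names deny_clone3 and install_deny_clone3 are made up for illustration, the fallback value 435 for __NR_clone3 is an assumption that holds on x86-64 and most other architectures, and returning ENOSYS is just one possible policy.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef __NR_clone3
#define __NR_clone3 435   /* assumed: x86-64 and most other architectures */
#endif

/*
 * A classic-BPF seccomp filter only sees struct seccomp_data: the syscall
 * number, the architecture, and the six argument registers. For clone3()
 * those registers hold a pointer to struct clone_args and its size, and the
 * filter cannot dereference the pointer, so the flags are out of reach.
 * All this filter can do is refuse the call as a whole.
 * A production filter would also check seccomp_data.arch first.
 */
struct sock_filter deny_clone3[] = {
    /* Load the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* clone3: fall through to the ERRNO return; anything else: skip it. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone3, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter for the calling thread; returns 0 on success. */
int install_deny_clone3(void)
{
    struct sock_fprog prog = {
        .len = sizeof(deny_clone3) / sizeof(deny_clone3[0]),
        .filter = deny_clone3,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

With the older clone(), the same filter could instead load args[0] and test individual flag bits, because the flags are passed in a register.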

> struct clone_args {
>     u64 flags;        /* Flags bit mask */
>     u64 pidfd;        /* Where to store PID file descriptor
>                          (int *) */
>     u64 child_tid;    /* Where to store child TID,
>                          in child's memory (int *) */
>     u64 parent_tid;   /* Where to store child TID,
>                          in parent's memory (int *) */
>     u64 exit_signal;  /* Signal to deliver to parent on
>                          child termination */
>     u64 stack;        /* Pointer to lowest byte of stack */
>     u64 stack_size;   /* Size of stack */
>     u64 tls;          /* Location of new TLS */
> };
> The size argument that is supplied to clone3() should be initialized
> to the size of this structure. (The existence of the size
> argument permits future extensions to the clone_args structure.)
> The stack for the child process is specified via cl_args.stack,
> which points to the lowest byte of the stack area, and

Here and in the comment in the struct above, you say that .stack
"points to the lowest byte of the stack area", but isn't that
architecture-dependent? For most architectures, I think it should
instead be "is the initial stack pointer", with the exception of IA64
(and maybe others, I'm not sure). For example, on X86, when launching
a thread with an initially empty stack, it points directly *after* the
end of the stack area.
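
For illustration, this is roughly what the "initial stack pointer" convention looks like with the existing glibc clone() wrapper. It is a sketch that assumes a downward-growing stack (x86 and most other architectures, not ia64); child_fn, initial_stack_pointer, run_child_on_new_stack and the 64 KiB stack size are all made up for the example.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size for the example */

/* Runs in the child; with CLONE_VM the store is visible to the parent. */
static int child_fn(void *arg)
{
    *(volatile int *)arg = 42;
    return 0;
}

/*
 * Assumes the stack grows down: the initial stack pointer of an empty
 * stack is the address one past the highest byte of the mapping.
 */
void *initial_stack_pointer(void *lowest_byte, size_t size)
{
    return (char *)lowest_byte + size;
}

/* Returns the value the child stored, or -1 on error. */
int run_child_on_new_stack(void)
{
    int shared = 0;
    int status;
    pid_t pid;
    char *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

    if (stack == MAP_FAILED)
        return -1;

    /* Pass the top of the area, not its lowest byte. */
    pid = clone(child_fn, initial_stack_pointer(stack, CHILD_STACK_SIZE),
                CLONE_VM | SIGCHLD, &shared);
    if (pid == -1 || waitpid(pid, &status, 0) == -1) {
        munmap(stack, CHILD_STACK_SIZE);
        return -1;
    }

    munmap(stack, CHILD_STACK_SIZE);
    return shared;
}
```

Passing the lowest byte here instead would make the child's first push land below the mapping.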

> cl_args.stack_size, which specifies the size of the stack in
> bytes. In the case where the CLONE_VM flag (see below) is specified,

stack_size is ignored on most architectures.

> a stack must be explicitly allocated and specified. Otherwise, these
> two fields can be specified as NULL and 0, which
> causes the child to use the same stack area as the parent (in the
> child's own virtual address space).
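
As a sketch of the NULL/0 case described here: without CLONE_VM, a raw clone3() call behaves much like fork(). The struct below just copies the layout quoted above under a made-up name (clone3_args_v0) so that it builds without recent kernel headers; fork_via_clone3 is likewise hypothetical, and 435 is an assumed syscall number that holds on x86-64 and most other architectures.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_clone3
#define __NR_clone3 435   /* assumed: x86-64 and most other architectures */
#endif

/* Local copy of the layout quoted above (hypothetical name). */
struct clone3_args_v0 {
    uint64_t flags;
    uint64_t pidfd;
    uint64_t child_tid;
    uint64_t parent_tid;
    uint64_t exit_signal;
    uint64_t stack;
    uint64_t stack_size;
    uint64_t tls;
};

/*
 * Returns the child's exit status (0 here), or a negative errno value,
 * for example -ENOSYS on kernels that predate clone3().
 */
int fork_via_clone3(void)
{
    struct clone3_args_v0 args;
    int status;
    long pid;

    memset(&args, 0, sizeof(args));   /* stack = 0, stack_size = 0 */
    args.exit_signal = SIGCHLD;

    /* Only the pointer and the size travel in registers. */
    pid = syscall(__NR_clone3, &args, sizeof(args));
    if (pid == 0)
        _exit(0);                     /* child: runs on a copy of our stack */
    if (pid == -1)
        return -errno;

    if (waitpid((pid_t)pid, &status, 0) == -1)
        return -errno;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -ECHILD;
}
```
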
> Equivalence between clone() and clone3() arguments
> Unlike the older clone() interface, where arguments are passed
> individually, in the newer clone3() interface the arguments are
> packaged into the clone_args structure shown above. This structure
> allows for a superset of the information passed via the
> clone() arguments.
> The following table shows the equivalence between the arguments of
> clone() and the fields in the clone_args argument supplied to
> clone3():
> clone()         clone(3)        Notes
>                 cl_args field
> flags & ~0xff   flags
> parent_tid      pidfd           See CLONE_PIDFD
> child_tid       child_tid       See CLONE_CHILD_SETTID
> parent_tid      parent_tid      See CLONE_PARENT_SETTID
> flags & 0xff    exit_signal
> stack           stack
> ---             stack_size

(except that on ia64, stack_size also exists in clone2(), and if
you're not on ia64, stack_size doesn't do anything, at least on X86,
so showing them side by side like this doesn't really make sense)
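
The two flags rows of the table are the part that maps cleanly; they amount to a mask split on the low byte. A small sketch, with struct split_flags and split_clone_flags being names invented for the example:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdint.h>

/* The two clone3() fields that the single clone() flags word turns into. */
struct split_flags {
    uint64_t flags;        /* clone() flags & ~0xff */
    uint64_t exit_signal;  /* clone() flags & 0xff  */
};

struct split_flags split_clone_flags(unsigned long clone_flags)
{
    struct split_flags out;

    /* CSIGNAL is the 0xff mask that <sched.h> provides for the signal byte. */
    out.exit_signal = clone_flags & CSIGNAL;
    out.flags = clone_flags & ~(unsigned long)CSIGNAL;
    return out;
}
```

For example, split_clone_flags(CLONE_VM | SIGCHLD) yields flags == CLONE_VM and exit_signal == SIGCHLD.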