Re: unhare() interface design questions and man page
From: Michael Kerrisk
Date: Thu Feb 16 2006 - 00:48:25 EST
Hi Janak,
> >>>2. Reading the code and your documentation, I see that CLONE_VM
> >>> implies CLONE_SIGHAND. Since CLONE_SIGHAND is not implemented
> >>> (i.e., results in an EINVAL error), I take it that this means
> >>> that at the moment CLONE_VM will not work (i.e., will result
> >>> in EINVAL). Is this correct? If so, I will note this in
> >>> the man page.
> >>>
> >>Actually, CLONE_SIGHAND implies CLONE_VM and not the
> >>otherway around. Currently CLONE_VM is supported, as long as
> >>singnal handlers are not being shared. That is, if you created the
> >>process using clone(CLONE_VM), which kept signal handlers
> >>different, then you can unshare VM using unshare(CLONE_VM).
> >
> >Maybe I'm being dense, and as I said I haven't actually
> >tested the interface, but your Documentation/unshare.txt
> >(7.2) says the opposite:
> >
> > If CLONE_VM is set, force CLONE_SIGHAND.
> >
> >and the code in check_unshare_flags() seems to agree:
> >
> > /*
> > * If unsharing vm, we must also unshare signal handlers
> > */
> > if (*flags_ptr & CLONE_VM)
> > *flags_ptr |= CLONE)SIGHAND;
> >
> >What am I missing?
> >
> Sorry, I think I have caused confusion by my use of "implies" and
> "forcing of flags". I am easily confused by this as well, so let me see
> if I can clarify with a picture. I will double check the documentation
> file to make sure I am using correct and unambiguous words.
>
> Consider the following 4 cases.
>
> (1) (2) (3) (4)
>
>
> SH1 SH2 SH1 SH2 SH SH
> | | \ / || / \
> | | \ / || / \
> VM1 VM2 VM VM VM1 VM2
>
>
> ok ok ok NOT ok
>
>
> You can achieve (1), (2) or (3) using clone(), but clone()
> will not let you do (4). What we want to make sure is
> that we don't endup like (4) using unshare. unshare
> makes sure that if you are trying to unshare vm, AND
> you were sharing signal handlers (case 3) before,
> then it won't allow that operation. However, if you
> were like (2), then you can unshare vm and end up
> like (1), which is allowed. So the forcing of sighand
> flag makes sure that you check for case (2) or (3).
That all makes sense. Thanks. Just returning to my
original statement though: CLONE_VM *does* imply
CLONE_SIGHAND. But that's okay. What I had
overlooked was that unshare(CLONE_SIGHAND) is
permitted, and implemented as a no-op, if the count
of threads sharing the signal structure is 1.
In other words, we can do an unshare(CLONE_SIGHAND)
as long as we did not specifiy CLONE_SIGHAND in the
previous clone() call.
By the way, I have by now done quite a bit of testing
of unshare(), and it works as documented for all the
cases I've tested. I have included a fairly
general (command-line argument driven) program at
the foot of this message. It allows you to test
various flag combinations. Maybe it is useful
to you also?
> >>>3. The naming of the 'flags' bits is inconsistent. In your
> >>> documentation you note:
> >>>
> >>> unshare reverses sharing that was done using clone(2) system
> >>> call, so unshare should have a similar interface as clone(2).
> >>> That is, since flags in clone(int flags, void *stack)
> >>> specifies what should be shared, similar flags in
> >>> unshare(int flags) should specify what should be unshared.
> >>> Unfortunately, this may appear to invert the meaning of the
> >>> flags from the way they are used in clone(2). However,
> >>> there was no easy solution that was less confusing and that
> >>> allowed incremental context unsharing in future without an
> >>> ABI change.
> >>>
> >>> The problem is that the flags don't simply reverse the meanings
> >>> of the clone() flags of the same name: they do it inconsistently.
> >>>
> >>> That is, CLONE_FS, CLONE_FILES, and CLONE_VM *reverse* the
> >>> effects of the clone() flags of the same name, but CLONE_NEWNS
> >>> *has the same meaning* as the clone() flag of the same name.
> >>> If *all* of the flags were simply reversed, that would be
> >>> a little strange, but comprehensible; but the fact that one of
> >>> them is not reversed is very confusing for users of the
> >>> interface.
> >>>
> >>> An idea: why not define a new set of flags for unshare()
> >>> which are synonyms of the clone() flags, but make their
> >>> purpose more obvious to the user, i.e., something like
> >>> the following:
> >>>
> >>> #define UNSHARE_VM CLONE_VM
> >>> #define UNSHARE_FS CLONE_FS
> >>> #define UNSHARE_FILES CLONE_FILES
> >>> #define UNSHARE_NS CLONE_NEWNS
> >>> etc.
> >>>
> >>> This would avoid confusion for the interface user.
> >>> (Yes, I realize that this could be done in glibc, but why
> >>> make the kernel and glibc definitions differ?)
> >>>
> >>I agree that use of clone flags can be confusing. At least a couple of
> >>folks pointed that out when I posted the patch. The issues was even
> >>raised when unshare was proposed few years back on lkml. Some
> >>source of confusion is the special status of CLONE_NEWNS. Because
> >>namespaces are shared by default with fork/clone, it is different than
> >>other CLONE_* flags. That's probably why it was called CLONE_NEWNS
> >>and not CLONE_NS.
> >
> >Yes, most likely.
> >
> >>In the original discussion in Aug'00, Linus
> >>said that "it makes sense that a unshare(CLONE_FILES) basically
> >>_undoes_ the sharing of clone(CLONE_FILES)"
> >>
> >>http://www.ussg.iu.edu/hypermail/linux/kernel/0008.3/0662.html
> >>
> >>So I decided to follow that as a guidance for unshare interface.
> >
> >Yes, but when Linus made that statement (Aug 2000), there
> >was no CLONE_NEWNS (it arrived in 2.4.19, Aug 2002). So
> >the inconsistency that I'm highlighting didn't exist back
> >then. As I said above: if *all* of the flags were simply
> >reversed, that would be comprehensible; but the fact that
> >one of them is not reversed is inconsistent. This &will*
> >confuse people in userland, and it seems quite
> >unnecessary to do that. Please consider this point further.
> >
> Thanks for clarification. I didn't check that namespaces cames after
> that original discussion. I still think that the confusion is not acute
> enough to warrent addition of more flags, but will run it by some
> folks to see what they think.
Let me put it this way: if you change things in the manner
I suggest, then it will cause a few kernel developers
to have to stop and think for a moment. If you leave things
as they are, then a multitude of userland programmers will be
condemned to stumble over this confusion for many years to
come.
(And yes, I appreciate that the original problem arose
with clone(), really there should have been a CLONE_NS
flag which was used as the default for fork() and exec(),
and omitted to get the CLONE_NEWNS behaviour we now have.
But extended this problem into unshare() is even
more confusing, IMHO.)
Just to emphasize this point: while testing the
various unshare() flags, I found I was myself runnng
into confusion when I tested unshare(CLONE_NEWNS).
That confusion arose precisely because CLONE_NEWNS
has the same effect in clone() and unshare(). And
I was still getting confused even though I
understood that!
Please consider changing these names. (I'm a little
surprised that no-one else has offered an opinion for
or against this point so far...)
> >>>4. Would it not be wise to include a check of the following form
> >>> at the entry to sys_unshare():
> >>>
> >>> if (flags & ~(CLONE_FS | CLONE_FILES | CLONE_VM |
> >>> CLONE_NEWNS | CLONE_SYSVSEM | CLONE_THREAD))
> >>> return -EINVAL;
> >>>
> >>> This future-proofs the interface against applications
> >>> that try to specify extraneous bits in 'flags': if those
> >>> bits happen to become meaningful in the future, then the
> >>> application behavior would silently change. Adding this
> >>> check now prevents applications trying to use those bits
> >>> until such a time as they have meaning.
> >>>
> >>I did have a similar check in the first incarnation of the patch. It
> >>was
> >>pointed out, correctly, that it is better to allow all flags so we can
> >>incrementally add new unshare functionality while not making
> >>any ABI changes. unshare follows clone here, which also does not
> >>check for extraneous bits in flags.
> >
> >I guess I need educating here. Several other system calls
> >do include such checks:
> >
> >mm/mlock.c: mlockall(2):
> >if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
> >mm/mprotect.c: mprotect(2):
> >if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
> >mm/msync.c: msync(2):
> >if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> >mm/mremap.c: mremap(2):
> >if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
> >mm/mempolicy.c: mbind(2):
> >if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
> >mm/mempolicy.c: get_mempolicy(2):
> >if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> >
> >What distinguishes unshare() (and clone()) from these?
> >
> I haven't looked at your examples in detail, but basically clone and
> unshare work on pieces of process context. It is quite possible that
> in future there may be new pieces added to the process context
> resulting in new flags. You want to make sure that you can
> incrementally add functionality for sharing and unsharing of
> new flags.
Sure -- and I do not see how my suggestion preclused
that possibility.
> >I guess I'm unclear too about this (requoted) text
> >
> >>It was
> >>pointed out, correctly, that it is better to allow all flags so we can
> >>incrementally add new unshare functionality while not making
> >>any ABI changes.
> >
> >If one restricts the range of flags that are available now
> >(prevents userland from trying to use them), and then
> >permits additional flags later, then from the point of
> >view of the old userland apps, there has been no ABI change.
> >What am I missing here?
> >
> I think the ABI change may occur if the new flag that gets added,
> somehow interacts with an existing flag (just like signal handlers and
> vm) or has a different default behavior (like namespace). I think
> that's why clone and unshare (which mimics the clone interface)
> do not check for unimplmented flags.
What you are saying here doesn't make sense to me. Here is how
I see that an ABI change can occur, and it seems to me
that it is highly undesirable:
1. Under the the current implementation, useland calls
unshare() *accidentally* specifying some additional
bits that currently have no meaning, and do not
cause an EINVAL error.
2. Later, those bits acquire meaning in unshare().
3. As a consequence, the behaviour of the old
binary application changes (perhaps crashes,
perhaps just does something new and strange)
Does this scenario not seem to be a problem to you?
If not, why not?
Cheers,
Michael
/* demo_unshare.c */
#define _GNU_SOURCE
#include <sys/types.h>
#include <string.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#define errMsg(msg) do { perror(msg); } while (0)
#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
} while (0)
#define fatalExit(msg) do { fprintf(stderr, "%s\n", msg); \
exit(EXIT_FAILURE); } while (0)
static void
usageError(char *progName)
{
fprintf(stderr, "Usage: %s [clone-arg [unshare-arg]]\n", progName);
#define fpe(str) fprintf(stderr, " " str)
fpe("args can contain the following letters:\n");
fpe(" d - file descriptors (CLONE_FILES)\n");
fpe(" f - file system information (CLONE_FS)\n");
fpe(" n - give child separate namespace (CLONE_NEWNS)\n");
fpe(" s - signal dispositions (CLONE_SIGHAND)\n");
fpe(" e - SV semaphore undo structures (CLONE_SYSVSEM)\n");
fpe(" t - place in same thread group (CLONE_THREAD)\n");
fpe(" v - signal dispositions (CLONE_VM)\n");
exit(EXIT_FAILURE);
} /* usageError */
volatile int glob = 0;
typedef struct { /* For passing info to child startup function */
int fd;
mode_t umask;
int signal;
} ChildParams;
static char *unshareFlags = NULL;
#define SYS_unshare 310
static int
charsToBits(char *flags)
{
char *p;
int cloneFlags;
cloneFlags = 0;
if (flags != NULL) {
for (p = flags; *p != '\0'; p++) {
if (*p == 'd') cloneFlags |= CLONE_FILES;
else if (*p == 'f') cloneFlags |= CLONE_FS;
else if (*p == 'n') cloneFlags |= CLONE_NEWNS;
else if (*p == 's') cloneFlags |= CLONE_SIGHAND;
else if (*p == 'e') cloneFlags |= CLONE_SYSVSEM;
else if (*p == 't') cloneFlags |= CLONE_THREAD;
else if (*p == 'v') cloneFlags |= CLONE_VM;
else usageError("Bad flag");
}
}
return cloneFlags;
} /* charsToBits */
static int /* Startup function for cloned child */
childFunc(void *arg)
{
ChildParams *cp;
int cloneFlags;
cloneFlags = 0;
printf("Child: PID=%ld PPID=%ld\n", (long) getpid(),
(long) getppid());
#define SYS_unshare 310
cloneFlags = charsToBits(unshareFlags);
printf("Child unsharing 0x%x\n\n", cloneFlags);
if (syscall(SYS_unshare, cloneFlags) == -1) errExit("unshare");
cp = (ChildParams *) arg; /* Cast arg to true form */
/* The following changes may affect parent */
glob++; /* May change memory shared with parent */
umask(cp->umask);
if (close(cp->fd) == -1) errExit("child:close");
if (signal(cp->signal, SIG_DFL) == SIG_ERR)
errExit("child:signal");
return 0;
} /* childFunc */
int
main(int argc, char *argv[])
{
const int STACK_SIZE = 65536; /* Stack size for cloned child */
char *stack; /* Start of stack buffer area */
char *stackTop; /* End of stack buffer area */
int cloneFlags; /* Flags for cloning child */
ChildParams cp; /* Passed to child function */
const mode_t START_UMASK = S_IWOTH; /* Initial umask setting */
struct sigaction sa;
int status, s;
pid_t pid;
printf("Parent: PID=%ld PPID=%ld\n", (long) getpid(),
(long) getppid());
if (argc > 2)
unshareFlags = argv[2];
/* Set up an argument structure to be passed to cloned child, and
set some process attributes that will be modified by child */
umask(START_UMASK); /* Initialize umask to some value */
cp.umask = S_IWGRP; /* Child sets umask to this value */
cp.fd = open("/dev/null", O_RDWR); /* Child will close this fd */
if (cp.fd == -1) errExit("open");
cp.signal = SIGTERM; /* Child will change disposition */
if (signal(cp.signal, SIG_IGN) == SIG_ERR) errExit("signal");
/* Initialize clone flags using command-line argument
(if supplied) */
cloneFlags = charsToBits(argv[1]);
printf("Parent clone flags 0x%x\n\n", cloneFlags);
/* Allocate stack for child */
stack = malloc(STACK_SIZE);
if (stack == NULL) errExit("malloc");
stackTop = stack + STACK_SIZE; /* Assume stack grows downwards */
if (clone(childFunc, stackTop, cloneFlags | SIGCHLD, &cp) == -1)
errExit("clone");
/* Wait for child. */
pid = waitpid(-1, &status, 0);
if (pid == -1) { /* Probably because CLONE_THREAD was used */
if (! (cloneFlags & CLONE_THREAD))
errExit("waitpid");
else
sleep(1);
}
if (WIFEXITED(status) && WEXITSTATUS(status) != 0)
fatalExit("Child failed");
printf("\nAfter wait in parent:\n");
printf(" Child PID=%ld; status=0x%x\n", (long) pid, status);
/* Check whether changes made by cloned child have
affected parent */
printf("\nParent - checking process attributes:\n");
printf(" [VM] ");
printf("glob=%d (%s)\n", glob, glob ? "changed" : "unchanged");
printf(" [FS] ");
if (umask(0) != START_UMASK)
printf("umask has changed\n");
else
printf("umask is unchanged\n");
printf(" [FILES] ");
s = write(cp.fd, "Hello world\n", 12);
if (s == -1 && errno == EBADF)
printf("file descriptor %d has been closed "
"(i.e., changed)\n", cp.fd);
else if (s == -1)
printf("write() on file descriptor %d failed (%s) "
"(unexpected!)\n", cp.fd, strerror(errno));
else
printf("write() on file descriptor %d succeeded"
"(i.e., unchanged)\n", cp.fd);
printf(" [SIGHAND] ");
if (sigaction(cp.signal, NULL, &sa) == -1) errExit("sigaction");
if (sa.sa_handler != SIG_IGN)
printf("signal disposition has changed\n");
else
printf("signal disposition is unchanged\n");
exit(EXIT_SUCCESS);
} /* main */
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance?
Grab the latest tarball at
ftp://ftp.win.tue.nl/pub/linux-local/manpages/,
read the HOWTOHELP file and grep the source
files for 'FIXME'.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/