Re: [PATCH] pidns: Make pid_max per namespace

From: Andrew Morton
Date: Thu Mar 10 2011 - 05:46:54 EST


On Thu, 10 Mar 2011 13:06:48 +0300 Pavel Emelyanov <xemul@xxxxxxxxxxxxx> wrote:

> On 03/10/2011 12:50 PM, Andrew Morton wrote:
> > On Thu, 10 Mar 2011 12:35:32 +0300 Pavel Emelyanov <xemul@xxxxxxxxxxxxx> wrote:
> >
> >> On 03/08/2011 02:58 AM, Andrew Morton wrote:
> >>> On Thu, 03 Mar 2011 11:39:17 +0300
> >>> Pavel Emelyanov <xemul@xxxxxxxxxxxxx> wrote:
> >>>
> >>>> Rationale:
> >>>>
> >>>> On x86_64 with big RAM, people running containers set pid_max on the host to
> >>>> large values to be able to launch more containers. At the same time,
> >>>> containers running 32-bit software experience problems with large pids - ps
> >>>> calls readdir/stat on proc entries, and the inodes' i_ino values happen to be
> >>>> too big for the 32-bit API.
> >>>>
> >>>> Thus, the ability to limit the pid value inside container is required.
> >>>>
> >>>
> >>> This is a behavioural change, isn't it? In current kernels a write to
> >>> /proc/sys/kernel/pid_max will change the max pid on all processes.
> >>> After this change, that write will only affect processes in the current
> >>> namespace. Anyone who was depending on the old behaviour might run
> >>> into problems?
> >>
> >> Hardly. If the behavior of some two apps depends on its synchronous change,
> >> these two might want to run in the same pid namespace.
> >
> > I don't understand your answer. What is this "synchronous change" of which
> > you speak? Does your "might want to run" suggestion mean that userspace
> > changes would be required for this operation to again work correctly?
>
> Your concern was about "anyone who was depending on the old behaviour", where
> the old behavior meant "a write to sys.pid_max will change the max pid on all
> processes".
>
> I wanted to say that if someone changes pid_max and expects someone else to
> act differently after this, then these two should live in the same pid namespace.

So it's a non-back-compatible change to the userspace interface. uh-oh.

> IOW, if X raises the pid_max, then all the processes X sees in its pid namespace
> *may* have pids up to this value. All the other processes, which are not visible
> in X's pid space, will have other values, but X doesn't see them, so why should
> we care?

Current userspace has no *need* to be running in the same pidns to
alter the pid_max of some processes. So the chances are good that
some existing userspace takes advantage of this.

Silly example:

	if (fork() == 0) {
		/* child */
		create_new_pidns();
		start_doing_stuff();
	} else {
		/* parent */
		increase_pid_max();
	}

Another example would be logging into a system as root in the init_ns
and modifying /proc/sys/kernel/pid_max by hand.
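
For concreteness, here is a minimal sketch of what such a by-hand change
amounts to when done from C rather than from a shell. The helper names
read_pid_max() and write_pid_max() are made up for illustration; only
the /proc/sys/kernel/pid_max path comes from this thread. Writing
normally requires root, and the kernel may reject out-of-range values.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read a pid_max-style value from a sysctl file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long read_pid_max(const char *path)
{
	FILE *f = fopen(path, "r");
	long value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/*
 * Hypothetical helper: write a new value to a sysctl file.
 * Returns 0 on success, -1 on failure (for example, no permission).
 */
int write_pid_max(const char *path, long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", value) > 0;
	/* The write may only be reported as failed when the stream is flushed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would pass "/proc/sys/kernel/pid_max" as the path. Whether the
write then affects every process or only the caller's pid namespace is
exactly the behaviour under discussion here.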

I don't have a clue how much code is out there using pid namespaces,
nor how much of that code alters the default pid_max. Hard.


The proposed interface is a bit weird and hacky anyway, isn't it? We
have a single pseudo-file in a well-known location -
/proc/sys/kernel/pid_max. One would expect alteration of that
system-wide file to have system-wide effects, only that isn't the case.
Instead a modification to the system-wide file has local-pidns-only
effects. It would be much more logical to have a per-pidns pid_max
pseudo-file.

And if we do that, we then need to work out what to do with writes to
/proc/sys/kernel/pid_max. Remember the user expects those writes to
alter all processes on the machine! I guess it would be acceptable to
permit that to continue to happen - a write to /proc/sys/kernel/pid_max
will overwrite all the per-pidns pid_max settings.
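
To make those semantics concrete, here is a toy userspace model, not
kernel code. Every name in it (struct pidns_model, struct system_model,
write_ns_pid_max(), write_global_pid_max(), MAX_NS) is invented for
illustration, and it assumes a fixed, flat set of namespaces. It only
shows the rule suggested above: a per-pidns write is local, and a write
to the system-wide file overwrites every per-pidns setting.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NS 8	/* arbitrary limit for this model */

/* One pid namespace, reduced to the single setting we care about. */
struct pidns_model {
	int pid_max;
};

/* The whole machine: a flat array of namespaces. */
struct system_model {
	struct pidns_model ns[MAX_NS];
	size_t nr_ns;
};

/* Write to the hypothetical per-pidns file: only that namespace changes. */
void write_ns_pid_max(struct system_model *sys, size_t idx, int value)
{
	assert(idx < sys->nr_ns);
	sys->ns[idx].pid_max = value;
}

/*
 * Write to the system-wide file: keep the old system-wide meaning by
 * overwriting the setting of every namespace in the model.
 */
void write_global_pid_max(struct system_model *sys, int value)
{
	size_t i;

	for (i = 0; i < sys->nr_ns; i++)
		sys->ns[i].pid_max = value;
}
```

With three namespaces, a global write of 32768 followed by a per-pidns
write of 4096 to namespace 1 leaves the other two at 32768, and a later
global write of 65536 resets all three, including namespace 1.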
