Re: [PATCH v1 2/2] tests/pid_namespace: add pid_max tests

From: Christian Brauner
Date: Mon Feb 26 2024 - 10:45:46 EST


On Mon, Feb 26, 2024 at 08:30:35AM -0700, Tycho Andersen wrote:
> On Mon, Feb 26, 2024 at 09:57:47AM +0100, Christian Brauner wrote:
> > > > > A small quibble, but I wonder about the semantics here. "You can write
> > > > > whatever you want to this file, but we'll ignore it sometimes" seems
> > > > > weird to me. What if someone (CRIU) wants to spawn a pid numbered 450
> > > > > in this case? I suppose they read pid_max first, they'll be able to
> > > > > tell it's impossible and can exit(1), but returning E2BIG from write()
> > > > > might be more useful.
> > > >
> > > > That's a good idea. But it's a bit tricky. The straightforward thing is
> > > > to walk upwards through all ancestor pid namespaces and use the lowest
> > > > pid_max value as the upper bound for the current pid namespace. This
> > > > will guarantee that you get an error when you try to write a value that
> > > > you would't be able to create. The same logic should probably apply to
> > > > ns_last_pid as well.
> > > >
> > > > However, that still leaves cases where the current pid namespace writes
> > > > a pid_max limit that is allowed (IOW, all ancestor pid namespaces are
> > > > above that limit.). But then immediately afterwards an ancestor pid
> > > > namespace lowers the pid_max limit. So you can always end up in a
> > > > scenario like this.
> > >
> > > I wonder if we can push edits down too? Or an render .effective file, like
> >
> > I don't think that works in the current design? The pid_max value is per
> > struct pid_namespace. And while there is a 1:1 relationship between a
> > child pid namespace to all of its ancestor pid namespaces there's a 1 to
> > many relationship between a pid namespace and it's child pid namespaces.
> > IOW, if you change pid_max in pidns_level_1 then you'd have to go
> > through each of the child pid namespaces on pidns_level_2 which could be
> > thousands. So you could only do this lazily. IOW, compare and possibly
> > update the pid_max value of the child pid namespace everytime it's read
> > or written. Maybe that .effective is the way to go; not sure right now.
>
> I wonder then, does it make sense to implement this as a cgroup thing
> instead, which is used to doing this kind of traversal?
>
> Or I suppose not, since the idea is to get legacy software that's
> writing to pid_max to work?

My personal perspective is that this is not so important. The original
motivation for this had been legacy workloads that expect to only get
pid numbers up to a certain size which would otherwise break. And for
them it doesn't matter whether that setting is applied through pid_max
or via some cgroup setting. All that matters is that they don't get pids
beyond what they expect.

So yes, from my POV we could try and make this a cgroup property. But
we should check with Tejun first whether he'd consider this a useful
addition or not.