Re: [PATCH v1 2/2] tests/pid_namespace: add pid_max tests

From: Aleksandr Mikhalitsyn
Date: Thu Feb 29 2024 - 11:23:01 EST


On Thu, Feb 29, 2024 at 4:14 PM Aleksandr Mikhalitsyn
<aleksandr.mikhalitsyn@xxxxxxxxxxxxx> wrote:
>
> On Mon, Feb 26, 2024 at 4:30 PM Tycho Andersen <tycho@tycho.pizza> wrote:
> >
> > On Mon, Feb 26, 2024 at 09:57:47AM +0100, Christian Brauner wrote:
> > > > > > A small quibble, but I wonder about the semantics here. "You can write
> > > > > > whatever you want to this file, but we'll ignore it sometimes" seems
> > > > > > weird to me. What if someone (CRIU) wants to spawn a pid numbered 450
> > > > > > in this case? I suppose they read pid_max first, they'll be able to
> > > > > > tell it's impossible and can exit(1), but returning E2BIG from write()
> > > > > > might be more useful.
> > > > >
> > > > > That's a good idea. But it's a bit tricky. The straightforward thing is
> > > > > to walk upwards through all ancestor pid namespaces and use the lowest
> > > > > pid_max value as the upper bound for the current pid namespace. This
> > > > > will guarantee that you get an error when you try to write a value that
> > > > > > you wouldn't be able to create. The same logic should probably apply to
> > > > > ns_last_pid as well.
> > > > >
> > > > > However, that still leaves cases where the current pid namespace writes
> > > > > a pid_max limit that is allowed (IOW, all ancestor pid namespaces are
> > > > > > above that limit). But then immediately afterwards an ancestor pid
> > > > > namespace lowers the pid_max limit. So you can always end up in a
> > > > > scenario like this.
> > > >
> > > > > I wonder if we can push edits down too? Or render an .effective file, like
> > >
> > > I don't think that works in the current design? The pid_max value is per
> > > struct pid_namespace. And while there is a 1:1 relationship between a
> > > child pid namespace to all of its ancestor pid namespaces there's a 1 to
> > > > many relationship between a pid namespace and its child pid namespaces.
> > > IOW, if you change pid_max in pidns_level_1 then you'd have to go
> > > through each of the child pid namespaces on pidns_level_2 which could be
> > > thousands. So you could only do this lazily. IOW, compare and possibly
> > > > update the pid_max value of the child pid namespace every time it's read
> > > or written. Maybe that .effective is the way to go; not sure right now.
>
> Hi Tycho!
>
> >
> > I wonder then, does it make sense to implement this as a cgroup thing
> > instead, which is used to doing this kind of traversal?
> >
> > Or I suppose not, since the idea is to get legacy software that's
> > writing to pid_max to work?
>
> Yes, this is mostly for legacy software that expects host-like
> behavior in the container.
> I know that folks who work on running Android inside the container are
> very-very interested in this.

My colleague, Simon Fels, shared with me:
https://android.googlesource.com/platform/bionic.git/+/refs/heads/main/docs/32-bit-abi.md#is-too-small-for-large-pids
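
To make the semantics discussed earlier in the thread concrete, here is
a minimal userspace sketch (Python, not kernel code) of the upward walk
Christian describes: a write is rejected with E2BIG when it exceeds the
lowest pid_max among the ancestors, and reads report a lazily computed
effective value so that a later change in an ancestor is still honoured.
The PidNamespace class and all function names are hypothetical and exist
only for illustration; none of this is taken from the patch.

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass
class PidNamespace:
    """Toy model of a pid namespace; NOT the kernel's struct pid_namespace."""
    pid_max: int
    parent: Optional["PidNamespace"] = None


def ancestor_bound(ns: PidNamespace) -> Optional[int]:
    """Return the lowest pid_max among the strict ancestors of ns.

    Returns None for the initial namespace, which has no ancestors.
    """
    bound = None
    cur = ns.parent
    while cur is not None:
        bound = cur.pid_max if bound is None else min(bound, cur.pid_max)
        cur = cur.parent
    return bound


def write_pid_max(ns: PidNamespace, value: int) -> None:
    """Model a write to pid_max that fails with E2BIG above the ancestor bound."""
    bound = ancestor_bound(ns)
    if bound is not None and value > bound:
        raise OSError(errno.E2BIG, "pid_max exceeds an ancestor's limit")
    ns.pid_max = value


def effective_pid_max(ns: PidNamespace) -> int:
    """Model a lazily computed '.effective' value: clamp at read time."""
    bound = ancestor_bound(ns)
    return ns.pid_max if bound is None else min(ns.pid_max, bound)


if __name__ == "__main__":
    host = PidNamespace(pid_max=4194304)
    level1 = PidNamespace(pid_max=500, parent=host)
    level2 = PidNamespace(pid_max=400, parent=level1)

    try:
        write_pid_max(level2, 600)  # above level1's limit of 500
    except OSError as err:
        print("write rejected:", errno.errorcode[err.errno])

    # The race described above: an ancestor lowers its limit afterwards.
    level1.pid_max = 300
    print("stored:", level2.pid_max, "effective:", effective_pid_max(level2))
```

The last two lines of the demo show why write-time validation alone is
not enough: the value stored in the child stays stale after the ancestor
lowers its limit, and only the value computed at read time reflects it.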

>
> Kind regards,
> Alex
>
> >
> > Tycho