Re: [PATCH v1 2/2] tests/pid_namespace: add pid_max tests

From: Tycho Andersen
Date: Sat Feb 24 2024 - 10:00:11 EST


On Fri, Feb 23, 2024 at 05:24:03PM +0100, Christian Brauner wrote:
> On Thu, Feb 22, 2024 at 09:54:08AM -0700, Tycho Andersen wrote:
> > On Thu, Feb 22, 2024 at 05:09:15PM +0100, Alexander Mikhalitsyn wrote:
> > > +static int pid_max_nested_limit_inner(void *data)
> > > +{
> > > +	int fret = -1, nr_procs = 400;
> > > +	int fd, ret;
> > > +	pid_t pid;
> > > +	pid_t pids[1000];
> > > +
> > > +	ret = mount("", "/", NULL, MS_PRIVATE | MS_REC, 0);
> > > +	if (ret) {
> > > +		fprintf(stderr, "%m - Failed to make rootfs private mount\n");
> > > +		return fret;
> > > +	}
> > > +
> > > +	umount2("/proc", MNT_DETACH);
> > > +
> > > +	ret = mount("proc", "/proc", "proc", 0, NULL);
> > > +	if (ret) {
> > > +		fprintf(stderr, "%m - Failed to mount proc\n");
> > > +		return fret;
> > > +	}
> > > +
> > > +	fd = open("/proc/sys/kernel/pid_max", O_RDWR | O_CLOEXEC | O_NOCTTY);
> > > +	if (fd < 0) {
> > > +		fprintf(stderr, "%m - Failed to open pid_max\n");
> > > +		return fret;
> > > +	}
> > > +
> > > +	ret = write(fd, "500", sizeof("500") - 1);
> > > +	close(fd);
> > > +	if (ret < 0) {
> > > +		fprintf(stderr, "%m - Failed to write pid_max\n");
> > > +		return fret;
> > > +	}
> > > +
> > > +	for (nr_procs = 0; nr_procs < 500; nr_procs++) {
> > > +		pid = fork();
> > > +		if (pid < 0)
> > > +			break;
> > > +
> > > +		if (pid == 0)
> > > +			exit(EXIT_SUCCESS);
> > > +
> > > +		pids[nr_procs] = pid;
> > > +	}
> > > +
> > > +	if (nr_procs >= 400) {
> > > +		fprintf(stderr, "Managed to create processes beyond the configured outer limit\n");
> > > +		goto reap;
> > > +	}
> >
> > A small quibble, but I wonder about the semantics here. "You can write
> > whatever you want to this file, but we'll ignore it sometimes" seems
> > weird to me. What if someone (CRIU) wants to spawn a pid numbered 450
> > in this case? I suppose if they read pid_max first, they'll be able to
> > tell it's impossible and can exit(1), but returning E2BIG from write()
> > might be more useful.
>
> That's a good idea. But it's a bit tricky. The straightforward thing is
> to walk upwards through all ancestor pid namespaces and use the lowest
> pid_max value as the upper bound for the current pid namespace. This
> will guarantee that you get an error when you try to write a value that
> you wouldn't be able to create. The same logic should probably apply to
> ns_last_pid as well.
>
> However, that still leaves cases where the current pid namespace writes
> a pid_max limit that is allowed (IOW, all ancestor pid namespaces are
> above that limit). But then immediately afterwards an ancestor pid
> namespace lowers the pid_max limit. So you can always end up in a
> scenario like this.

I wonder if we can push edits down too? Or render an .effective file, like
cgroups, though I prefer just putting the right thing in pid_max.
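
To make the two options concrete, here is a rough userspace sketch, not
kernel code. Everything in it is an assumption for illustration: struct
ns_model stands in for a pid namespace, write_pid_max() models a write
that fails with E2BIG when the value exceeds the lowest ancestor limit,
and effective_pid_max() models what a hypothetical .effective file might
report after an ancestor lowers its limit later on.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a pid namespace; not the kernel's struct pid_namespace. */
struct ns_model {
	struct ns_model *parent;	/* NULL for the initial namespace */
	int pid_max;			/* value last written to this namespace */
};

/*
 * Walk upwards and return the lowest pid_max among the ancestors of ns
 * (ns itself excluded), or -1 if ns has no ancestors.
 */
static int ancestor_bound(const struct ns_model *ns)
{
	const struct ns_model *p;
	int bound = -1;

	assert(ns != NULL);
	for (p = ns->parent; p; p = p->parent)
		if (bound < 0 || p->pid_max < bound)
			bound = p->pid_max;
	return bound;
}

/* Model of a write to pid_max that refuses values no ancestor would allow. */
static int write_pid_max(struct ns_model *ns, int val)
{
	int bound = ancestor_bound(ns);

	if (val <= 0)
		return -EINVAL;
	if (bound >= 0 && val > bound)
		return -E2BIG;
	ns->pid_max = val;
	return 0;
}

/* Model of a hypothetical .effective value: own limit capped by ancestors. */
static int effective_pid_max(const struct ns_model *ns)
{
	int bound = ancestor_bound(ns);

	if (bound >= 0 && bound < ns->pid_max)
		return bound;
	return ns->pid_max;
}
```

With an outer limit of 400, writing 500 in the inner namespace would
return -E2BIG in this model, while writing 300 succeeds. If the outer
namespace then drops to 200, the inner pid_max still reads 300 and only
the effective value shows 200, which is the gap the .effective variant
would paper over and the push-down variant would avoid.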

Tycho