Re: [Linux] Linux PID algorithm is BRAINDEAD!

From: Theodore Ts'o
Date: Sat Oct 10 2015 - 17:58:58 EST


On Fri, Oct 09, 2015 at 10:00:34PM -0400, Dave Goel wrote:
>
> All that the entire system needs is one queue of free PIDs. Any time you
> need a PID, take it from the head. Any time a PID is newly freed, push it at
> the back of the queue. That's it! The overhead seems minimal to me.
>
> The queue is initially populated by 2-32768, of course.

The worst-case overhead is 64k -- 2 bytes times 32k pids. You can
use a 64k circular buffer to store the list of free pids, sure. So
the RAM utilization isn't _that_ bad, except that you need to keep one
of these buffers for each pid namespace. For systems using a large
number of containers, with lots of pid namespaces for isolation
purposes, that memory overhead is not necessarily going to be
considered cheap.
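
To put numbers to that, here's what one of those per-namespace queues
would look like as a minimal userspace sketch in C. The struct name,
the fields, and the pthread lock are my own stand-ins for
illustration, not anything that exists in the actual kernel:

    #include <stdint.h>
    #include <pthread.h>

    #define PID_MAX 32768

    /* One of these per pid namespace.  The slots[] array alone is
     * roughly 64KB: 2 bytes times ~32K pids. */
    struct free_pid_queue {
            pthread_mutex_t lock;        /* stand-in for a kernel spinlock */
            uint32_t head;               /* next free pid is taken from here */
            uint32_t tail;               /* freed pids are pushed back here */
            uint32_t count;              /* number of pids currently queued */
            uint16_t slots[PID_MAX - 1]; /* pids 2 through 32768 */
    };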

But that's actually not the biggest problem. The biggest problem is
that accessing this free pid queue becomes a locking bottleneck ---
especially on a very large NUMA system, and especially on a system
where people are running tons of shell scripts that are launching
processes all the time. So in other words, the systems which are most
likely to suffer from pid wraparound are also the systems that will be
punished the most by needing to take a lock each time you allocate a
new pid.
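
And here is why it's a bottleneck: continuing the sketch above, every
pid allocation and every pid free has to take the same q->lock, no
matter which CPU it happens on (again, hypothetical code, not a real
kernel patch):

    /* Every fork() on every CPU has to take q->lock... */
    static int alloc_pid_fifo(struct free_pid_queue *q)
    {
            int pid = -1;

            pthread_mutex_lock(&q->lock);
            if (q->count > 0) {
                    pid = q->slots[q->head];
                    q->head = (q->head + 1) % (PID_MAX - 1);
                    q->count--;
            }
            pthread_mutex_unlock(&q->lock);
            return pid;                  /* -1 means no free pids */
    }

    /* ...and every exit() has to take the very same lock. */
    static void free_pid_fifo(struct free_pid_queue *q, uint16_t pid)
    {
            pthread_mutex_lock(&q->lock);
            q->slots[q->tail] = pid;
            q->tail = (q->tail + 1) % (PID_MAX - 1);
            q->count++;
            pthread_mutex_unlock(&q->lock);
    }

On a large NUMA machine the cache line holding that lock ping-pongs
between sockets on every fork() and exit() in the system.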

Given that there *are* people who use Linux on systems with hundreds
of CPUs, where global locks are exquisitely painful, the solution
you've outlined is not something that could be the only one available
in the kernel.

In addition, most people don't find the workarounds to be terribly
onerous. Using "trap" to catch signals -- for example,
trap 'rm -rf "$tmpdir"' EXIT -- and then having the shell script
clean up after itself (so you don't need to depend on a cleaner
program) is not terribly hard, and is considered best practice.

So adding something complex into the kernel just to work around sloppy
shell scripts doesn't seem like something that most people would
consider a great use of resources. And if it causes significant
performance regressions in kernel scalability for large NUMA systems,
people will probably be even less interested in implementing something
just for the convenience of sloppy shell script programmers. Telling
kernel developers that Linux's PID algorithm is braindead isn't going
to help. :-)


So what to do instead? I'm going to assume you are in an environment
where you have a huge number of legacy shell scripts and fixing them
all is too hard (tm). What then? Well, the fact that you are talking
about running some kind of task cleaner means that in all likelihood
you're operating in some kind of structured environment where you know
that temp files will have a certain format.

So in that world, what I'd suggest is that you have all of the jobs be
started from a management daemon. For the sake of argument, let's
call that management daemon a "borglet"[1]. The borglet starts each
of your jobs running on your machine, so it knows when the job exits,
and when it does, you can just have the borglet delete the job's
entire task directory. For bonus points you could have the borglet
use container technology to control the amount of cpu, memory,
networking, and disk time used by a particular job, but if you need
all of that functionality, it might be simpler for you to grab
Kubernetes and
just use that to control your jobs. :-)

[1] http://thenewstack.io/google-lifts-the-veil-on-borg-revealing-apache-auroras-heritage/
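
Just to make the shape of this concrete, here's a rough C sketch of
the borglet's core loop -- fork/exec the job, waitpid() for it, then
blow away its task directory. Everything below, from the
/var/borglet/tasks layout to the function name, is hypothetical; it's
the structure that matters:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/wait.h>

    /* Hypothetical layout: each job gets /var/borglet/tasks/<name>. */
    static int run_job(const char *name, char *const argv[])
    {
            char taskdir[256], cmd[320];
            pid_t pid;
            int status = 0;

            snprintf(taskdir, sizeof(taskdir),
                     "/var/borglet/tasks/%s", name);
            mkdir(taskdir, 0700);

            pid = fork();
            if (pid < 0)
                    return -1;
            if (pid == 0) {
                    /* Steer the job's temp files into its task dir. */
                    setenv("TMPDIR", taskdir, 1);
                    execvp(argv[0], argv);
                    _exit(127);          /* exec failed */
            }

            waitpid(pid, &status, 0);    /* we know exactly when it exits */

            /* The job is gone, so nothing it left behind can be
             * mistaken for a live process's files, no matter how its
             * pid gets reused later. */
            snprintf(cmd, sizeof(cmd), "rm -rf '%s'", taskdir);
            system(cmd);
            return status;
    }

Note there's no pid-matching heuristic anywhere in there: cleanup is
keyed off the job, not off the pid.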

The advantage of using Kubernetes is that when you're ready to take a
leap into cloud computing, it will be a lot less work since you will
have already structured your jobs in a way that makes this easy. :-)

Cheers,

- Ted