Re: Cgroups "pids" controller does not update "pids.current" count immediately

From: Tejun Heo
Date: Fri Jun 15 2018 - 11:41:58 EST


On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote:
> The standard RLIMIT_NPROC does not suffer from such accounting
> discrepancies at any time.

RLIMIT_NPROC uses a dedicated atomic counter which is updated when the
process is reaped; however, that doesn't actually coincide with the
pid being freed. The base pid ref is put at that point, but there can
be other refs, and even after those are dropped the pid has to go
through an RCU grace period before it is actually freed.
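
As a rough illustration only (a toy model in Python, not kernel code;
the class, method and attribute names are made up for this sketch),
the difference between the two counters looks something like this:

```python
from collections import deque


class ToyPidAccounting:
    """Toy model of the two counters; not how the kernel is structured."""

    def __init__(self):
        self.nproc = 0           # RLIMIT_NPROC-style counter
        self.pids_current = 0    # pids-controller-style counter
        self._pending = deque()  # pids reaped but not yet actually freed

    def fork(self, pid):
        # Both counters go up as soon as the task is created.
        self.nproc += 1
        self.pids_current += 1

    def reap(self, pid):
        # The user-limit counter drops right away at reap time.
        self.nproc -= 1
        # The pid itself is only queued for release (other refs, RCU).
        self._pending.append(pid)

    def grace_period_elapsed(self):
        # Deferred release: only now do the pids go back to the pool,
        # so only now does the resource counter drop.
        while self._pending:
            self._pending.popleft()
            self.pids_current -= 1
```

Between reap() and grace_period_elapsed() the two counters disagree,
which is the window being observed in the report.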

They seem equivalent but serve slightly different purposes.
RLIMIT_NPROC is primarily about limiting what the user can do and
doesn't guarantee that the count actually matches resource (pid here)
consumption. The pids controller's primary role is limiting pid
consumption, i.e. no matter what happens, the cgroup must not be able
to take more than the specified number from the available pool, which
means it has to account for lazy release, draining refs and so on.
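
If a userspace tool needs the count to settle after tasks exit, one
option is to poll pids.current with a timeout instead of expecting it
to drop immediately. A minimal sketch, assuming the caller supplies
the cgroup directory; the helper names and the default timeout and
interval are hypothetical:

```python
import time
from pathlib import Path


def read_pids_current(cgroup_dir):
    """Return the integer stored in <cgroup_dir>/pids.current."""
    return int(Path(cgroup_dir, "pids.current").read_text().strip())


def wait_for_pids_at_most(cgroup_dir, target, timeout=1.0, interval=0.01):
    """Poll pids.current until it is <= target or the timeout expires.

    Returns the last value read on success and raises TimeoutError
    otherwise. The defaults are arbitrary placeholders for this sketch.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = read_pids_current(cgroup_dir)
        if current <= target:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"pids.current is still {current}, wanted <= {target}"
            )
        time.sleep(interval)
```

How long the count takes to settle depends on when the remaining refs
are dropped, so the timeout has to be picked by the caller.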

> The "memory" cgroups controller also does
> not suffer from any discrepancies -- it accounts memory usage in
> real time without any lag on process start or exit. The "tasks" file
> list is also always up-to-date.

The memory controller does the same thing, and actually far more
extensively. It's just less noticeable because people generally don't
try to control memory at the individual page level.

> Is it really technically not possible to make "pids.current" do
> accounting properly like RLIMIT_NPROC does? We were hoping to
> replace RLIMIT_NPROC with the "pids" controller.

It is of course possible, but at a cost. The cost (getting rid of the
lazy release optimizations) is just not justifiable for most cases.