Re: [PATCH v12 8/8] cgroup: implement the PIDs subsystem

From: Tejun Heo
Date: Thu May 28 2015 - 16:33:29 EST


Hello, sorry about the delay.

On Tue, May 19, 2015 at 12:56:31PM +0200, Peter Zijlstra wrote:
> > This has been discussed before. Organisational operations (i.e.
> > attaching to a cgroup) are not to be blocked by a cgroup controller in
> > the unified hierarchy.
>
> That's utterly insane. As argued at length in threads like:
>
> lkml.kernel.org/r/alpine.DEB.2.11.1505061100040.4225@nanos
>
> This breaks fundamental control rules and makes life for a number of
> controllers impossible.

I didn't chase that discussion because it was rather off-topic for
the scheduler.

There are several classes of distribution schemes that cgroups deal
with.

A. Ratio-based. Usually used to distribute resources which are
replenished over time: IO time, CPU cycles and so on. This
primarily doesn't deal with persistent state.

B. Limiting over-committable resources. This applies to persistent
resources like memory but also to transient ones like IO bandwidth
and iops. These all operate by limiting how much of the resource is
newly given out and thus their neutral state is the overcommitted
no-limit state.

C. Non-over-committable "hard" resources. Currently, scheduler RT
slices are the only one. These actually should be distributed by
carving out a finite whole and thus its limits can't be
over-committed. They have to behave as allocators rather than
limiters.
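
To make the "allocator rather than limiter" distinction in C concrete,
here is a minimal sketch. The names (struct pool, pool_reserve,
pool_release) are hypothetical and are not kernel API. It only
illustrates carving reservations out of a finite whole so that the sum
can never be over-committed:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical finite resource; not a kernel structure. */
struct pool {
	long total;	/* the finite whole to be carved up */
	long reserved;	/* sum of all reservations handed out */
};

/* Carve out @amount; refuse if the whole would be over-committed. */
static bool pool_reserve(struct pool *p, long amount)
{
	if (amount < 0 || amount > p->total - p->reserved)
		return false;
	p->reserved += amount;
	return true;
}

/* Give a previous reservation back to the pool. */
static void pool_release(struct pool *p, long amount)
{
	assert(amount >= 0 && amount <= p->reserved);
	p->reserved -= amount;
}
```

In this model a configuration request itself can fail, which is the
behaviour that only category C needs.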

Most persistent resources fall in the B category and we have very
clear precedents for dealing with configurations of these limits.
Just think about the NPROC ulimit or quota. They all operate by
suppressing distribution of new resources and allow the new limit
configuration to be lower than the current consumption.

There's a clear reason for this. It allows closing the race window
between configuration change and increasing resource consumption in a
very simple way - lowering the limit and checking the existing usage.
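
A minimal sketch of that limiter model, assuming C11 atomics and
hypothetical names (struct limiter, limiter_try_charge and friends are
illustrative, not the actual controller code): setting a limit never
fails, even below current usage, and only new charges are refused.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical limiter state; not a kernel structure. */
struct limiter {
	atomic_long usage;	/* units currently handed out */
	atomic_long limit;	/* may legitimately be below usage */
};

static void limiter_init(struct limiter *l, long limit)
{
	atomic_init(&l->usage, 0);
	atomic_init(&l->limit, limit);
}

/* Hand out one more unit, or fail if that would exceed the limit. */
static bool limiter_try_charge(struct limiter *l)
{
	long after = atomic_fetch_add(&l->usage, 1) + 1;

	if (after > atomic_load(&l->limit)) {
		atomic_fetch_sub(&l->usage, 1);	/* roll back the charge */
		return false;
	}
	return true;
}

/* Return one unit that was previously charged. */
static void limiter_uncharge(struct limiter *l)
{
	long before = atomic_fetch_sub(&l->usage, 1);

	assert(before > 0);
}

/*
 * Lower (or raise) the limit. This always succeeds and revokes nothing.
 * The usage read after the store is what the caller checks to see
 * whether existing consumers still need to be dealt with.
 */
static long limiter_set_limit(struct limiter *l, long new_limit)
{
	atomic_store(&l->limit, new_limit);
	return atomic_load(&l->usage);
}
```

Once the new limit is stored, no later charge can push usage further
past it, so the caller only has to look at the usage value returned.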

What Thomas suggested - building a whole new transaction model on
top - can also close the race window, but it breaks from the
convention for no good reason. It doesn't provide anything beyond
what's possible with the established model and it's outright silly
to have the NPROC controller behave so differently from the existing
mechanism which controls exactly the same resource.

> Also, I'll NAK each and every patch that will attempt to remove failing
> can_attach from the cgroup core as it will fundamentally break some
> scheduler controllers.

I was struggling with C above because only a single resource type
belongs to that category. But given that cgroups have to support it,
->can_attach() will have to be able to fail, but only for that
resource type.

> So please use it, it doesn't make any bloody sense to 'control' the
> number of PIDs but then allow it to overrun the set point.

Again, it's not about whether ->can_attach() can fail or not in terms
of implementation at all. It's about following a consistent resource
distribution model. Please don't conflate different resource types.

Thanks.

--
tejun