Re: [Documentation] State of CPU controller in cgroup v2

From: Tejun Heo
Date: Mon Aug 29 2016 - 18:21:15 EST

Hello, Andy.

Sorry about the delay. Was kinda overwhelmed with other things.

On Sat, Aug 20, 2016 at 11:45:55AM -0700, Andy Lutomirski wrote:
> > This becomes clear whenever an entity is allocating memory on behalf
> > of someone else - get_user_pages(), khugepaged, swapoff and so on (and
> > likely userfaultfd too). When a task is trying to add a page to a
> > VMA, the task might not have any relationship with the VMA other than
> > that it's operating on it for someone else. The page has to be
> > charged to whoever is responsible for the VMA and the only ownership
> > which can be established is the containing mm_struct.
>
> This surprises me a bit. If I do access_process_vm(), then I would
> have expected the charge to go to the caller, not the mm being accessed.

It does and should go to the target mm. Who faults in a page shouldn't
be the final determinant of the ownership; otherwise, we end up in
situations where the ownership changes due to, for example,
fluctuations in page fault pattern. It doesn't make semantic sense
either. If a kthread is doing PIO for a process, why would it get
charged for the memory it's faulting in?
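
To make the rule concrete, here is a toy sketch of charging by target mm
rather than by caller. It is not kernel code: Cgroup, MM, Task and
fault_in_page() are hypothetical names made up for illustration, and the
single owner field on MM only stands in for whatever ownership the
containing mm_struct establishes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    name: str
    charged_pages: int = 0


@dataclass
class MM:
    # Toy stand-in for mm_struct: remembers which cgroup owns it.
    owner: Cgroup


@dataclass
class Task:
    name: str
    cgroup: Cgroup
    mm: Optional[MM] = None  # a kthread has no mm of its own


def fault_in_page(caller: Task, target_mm: MM) -> Cgroup:
    """Charge one page to the owner of target_mm, ignoring caller.cgroup."""
    target_mm.owner.charged_pages += 1
    return target_mm.owner


app = Cgroup("app")
system = Cgroup("system")
app_mm = MM(owner=app)

worker = Task("app-worker", cgroup=app, mm=app_mm)
kthread = Task("kthread", cgroup=system)

fault_in_page(worker, app_mm)   # synchronous fault by a member of the mm
fault_in_page(kthread, app_mm)  # fault on behalf of the mm by a foreign task
```

Both pages land on "app" and nothing lands on "system", so the charge
doesn't move around depending on which task happened to do the faulting.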

> What happens if a program calls read(2), though? A page may be
> inserted into page cache on behalf of an address_space without any
> particular mm being involved. There will usually be a calling task,
> though.

Most faults are synchronous and the faulting thread is a member of the
mm to be charged, so this usually isn't an issue. I don't think there
are places where we populate an address_space without knowing who it
is for (as opposed to, or in addition to, who the operator is).

> But this is all very memcg-specific. What about other cgroups? I/O
> is per-task, right? Scheduling is definitely per-task.

They aren't separate. Think about IOs to write out page cache, CPU
cycles spent reclaiming memory or encrypting writeback IOs. It's fine
to get more granular with specific resources but the semantics get
messy for cross-resource accounting and control without proper
scoping.

> > Consider the scenario where you have somebody faulting on behalf of a
> > foreign VMA, but the thread who created and is actively using that VMA
> > is in a different cgroup than the process leader. Who are we going to
> > charge? All possible answers seem erratic.
>
> Indeed, and this problem is probably not solvable in practice unless
> you charge all involved cgroups. But the caller's *mm* is entirely
> irrelevant here, so I don't see how this implies that cgroups need to
> keep tasks in the same process together. The relevant entities are
> the calling *task* and the target mm, and you're going to be
> hard-pressed to ensure that they belong to the same cgroup, so I think
> you need to be able to handle weird cases in which there isn't an
> obviously correct cgroup to charge.

It is an erratic case which is caused by the userland interface allowing
nonsensical configurations. We can accept it as a necessary trade-off
given big enough benefits or unavoidable constraints but it isn't
something to do willy-nilly.

> > For system-level and process-level operations to not step on each
> > other's toes, they need to agree on the granularity boundary -
> > system-level should be able to treat an application hierarchy as a
> > single unit. A possible solution is allowing rgroup hierarchies to
> > span across process boundaries and implementing cgroup migration
> > operations which treat such hierarchies as a single unit. I'm not yet
> > sure whether the boundary should be at program groups or rgroups.
>
> I think that, if the system cgroup manager is moving processes around
> after starting them and execing the final binary, there will be races
> and confusion, and no amount of granularity fiddling will fix that.

I don't see how that statement is true. For example, if you confine
the hierarchy to in-process, there is proper isolation and whether
the system agent migrates the process or not doesn't make any
difference to the internal hierarchy.

> I know nothing about rgroups. Are they upstream?

It was linked from the original message.

[7] http://lkml.kernel.org/r/20160105154503.GC5995@xxxxxxxxxxxxxxx
[RFD] cgroup: thread granularity support for cpu controller
Tejun Heo <tj@xxxxxxxxxx>

[8] http://lkml.kernel.org/r/1457710888-31182-1-git-send-email-tj@xxxxxxxxxx
[PATCHSET RFC cgroup/for-4.6] cgroup, sched: implement resource group and PRIO_RGRP
Tejun Heo <tj@xxxxxxxxxx>

[9] http://lkml.kernel.org/r/20160311160522.GA24046@xxxxxxxxxxxxxxx
Example program for PRIO_RGRP
Tejun Heo <tj@xxxxxxxxxx>

> > These base-system operations are special regardless of cgroup and we
> > already have sometimes crude ways to affect their behaviors where
> > necessary through sysctl knobs, priorities on specific kernel threads
> > and so on. cgroup doesn't change the situation all that much. What
> > gets left in the root cgroup usually are the base-system operations
> > which are outside the scope of cgroup resource control in the first
> > place and cgroup resource graph can treat the root as an opaque anchor
> > point.
>
> This seems to explain why the controllers need to be able to handle
> things being charged to the root cgroup (or to an unidentifiable
> cgroup, anyway). That isn't quite the same thing as allowing, from an
> ABI point of view, the root cgroup to contain processes and cgroups
> but not allowing other cgroups to do the same thing. Consider:

The points are: 1. we need the root to be a special container anyway,
2. allowing it to be special and contain system-wide consumptions
doesn't make the resource graph inconsistent once all non-system-wide
consumptions are put in non-root cgroups, and 3. this is the most
natural way to handle the situation from both implementation and
interface standpoints as it makes non-cgroup configuration a natural
degenerate case of cgroup configuration.

> suppose that systemd (or some competing cgroup manager) is designed to
> run in the root cgroup namespace. It presumably expects *itself* to
> be in the root cgroup. Now try to run it using cgroups v2 in a
> non-root namespace. I don't see how it can possibly work if the
> hierarchy constraints don't permit it to create sub-cgroups while it's
> still in the root. In fact, this seems impossible to fix even with
> user code changes. The manager would need to simultaneously create a
> new child cgroup to contain itself and assign itself to that child
> cgroup, because the intermediate state is illegal.

Please re-read the constraint. It doesn't prevent any organizational
operations before resource control is enabled.
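
As a rough illustration of the ordering, here is a toy model of the
constraint. It is a sketch under assumptions, not the kernel's
implementation: the Cgroup class, enable() and migrate() are hypothetical
names, and the model ignores everything except the rule that a non-root
cgroup can't both hold processes and have controllers enabled for its
children.

```python
class ConstraintError(Exception):
    """Raised when the toy model's hierarchy constraint would be violated."""


class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.procs = set()
        self.subtree_control = set()

    def mkdir(self, name):
        # Purely organizational: always allowed.
        child = Cgroup(name, parent=self)
        self.children[name] = child
        return child

    def enable(self, *controllers):
        # A non-root cgroup may not distribute resources to its children
        # while it still holds processes itself.
        if self.parent is not None and self.procs:
            raise ConstraintError(f"{self.name}: still has processes")
        self.subtree_control.update(controllers)


def migrate(pid, src, dst):
    # Symmetric check: no new processes in a non-root cgroup that
    # already has controllers enabled for its children.
    if dst.parent is not None and dst.subtree_control:
        raise ConstraintError(f"{dst.name}: resource control is enabled")
    src.procs.remove(pid)
    dst.procs.add(pid)


root = Cgroup("root")
ns_root = root.mkdir("container")  # root of the manager's namespace
ns_root.procs.add(1)               # the manager starts out here

# Enabling controllers first is what the constraint rejects.
try:
    ns_root.enable("cpu")
    early_enable_failed = False
except ConstraintError:
    early_enable_failed = True

# Organize first, then enable resource control.
manager = ns_root.mkdir("manager")
migrate(1, ns_root, manager)
ns_root.enable("cpu", "memory")
```

Creating the child and moving into it are two separate steps and neither
intermediate state is rejected; only enabling controllers while processes
are still in the namespace root fails.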

> I really, really think that cgroup v2 should supply the same
> *interface* inside and outside of a non-root namespace. If this is

It *does*. That's what I tried to explain, that it's exactly
isomorphic once you discount the system-wide consumptions.

Thanks.

--
tejun
