Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
From: Paul Turner
Date: Tue Aug 18 2015 - 00:04:32 EST
Apologies for the repeat. Gmail ate its plain text setting for some
reason. Shame bells.
On Mon, Aug 17, 2015 at 9:02 PM, Paul Turner <pjt@xxxxxxxxxx> wrote:
>
>
> On Wed, Aug 5, 2015 at 7:31 AM, Tejun Heo <tj@xxxxxxxxxx> wrote:
>>
>> Hello,
>>
>> On Wed, Aug 05, 2015 at 11:10:36AM +0200, Peter Zijlstra wrote:
>> > > I've been thinking about it and I'm now convinced that cgroups is
>> > > just the wrong interface to require each application to program
>> > > against.
>> >
>> > But people are doing it. So you must give them something. You cannot
>> > just tell them to go away.
>>
>> Sure, more on specifics later, but, first of all, the transition to v2
>> is a gradual process. The new and old hierarchies can co-exist, so
>> nothing forces abrupt transitions. Also, we do want to start as
>> restricted as possible and then widen it gradually as necessary.
>>
>> > So where are the people doing this in this discussion? Or are you
>> > one-sidedly forcing things? IIRC Google was doing this.
>>
>> We've been having those discussions for years in person and on the
>> cgroup mailing list. IIRC, the google case was for blkcg, where they
>> have an IO proxy process which wants to issue IOs as different cgroups
>> depending on who the original issuer is. They created multiple
>> threads, put them in different cgroups and bounced the IOs to the
>> matching one; however, this is already pretty silly as they have to
>> bounce IOs to different threads. What makes a lot more sense here is
>> the ability to tag an IO as coming from a specific cgroup (or a
>> process's cgroup), and there was discussion of using an extra field in
>> the aio request to indicate this, which is a far better solution to
>> the problem, can also express per-IO priority, and is pretty easy to
>> implement.
>>
>
> So we have two major types of use that are relevant to this interface:
>
> 1) Proxy agents. When a control system wants to perform work on behalf of a
> container, it will sometimes move the acting thread into the relevant
> control groups so that the work can be accounted on that container's behalf.
> [This is more relevant for non-persistent resources such as CPU time or I/O
> priorities than charges that will outlive the work such as memory
> allocations.]
>
> I agree (1) is at best a bit of a hack and can be worked around on the
> time-frames these interfaces move at.
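>
> To make (1) concrete, below is a minimal sketch of the proxy pattern as it
> exists today under cgroup v1 (the same bounce-to-a-thread shape Tejun
> describes for the blkcg proxy). Paths and group names are illustrative:
>
>     /* Proxy-agent hack: move the acting thread into the container's
>      * group so the work is accounted there, then move back. */
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <sys/syscall.h>
>
>     static int enter_cgroup(const char *tasks_path)
>     {
>             FILE *f = fopen(tasks_path, "w");
>             if (!f)
>                     return -1;
>             /* Writing a TID (not a PID) to "tasks" moves just this thread. */
>             fprintf(f, "%ld\n", (long)syscall(SYS_gettid));
>             return fclose(f);
>     }
>
>     int main(void)
>     {
>             enter_cgroup("/sys/fs/cgroup/cpu/container_a/tasks");
>             /* ... work here is billed to container_a's cpu controller ... */
>             enter_cgroup("/sys/fs/cgroup/cpu/agent/tasks");
>             return 0;
>     }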
>
> 2) Control within an address-space. For subsystems with fungible resources,
> e.g. CPU, it can be useful for an address space to partition its own
> threads. Losing the capability to do this against the CPU controller
> would, for instance, be a large set-back. Occasionally it is useful to
> share these groupings between address spaces when processes are
> cooperative, but that is less of a requirement.
>
> This is important to us.
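>
> As a sketch of (2), assuming a delegated cpu-controller directory at an
> illustrative path, an application might partition its own threads like
> this:
>
>     /* Partition our own threads between two subgroups of our delegated
>      * cpu hierarchy (cgroup v1; layout and weights illustrative). */
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <sys/stat.h>
>     #include <sys/syscall.h>
>
>     #define APP "/sys/fs/cgroup/cpu/myapp"
>
>     static void write_long(const char *path, long val)
>     {
>             FILE *f = fopen(path, "w");
>             if (f) {
>                     fprintf(f, "%ld\n", val);
>                     fclose(f);
>             }
>     }
>
>     /* Each worker thread calls this once to pick its partition. */
>     static void join_group(const char *group)
>     {
>             char path[256];
>             snprintf(path, sizeof(path), APP "/%s/tasks", group);
>             write_long(path, (long)syscall(SYS_gettid));
>     }
>
>     int main(void)
>     {
>             mkdir(APP "/latency", 0755);
>             mkdir(APP "/batch", 0755);
>             write_long(APP "/latency/cpu.shares", 8192);  /* favored */
>             write_long(APP "/batch/cpu.shares", 512);
>             join_group("latency");
>             return 0;
>     }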
>
>
>> > The whole libvirt trainwreck also does this (the programming against
>> > cgroups, not the per task thing afaik).
>>
>> AFAIK, libvirt is doing multiple backends anyway and as long as the
>> delegation rules are clear, libvirt managing its own subhierarchy is
>> not a problem. It's an administration software stack which requires
>> fairly close integration with the userland part of the operating system.
>>
>> > You also cannot mandate system-disease, not everybody will want to run
>> > that monster. From what I understood last time, Google has no interest
>> > whatsoever in using it.
>>
>> But what would require tight coupling of individual applications and
>> something like systemd is the kernel failing to set up a reasonable
>> boundary between management and application interfaces. If the kernel
>> provides a usable API for individual applications to use, they'll
>> program against it and the management part can be whatever. If we
>> fail to do that, individual applications will have to talk to an
>> external agent to coordinate access to the management interface,
>
>
> It's notable here that for a managed system, the agent coordinating access
> *must* be external.
>
>>
>> and that's what'll
>> end up creating hard dependencies on specific system agents in
>> applications like apache or mysql or whatever. We really don't want
>> that. The kernel *NEEDS* to clearly distinguish those two to prevent
>> that from happening.
>>
>> > > I wrote this in the CAT thread too but cgroups may be an
>> > > okay management / administration interface but is a horrible
>> > > programming interface to be used by individual applications.
>> >
>> > Yeah, I need to catch up on that CAT thread, but the reality is, people
>> > use it as a programming interface, whether you like it or not.
>>
>> And that's one of the major fuck ups on cgroup's part that must be
>> rectified. Look at the interface being proposed there. It's exposing
>> direct hardware details w/o much abstraction which is fine for a
>> system management interface but at the same time it's intended to be
>> exposed to individual applications.
>
>
> FWIW, this is something we've had no significant problems managing with
> separate mount points and file system protections. Yes, there are some
> potential warts around atomicity, but we've not found them too onerous.
>
> What I don't quite follow here is the assumption that CAT would
> necessarily be exposed to individual applications. What's wrong with
> subsystems that are primarily intended for system management agents? We
> already have several of these.
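>
> For reference, the separate-mount arrangement is nothing more exotic than
> this (sketch; the "intel_rdt" controller name follows the CAT patches
> under discussion, and the mount point is illustrative):
>
>     /* Mount a management-only controller on its own mount point and
>      * lock it down with ordinary file permissions, so applications
>      * never see it even though it lives in cgroups. */
>     #include <sys/mount.h>
>     #include <sys/stat.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>             mkdir("/sys/fs/cgroup/rdt", 0755);
>             mount("cgroup", "/sys/fs/cgroup/rdt", "cgroup", 0, "intel_rdt");
>             /* Only the management agent (root here) can write below. */
>             chown("/sys/fs/cgroup/rdt", 0, 0);
>             chmod("/sys/fs/cgroup/rdt", 0700);
>             return 0;
>     }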
>
>
>>
>> This lack of distinction makes
>> people skip the attention that they should be paying when they're
>> designing interface exposed to individual programs. Worse, this makes
>> these things fly under the review scrutiny that public API accessible
>> to applications usually receives. Yet, that's what these things end
>> up being. This just has to stop. cgroups can't continue to be this
>> ghetto shortcut to implementing half-assed APIs.
>
>
> I certainly don't disagree on this point :). But as above, I don't quite
> follow why an API being in cgroups must mean it's accessible to an
> application controlled by that group. This has certainly not been a
> requirement for our use.
>
>>
>>
>> > > For things which don't require hierarchy, the obvious thing to do is
>> > > implementing a usual syscall-like interface be it a separate syscall,
>> > > an prctl command, an ioctl or whatever.
>> >
>> > And then you get /proc extensions to observe them, then people make
>> > those /proc extensions writable, and before you know it you've got an
>> > equal or bigger mess than you started out with :-(
>>
>> What we should be doing is pushing them into the same arena as any
>> other publicly accessible API. I don't think there can be a shortcut
>> to this.
>>
>
> Are you explicitly opposed to non-hierarchical partitions, however? Cpuset
> is [typically] an example of this, where the interface wants to control
> unified properties across a set of processes without necessarily being
> usefully hierarchical. (This is just to understand your core position; I'm
> not proposing cpuset should shape *anything*.)
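>
> For example, a typical cpuset configuration is just flat sibling
> partitions (sketch; CPU and node lists are illustrative):
>
>     /* Two sibling cpusets with no deeper nesting -- the layout is one
>      * level deep and not meaningfully hierarchical. */
>     #include <stdio.h>
>     #include <sys/stat.h>
>
>     static void set(const char *path, const char *val)
>     {
>             FILE *f = fopen(path, "w");
>             if (f) {
>                     fputs(val, f);
>                     fclose(f);
>             }
>     }
>
>     int main(void)
>     {
>             mkdir("/sys/fs/cgroup/cpuset/net", 0755);
>             set("/sys/fs/cgroup/cpuset/net/cpuset.cpus", "0-3");
>             set("/sys/fs/cgroup/cpuset/net/cpuset.mems", "0");
>             mkdir("/sys/fs/cgroup/cpuset/compute", 0755);
>             set("/sys/fs/cgroup/cpuset/compute/cpuset.cpus", "4-15");
>             set("/sys/fs/cgroup/cpuset/compute/cpuset.mems", "0");
>             return 0;
>     }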
>
>>
>> > > For things which require
>> > > building a hierarchy of member threads, the right thing to do is
>> > > making it a part of the usual process hierarchy - this is *the*
>> > > hierarchy that applications are familiar with and have the facilities
>> > > to deal with, so we can, for example, add a clone or unshare flag
>> > > which puts the calling thread in a new child group and then lets it
>> > > use the aforementioned syscall-like interface to configure whatever it
>> > > wants to configure.
>> >
>> > And then you get to add support to cgroups to migrate hierarchies, is
>> > that complexity you're waiting for?
>>
>> Absolutely, if it comes to that, that's what we should do. The only
>> other option is stalling and getting locked into a half-baked interface
>> for applications, which harms not only userland but also the kernel.
>>
>> > Not to mention that it's an unwieldy interface, because then you get
>> > threads spawning threads, etc. It's impossible for the main thread to
>> > create N tasks in one subgroup and another M tasks in another subgroup.
>> >
>> > Instead they get to spawn a thread A, with which they then need to
>> > communicate to spawn a further N tasks, then spawn a thread B, and again
>> > communicate for another M tasks.
>> >
>> > That's a rather awkward change to how people usually spawn threads.
>>
>> It is within the usual purview of how userland deals with hierarchies
>> of processes / threads and I don't think it's necessarily bad and more
>> importantly I don't think the use case or the perceived awkwardness
>> justifies introducing a wholly new mechanism.
>>
>> > Also, what to do when a thread changes profile? I can imagine a
>> > situation where a task accepts a connection and depending on the kind of
>> > request it gets, gets placed into a certain sub-group.
>>
>> Migration is a very expensive operation. The obvious thing to do for
>> such cases is having pools of workers for different profiles. Also,
>> as mentioned before, for more specific cases like IO, it makes a lot
>> more sense to override things per operation rather than moving threads
>> around.
>>
>> > But there's no migration facility, so you get to go hand the work
>> > around, which is expensive.
>>
>> That's a lot cheaper than migrating.
>>
>> > If there would be a migration facility, you've just lost naming, so how
>> > are you going to denote the subgroups?
>>
>> I don't think we want migration in the sub-process hierarchy, but on the
>> off chance we do, the naming can follow the same pid/program
>> group/session id scheme, which, again, is a lot easier to deal with
>> from applications.
>
>
> I don't have many objections to hand-off versus migration above; however,
> I think that this is a big drawback. Threads are expensive to create and
> are often cached rather than released. While migration may be expensive,
> creating a new thread is more so. The ability to reconfigure a thread's
> personality at run-time is important.
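>
> For completeness, the pool-per-profile shape Tejun suggests looks roughly
> like this (sketch; queueing is elided and paths are illustrative):
>
>     /* One pool of long-lived workers per profile; each worker parks
>      * itself in its cgroup once at start-up, and requests are handed
>      * off instead of threads being migrated. */
>     #include <pthread.h>
>     #include <stdio.h>
>     #include <unistd.h>
>     #include <sys/syscall.h>
>
>     struct pool {
>             const char *tasks_path;  /* cgroup this pool lives in */
>             /* ... work queue, condvar, etc. elided ... */
>     };
>
>     static void *worker(void *arg)
>     {
>             struct pool *p = arg;
>             FILE *f = fopen(p->tasks_path, "w");
>             if (f) {
>                     /* One-time placement, no later migration. */
>                     fprintf(f, "%ld\n", (long)syscall(SYS_gettid));
>                     fclose(f);
>             }
>             for (;;)
>                     sleep(1);  /* placeholder for a blocking queue pop */
>             return NULL;
>     }
>
>     int main(void)
>     {
>             static struct pool lat = { "/sys/fs/cgroup/cpu/app/latency/tasks" };
>             static struct pool bat = { "/sys/fs/cgroup/cpu/app/batch/tasks" };
>             pthread_t t1, t2;
>             pthread_create(&t1, NULL, worker, &lat);
>             pthread_create(&t2, NULL, worker, &bat);
>             pthread_join(t1, NULL);
>             return 0;
>     }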
>
>>
>> > > In the long term, this is *way* better than
>> > > letting individual applications fumble with cgroup hierarchy
>> > > delegation and pseudo filesystem access.
>> >
>> > You're worried about the intersection between what a task does and what
>> > the administrator does, and that's a valid worry. But I'm really not
>> > convinced this is going to make it better.
>> >
>> > We already have relative file ops (openat(), mkdirat(), unlinkat()
>> > etc..) can't we make sure they do the right thing in the face of a
>> > process (hierarchy) getting migrated by the administrator.
>>
>> But those are relative to the current directory per operation and
>> there's no way to define a transaction across multiple file
>> operations. There's no way to prevent a process from being migrated
>> in between an openat() and a subsequent write().
>
>
> A forwarding /proc/thread_self/cgroup accessor, or similar, would be another
> way to address some of these issues.
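>
> A rough sketch of what that accessor gives you, and where the race Tejun
> mentions bites (the file that exists today is spelled
> /proc/thread-self/cgroup; the controller match below is naive and error
> handling is minimal):
>
>     /* Resolve the calling thread's cgroup, then do relative ops on it.
>      * Nothing stops the administrator migrating us between the lookup
>      * and the write -- dfd keeps pointing at the old group. */
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>             char line[512], path[512];
>             FILE *f = fopen("/proc/thread-self/cgroup", "r");
>             if (!f || !fgets(line, sizeof(line), f))
>                     return 1;
>             fclose(f);
>             /* v1 lines look like "3:cpu:/app/latency"; take the path. */
>             char *c = strchr(line, ':');
>             c = c ? strchr(c + 1, ':') : NULL;
>             if (!c)
>                     return 1;
>             c[strcspn(c, "\n")] = '\0';
>             snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu%s", c + 1);
>             int dfd = open(path, O_DIRECTORY | O_RDONLY);
>             if (dfd < 0)
>                     return 1;
>             /* Migration can happen right here, unnoticed. */
>             int fd = openat(dfd, "cpu.shares", O_WRONLY);
>             if (fd >= 0) {
>                     write(fd, "2048\n", 5);
>                     close(fd);
>             }
>             return 0;
>     }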
>
>>
>>
>> > That way, things at least _can_ work right, and I think being able to do
>> > the right thing trumps not being able to make a mess -- people are
>> > people, they'll always make a mess.
>>
>> It can't, at least not in the usual manner that file system operations
>> are defined. This is an interface which requires central coordination
>> (even for delegation) and a horrible one to expose to individual
>> applications.
>>
>> > > If hierarchical weight and/or bandwidth limiting for thread hierarchy
>> > > is absolutely necessary, doing this shouldn't be too difficult and I
>> > > suspect it wouldn't be all that different from autogroup.
>> >
>> > Autogroups are a bit icky and have the 'advantage' of not intersecting
>> > with regular cgroups (much). The above has intricate intersection with
>> > the cgroup stuff.
>> >
>> > As said, your migrate process becomes a move hierarchy. You further get
>> > more 'hidden' cgroups. /proc files that report what cgroup a task is in
>> > will report a cgroup that's not actually present in the filesystem
>> > (autogroups already does this, it confuses people). And as stated you
>> > take away a lot of things that are now possible.
>>
>> I don't think it's a lot that per-process is gonna take away.
>> Per-thread use cases are pretty niche to begin with and most can and
>> should be implemented better using a more fitting mechanism. As for
>> having to deal with more complexity in cgroup core, that's fine. If
>> it comes to that, we'll have to bite the bullet and do it. Sure, we
>> want to be simpler but not at the cost of messing up userland API and
>> please note that what we lost with cgroups is this tension.
>
>
> I don't quite agree here. Losing per-thread control within the cpu
> controller is likely going to mean that much of it ends up being
> reimplemented as some duplicate-in-appearance interface that gets us back to
> where we are today. I recognize that these controllers (cpu, cpuacct) are
> square pegs in that per-process makes sense for most other sub-systems; but
> unfortunately, their needs and use-cases are real / dependent on their
> present form.
>
>>
>> This tension between the difficulty and complexity of implementing
>> something which can be used by applications and the necessity or
>> desirability of the proposed use cases is crucial in steering kernel
>> development and the APIs it exposes. Abusing cgroups like we've been
>> doing bypasses that tension, and we of course end up locked into
>> extremely crappy interfaces and mechanisms which could never have been
>> justified in the first place. It is about time we stopped this
>> disaster train.
>>
>> Thanks.
>>
>> --
>> tejun
>
>