Re: [Documentation] State of CPU controller in cgroup v2

From: Andy Lutomirski
Date: Fri Sep 16 2016 - 14:20:13 EST

On Fri, Sep 16, 2016 at 9:50 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Fri, Sep 16, 2016 at 09:29:06AM -0700, Andy Lutomirski wrote:
>> > SCHED_DEADLINE, it's a 'Global'-EDF like scheduler that doesn't support
>> > CPU affinities (because that doesn't make sense). The only way to
>> > restrict it is to partition.
>> >
>> > 'Global' because you can partition it. If you reduce your system to
>> > single CPU partitions you'll reduce to P-EDF.
>> >
>> > (The same is true of SCHED_FIFO, that's a 'Global'-FIFO on the same
>> > partition scheme, it however does support sched_affinity, but using it
>> > gives 'interesting' schedulability results -- call it a historic
>> > accident).
>> Hmm, I didn't realize that the deadline scheduler was global. But
>> ISTM requiring the use of "exclusive" to get this working is
>> unfortunate. What if a user wants two separate partitions, one using
>> CPUs 1 and 2 and the other using CPUs 3 and 4 (with 5 reserved for
>> non-RT stuff)?
> {1,2} {3,4} {5} seem exclusive, did I miss something? (other than that 5
> cpu parts are 'rare').

There's no overlap, so they're logically exclusive, but it avoids
needing the "cpu_exclusive" parameter. It always seemed confusing to
me that a setting on a child cgroup would strictly remove a resource
from the parent. (To be clear: I don't have any particularly strong
objection to cpu_exclusive. It just always seemed like a bit of a
hack that mostly duplicated what you could get by just setting the
cpusets appropriately throughout the hierarchy.)
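
To make the {1,2} {3,4} {5} layout concrete, here is a minimal sketch in Python. It only computes the writes a setup script would perform; it touches nothing. The mount point, the group names (rt_a, rt_b, other) and the helper name are hypothetical, and the file names assume a cgroup v1 cpuset hierarchy. Turning load balancing off at the root mirrors the load_balance=0 in the layout quoted further down; whether that is sufficient on a given kernel is an assumption.

```python
from pathlib import PurePosixPath

# Assumption: cgroup v1 cpuset hierarchy mounted here; adjust for your system.
CPUSET_ROOT = PurePosixPath("/sys/fs/cgroup/cpuset")


def plan_partitions(partitions):
    """Return the (path, value) writes for a set of disjoint cpusets.

    `partitions` maps a (hypothetical) cgroup name to a set of CPU numbers.
    Raises ValueError if any CPU appears in more than one group, i.e. if the
    groups are not "logically exclusive".
    """
    owner = {}
    # Disable load balancing at the root so the children can form
    # separate scheduling domains.
    writes = [(str(CPUSET_ROOT / "cpuset.sched_load_balance"), "0")]
    for name, cpus in partitions.items():
        for cpu in cpus:
            if cpu in owner:
                raise ValueError(
                    f"CPU {cpu} is in both {owner[cpu]!r} and {name!r}"
                )
            owner[cpu] = name
        cpulist = ",".join(str(c) for c in sorted(cpus))
        writes.append((str(CPUSET_ROOT / name / "cpuset.cpus"), cpulist))
    return writes


layout = {"rt_a": {1, 2}, "rt_b": {3, 4}, "other": {5}}
plan = plan_partitions(layout)
for path, value in plan:
    print(f"echo {value} > {path}")
```

The overlap check is the whole point: disjointness is verified from the cpus masks alone, with no per-group "exclusive" flag involved.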

>> > Note that related, but differently, we have the isolcpus boot parameter
>> > which creates single CPU partitions for all listed CPUs and gives the
>> > rest to the root cpuset. Ideally we'd kill this option given it's a boot
>> > time setting (for something which is trivial to do at runtime).
>> >
>> > But this cannot be done, because that would mean we'd have to start with
>> > a !0 cpuset layout:
>> >
>> >                 '/'
>> >           load_balance=0
>> >          /              \
>> >   'system'            'isolated'
>> >   cpus=~isolcpus      cpus=isolcpus
>> >                       load_balance=0
>> >
>> > And start with _everything_ in the /system group (including default IRQ
>> > affinities).
>> >
>> > Of course, that will break everything cgroup :-(
>> >
>> I would actually *much* prefer this over the status quo. I'm tired of
>> my crappy, partially-working script that sits there and creates
>> exactly this configuration (minus the isolcpus part because I actually
>> want migration to work) on boot. (Actually, it could have two
>> automatic cgroups: /kernel and /init -- init and UMH would go in init
>> and kernel threads and such would go in /kernel. Userspace would be
>> able to request that a different cgroup be used for newly-created
>> kernel threads.)
> So there's a problem with sticking kernel threads (and esp. kthreadd)
> into !root groups. For example if you place it in a cpuset that doesn't
> have all cpus, then binding your shiny new kthread to a cpu will fail.
> You can fix that of course, and we used to do exactly that, but we kept
> running into 'fun' cases like that.

Blech. But maybe this *should* have that effect. I'm sick of random
kernel crap being scheduled on my RT CPUs and on the CPUs that I
intend to be kept forcibly idle.
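
As an illustration only (this is not the boot script mentioned above), the 'system' / 'isolated' split in the quoted layout is just a set complement over the available CPUs. The sketch below is Python; the function names and the example CPU lists are made up, and the parser handles only plain "a-b,c" lists, not the stride forms the kernel's cpulist format also allows.

```python
def parse_cpulist(text):
    """Parse a simple CPU list such as "0-3,6" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare number like "6" is treated as the range 6-6.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def split_isolated(online, isolated_list):
    """Return (system_cpus, isolated_cpus) for the two child cpusets.

    `online` and `isolated_list` are CPU-list strings. CPUs named as
    isolated but not present in `online` are ignored.
    """
    all_cpus = parse_cpulist(online)
    isolated = parse_cpulist(isolated_list) & all_cpus
    return all_cpus - isolated, isolated


# Hypothetical 8-CPU machine with CPUs 2, 3 and 6 set aside.
system, isolated = split_isolated("0-7", "2-3,6")
print("system  :", sorted(system))
print("isolated:", sorted(isolated))
```

Everything, kernel threads included, would then have to start out confined to the 'system' set for the isolated CPUs to stay quiet.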

> The unbound workqueue stuff is totally arbitrary borkage though, that
> can be made to work just fine, TJ didn't like it for some reason which I
> really cannot remember.
> Also, UMH?

User mode helper. Fortunately most users are gone now, but it still exists.