Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy

From: Tejun Heo
Date: Wed Aug 02 2017 - 11:41:46 EST

Hello, Peter.

On Tue, Aug 01, 2017 at 11:40:38PM +0200, Peter Zijlstra wrote:
> > * On cgroup2, there is only one hierarchy. It'd be great to have
> > basic resource accounting enabled by default on all cgroups. Note
> > that we couldn't do that on v1 because there could be any number of
> > hierarchies and the cost would increase with the number of
> > hierarchies.
>
> Yes, the whole single hierarchy thing makes doing away with the double
> accounting possible.

Yeah, we can either do that or make it cheaper so that we can have
basic stats by default.

> > * It is bothersome that we're walking up the tree each time for
> > cpuacct although being percpu && just walking up the tree makes it
> > relatively cheap.
>
> So even if it's only CPU local accounting, you still have all the pointer
> chasing and misses, not to mention that a faster O(depth) is still
> O(depth).
>
> > Anyways, I'm thinking about shifting the
> > aggregation to the reader side so that the hot path always only
> > updates local counters in a way which can scale even when there are
> > a lot of (idle) cgroups. Will follow up on this later.
>
> Not entirely sure I follow, we currently only update the current cgroup
> and its immediate parents, no? Or are you looking to only account into
> the current cgroup and propagate into the parents on reading?

Yeah, shifting the cost to the readers and being smart with
propagation so that reading isn't O(nr_descendants) but
O(nr_descendants_which_have_run_since_last_read). That way, we can
show the basic stats without taxing the hot paths with reasonable
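
To make the reader-side idea a bit more concrete, here is a rough
single-threaded sketch. It is only an illustration under stated
assumptions, not the actual kernel code or a proposed patch: struct cg,
charge(), flush() and read_usage() are hypothetical names, and a real
version would keep the counters percpu and need locking. The hot path
bumps a local counter and links the cgroup into its ancestors' "updated"
lists only on the first update since the last read; a read then visits
only the linked descendants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one cgroup; not a kernel structure. */
struct cg {
	struct cg *parent;
	uint64_t local;			/* charged here, not yet folded into total */
	uint64_t total;			/* usage of this cgroup and flushed descendants */
	uint64_t pending;		/* flushed here, not yet taken by the parent */
	int dirty;			/* linked on parent's updated list? */
	struct cg *updated_children;	/* children updated since last flush */
	struct cg *updated_next;	/* sibling link on that list */
};

/* Hot path: O(1) unless this is the first update since the last read. */
static void charge(struct cg *cg, uint64_t delta)
{
	assert(cg != NULL);
	cg->local += delta;

	/* A dirty node implies all its ancestors are already linked. */
	for (; cg->parent && !cg->dirty; cg = cg->parent) {
		cg->dirty = 1;
		cg->updated_next = cg->parent->updated_children;
		cg->parent->updated_children = cg;
	}
}

/* Reader side: walk only descendants that were updated since last read. */
static void flush(struct cg *cg)
{
	uint64_t delta = cg->local;
	struct cg *child = cg->updated_children;

	cg->local = 0;
	cg->updated_children = NULL;

	while (child) {
		struct cg *next = child->updated_next;

		child->dirty = 0;
		child->updated_next = NULL;
		flush(child);
		delta += child->pending;
		child->pending = 0;
		child = next;
	}

	cg->total += delta;
	cg->pending += delta;	/* kept until the parent flushes us */
}

static uint64_t read_usage(struct cg *cg)
{
	flush(cg);
	return cg->total;
}
```

With this shape, a cgroup that never ran is never linked anywhere, so it
costs nothing on either the update or the read side. The pending field
is there so that reading a child directly doesn't lose the delta its
parent hasn't seen yet.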

I have a couple questions about cpuacct tho.

* The stat file is sampling based and the usage files are calculated
from actual scheduling events. Is this because the latter is more

* Why do we have user/sys breakdown in usage numbers? It tries to
distinguish user or sys by looking at task_pt_regs(). I can't see
how this would work (e.g. interrupt handlers never schedule) and w/o
kernel preemption, the sys part is always zero. What is this number
supposed to mean?