Re: [PATCH 2/2] sched: Implement interface for cgroup unified hierarchy
From: Peter Zijlstra
Date: Wed Aug 02 2017 - 12:05:35 EST
On Wed, Aug 02, 2017 at 08:41:35AM -0700, Tejun Heo wrote:
> > Not entirely sure I follow, we currently only update the current cgroup
> > and its immediate parents, no? Or are you looking to only account into
> > the current cgroup and propagate into the parents on reading?
>
> Yeah, shifting the cost to the readers and being smart with
> propagation so that reading isn't O(nr_descendants) but
> O(nr_descendants_which_have_run_since_last_read). That way, we can
> show the basic stats without taxing the hot paths with reasonable
> scalability.
Right, that would be good.
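
As a rough illustration of the reader-side propagation described above, here is a minimal userspace sketch. It is not the kernel implementation: struct grp, grp_charge(), grp_flush() and grp_read() are hypothetical names, locking and per-cpu state are left out, and the flush is recursive only for brevity. The hot path touches only the running group plus whatever ancestors are not yet queued; a read walks only the groups that have run since they were last flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical group node; none of these names are kernel APIs. */
struct grp {
	struct grp *parent;
	struct grp *updated_children;	/* children that ran since their last flush */
	struct grp *updated_next;	/* link in the parent's updated list */
	int on_list;			/* already queued on the parent's list? */
	uint64_t pending;		/* ns charged here, not yet folded into total */
	uint64_t total;			/* subtree ns as of the last flush */
	uint64_t reported;		/* part of total already added to the parent */
};

/* Hot path: charge one group and queue it for a later flush. */
void grp_charge(struct grp *g, uint64_t delta_ns)
{
	g->pending += delta_ns;
	/* If a group is queued, its ancestors are queued too, so stop there. */
	for (; g->parent && !g->on_list; g = g->parent) {
		g->on_list = 1;
		g->updated_next = g->parent->updated_children;
		g->parent->updated_children = g;
	}
}

/* Reader path: fold in only the descendants queued since the last flush. */
void grp_flush(struct grp *g)
{
	struct grp *c = g->updated_children;

	g->updated_children = NULL;
	g->total += g->pending;
	g->pending = 0;

	while (c) {
		struct grp *next = c->updated_next;

		assert(c->parent == g);
		c->updated_next = NULL;
		c->on_list = 0;
		grp_flush(c);
		/* Add only what this child has not reported to us before. */
		g->total += c->total - c->reported;
		c->reported = c->total;
		c = next;
	}
}

uint64_t grp_read(struct grp *g)
{
	grp_flush(g);
	return g->total;
}
```

Keeping a per-child "reported" count lets a read in the middle of the hierarchy flush its own subtree without losing the delta that its ancestors still need to pick up later.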
> I have a couple questions about cpuacct tho.
>
> * The stat file is sampling based and the usage files are calculated
> from actual scheduling events. Is this because the latter is more
> accurate?
So I actually don't know the history of this stuff too well. But I would
think so. This all looks rather dodgy.
> * Why do we have user/sys breakdown in usage numbers? It tries to
> distinguish user or sys by looking at task_pt_regs(). I can't see
> how this would work (e.g. interrupt handlers never schedule) and w/o
> kernel preemption, the sys part is always zero. What is this number
> supposed to mean?
For normal scheduler stuff we account the total runtime in ns and use
the user/kernel tick samples to divide it into user/kernel time parts.
See cputime_adjust().
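
The idea of dividing a precise runtime by the tick-sample ratio can be sketched as below. This is a simplified illustration, not the kernel's cputime_adjust(): struct split and split_runtime() are hypothetical names, the monotonicity handling of the real code is omitted, and the plain 64-bit multiply is an assumption that only holds for small inputs (real code needs a wider intermediate).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct split {
	uint64_t utime_ns;
	uint64_t stime_ns;
};

/*
 * Split rtime_ns (precise runtime from scheduling events) into user and
 * system parts in proportion to the sampled user/system tick counts.
 */
struct split split_runtime(uint64_t rtime_ns, uint64_t utime_ticks,
			   uint64_t stime_ticks)
{
	struct split out = { 0, 0 };

	if (stime_ticks == 0) {
		/* No system samples (or no samples at all): call it all user. */
		out.utime_ns = rtime_ns;
		return out;
	}
	if (utime_ticks == 0) {
		/* Only system samples were seen. */
		out.stime_ns = rtime_ns;
		return out;
	}

	/* Sketch-only guard: the product below must fit in 64 bits. */
	assert(rtime_ns <= UINT64_MAX / stime_ticks);

	out.stime_ns = rtime_ns * stime_ticks / (utime_ticks + stime_ticks);
	/* Give the remainder to user so the two parts sum to rtime_ns. */
	out.utime_ns = rtime_ns - out.stime_ns;
	return out;
}
```

For example, 1000 ns of runtime with 3 user ticks and 1 system tick splits into 750 ns user and 250 ns system.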
But looking at cpuacct I have no clue; that looks wonky at best.
Ideally we'd reuse the normal cputime code and do the same thing
per-cgroup, but clearly that isn't happening now.
I never really looked further than noticing that cpuacct_charge() does
_another_ cgroup iteration, even though we already account that delta to
each cgroup (modulo scheduling class crud).