Re: [PATCH 0/9] Per-cgroup /proc/stat

From: Glauber Costa
Date: Tue Sep 20 2011 - 17:38:37 EST


On 09/19/2011 08:07 PM, Paul Turner wrote:
On 09/15/11 01:56, Peter Zijlstra wrote:
On Wed, 2011-09-14 at 13:23 -0700, Andi Kleen wrote:
Peter Zijlstra<a.p.zijlstra@xxxxxxxxx> writes:

Guys we should seriously trim back a lot of that code, not grow ever
more and more. The sad fact is that if you build a kernel with
cpu-cgroup support the context switch cost is more than double that
of a kernel without, and then you haven't even started creating
cgroups yet.

That sounds indeed quite bad. Is it known why it is so costly?

Mostly because all data structures grow and all code paths grow, some by
quite a bit; it's spread all over the place, lots of little cuts, etc.

pjt and I tried trimming some of the code paths with static_branch() but
didn't really get anywhere.. need to get back to looking at this stuff
sometime soon.

When I get some time I think I'm just going to post a patch[*] that
merges the useful _field_ (usage, usage_percpu) from cpuacct into cpu
since we are *already* doing the accounting on the entity level making
this addition free.
agree.

At that point we could !CONFIG_CGROUP_CPUACCT by default and deprecate
the beast without breaking ABI for those who really need it (either
because their applications have hard-coded paths or because they really
like cgroup user/sys time -- which we COULD duplicate into cpu but I'm
inclined not to).

Well, why? Now that I look into it, one of the nice ways to achieve
what I am proposing in this patchset is:
1) get rid of cpuacct.
2) do all accounting per-cpu cgroup, and then merge it to fs/proc/stat.c
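
To make the two steps above concrete, here is a minimal userspace sketch,
not kernel code. It assumes that "per-cpu cgroup" means "per group of the
cpu controller", that each group keeps one counter per CPU and per stat
field, and that a /proc/stat-style reader sums them on demand. The names
group_stat, group_charge, group_stat_sum and the fixed NR_CPUS are all
hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 4   /* hypothetical fixed CPU count for this sketch */

/* Hypothetical subset of the fields a /proc/stat line reports. */
enum stat_field { STAT_USER, STAT_SYSTEM, STAT_IDLE, NR_STAT_FIELDS };

/* One instance per group; counters are split per CPU so that the
 * accounting path only ever touches the slot of the CPU it runs on. */
struct group_stat {
    struct group_stat *parent;                  /* NULL for the root group */
    uint64_t percpu[NR_CPUS][NR_STAT_FIELDS];
};

/* Charge delta to the group and to every ancestor, so that a parent's
 * numbers include those of its children. */
static void group_charge(struct group_stat *g, int cpu,
                         enum stat_field f, uint64_t delta)
{
    for (; g != NULL; g = g->parent)
        g->percpu[cpu][f] += delta;
}

/* What a per-group /proc/stat reader would do: fold the per-CPU
 * slots into a single total for one field. */
static uint64_t group_stat_sum(const struct group_stat *g, enum stat_field f)
{
    uint64_t sum = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += g->percpu[cpu][f];
    return sum;
}
```

In this sketch the root group's totals are what a global /proc/stat would
show, and a child's totals are what a per-cgroup view would show. Whether
the charge should walk the hierarchy at accounting time or be summed at
read time is a design choice that the sketch does not settle.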

[*]: the only real caveat is how loudly people scream about the code
duplication; I think it's worth it if it lets us kill cpuacct in the
long run.

One way to deprecate it is probably to disallow writing any tasks to
cpuacct's tasks file. We then expose whatever information there is in cpu/.

It may get ugly since we'll need to touch core cgroup code, but it is nice from a user PoV.

Another unrelated optimization on this path I have sitting around in
patches/ to push at some point is keeping the left-most entity out of
the tree: the worst case is that an entity with a lower vruntime comes
along and we insert the previous left-most, and the best case is that we
get to pick it without futzing with the rb-tree. I think this was good
for a percent or two when I hacked it together before.
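
A rough userspace sketch of that idea follows; it is not the patch being
referred to. A sorted singly linked list stands in for the rb-tree, and
the names entity, runqueue, enqueue, pick_next and the tree_ops counter
are hypothetical, added only to show when the tree is touched.

```c
#include <stddef.h>
#include <stdint.h>

struct entity {
    uint64_t vruntime;
    struct entity *next;      /* link in the sorted list (rb-tree stand-in) */
};

struct runqueue {
    struct entity *leftmost;  /* smallest vruntime, held outside the tree */
    struct entity *tree;      /* remaining entities, ascending vruntime */
    unsigned long tree_ops;   /* number of tree inserts and removals */
};

static void tree_insert(struct runqueue *rq, struct entity *e)
{
    struct entity **link = &rq->tree;

    while (*link != NULL && (*link)->vruntime <= e->vruntime)
        link = &(*link)->next;
    e->next = *link;
    *link = e;
    rq->tree_ops++;
}

static struct entity *tree_pop_first(struct runqueue *rq)
{
    struct entity *e = rq->tree;

    if (e != NULL) {
        rq->tree = e->next;
        e->next = NULL;
        rq->tree_ops++;
    }
    return e;
}

static void enqueue(struct runqueue *rq, struct entity *e)
{
    e->next = NULL;
    if (rq->leftmost != NULL) {
        if (e->vruntime < rq->leftmost->vruntime) {
            /* Worst case: the old left-most goes into the tree. */
            tree_insert(rq, rq->leftmost);
            rq->leftmost = e;
        } else {
            tree_insert(rq, e);
        }
        return;
    }
    /* Empty slot: only take it if e really is the minimum. */
    if (rq->tree == NULL || e->vruntime < rq->tree->vruntime)
        rq->leftmost = e;
    else
        tree_insert(rq, e);
}

static struct entity *pick_next(struct runqueue *rq)
{
    struct entity *e = rq->leftmost;

    if (e != NULL) {
        /* Best case: no tree operation at all. */
        rq->leftmost = NULL;
        return e;
    }
    return tree_pop_first(rq);
}
```

With a single runnable entity, enqueue followed by pick_next never touches
the tree in this sketch, which is the best case described above.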

Another idea I have kicking around for this path is the introduction of
a link_entity which bridges over nr_running=1 chains (break it
opportunistically when an element in the chain goes to nr_running=2).
This one requires some pretty careful accounting around the breaking of
a chain though so I'm not touching it until I get the new load tracking
code out. (Incidentally, when I benchmarked it before LPC I had it
working out to be a little more efficient than the current math, good
for ~2-3% on pipe_test.)

- Paul
