Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting

From: Peter Zijlstra
Date: Tue Aug 24 2010 - 16:39:43 EST


On Tue, 2010-08-24 at 12:20 -0700, Venkatesh Pallipadi wrote:

> (long email alert)
> I have two different answers for why we ended up with this madness.
>
> My personal take on why we need this, and the actual path by which I
> ended up with this patchset.
>
> - Current /proc/stat hardirq and softirq time reporting is broken for
> most archs, as it does tick sampling. Hardirq time specifically is
> further broken due to interrupts being disabled during irq handling -
> http://kerneltrap.org/mailarchive/linux-kernel/2010/5/25/4574864
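
(For context: tick sampling boils down to crediting the whole tick to
whichever context the timer interrupt happens to land in. A toy model of
that classification -- written from memory, not the actual
account_system_time() -- shows why hardirq time loses out: the tick is
itself an interrupt, and hardirq handlers run with interrupts disabled,
so the sample practically never lands in hardirq context.)

	/*
	 * Toy model of tick-sampled time accounting, not kernel code:
	 * the whole tick is credited to whatever context it landed in.
	 */
	struct cpustat { unsigned long user, system, irq, softirq; };

	static void account_one_tick(struct cpustat *st, int user_mode,
				     int in_hardirq, int in_softirq)
	{
		if (user_mode)
			st->user++;
		else if (in_hardirq)		/* almost never sampled */
			st->irq++;
		else if (in_softirq)
			st->softirq++;
		else
			st->system++;
	}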

Yeah, architectures without a decent clock are a pain (x86 is still on
that list although nhm/wsm don't suck too bad), but it might be
worthwhile to look at which arch/$foo are strictly tick-based.

A quick look suggests:

alpha
arm (some)
avr32
cris (it could remove its implementation; it's identical
to the weak function provided by kernel/sched_clock.c, see
the sketch below)
frv (idem)
h8300
m32r
m68k* (except nommu-coldfire)
mips (except cavium-octeon)
parisc
score
sh
xtensa

which seems to mean too damn many; I bet we can't simply move those to
staging? :-)
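
For reference, the weak fallback in kernel/sched_clock.c that cris
duplicates is, from memory, roughly the below -- jiffies-based, so ~1/HZ
resolution, which is why these architectures can't measure individual
irq handlers at all:

	/* Reconstructed from memory; see kernel/sched_clock.c for the
	 * real thing.  Resolution is one jiffy. */
	unsigned long long __attribute__((weak)) sched_clock(void)
	{
		return (unsigned long long)(jiffies - INITIAL_JIFFIES)
						* (NSEC_PER_SEC / HZ);
	}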

> OK. Let's fix /proc/stat. But that doesn't seem enough. We should also
> not account this time to tasks themselves.

Right

> - I started looking at not accounting this time to tasks themselves.
> This was really tricky as things are tightly tied to scheduler
> vruntime to get it right.

I'm not exactly sure where that would get complicated; simply treat
interrupts the same as preemptions by other tasks and things should
work out fairly straightforwardly from that.
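
Roughly (a sketch of the idea only -- the names are made up, this is not
a reviewed patch): measure hard/soft irq time per cpu with a fine-grained
clock and discount it from the running task's delta_exec before it feeds
vruntime, so the task is only charged for time it actually ran:

	/* Hypothetical per-rq irq accounting state. */
	struct rq_irq_acct {
		unsigned long long irq_time;	/* ns spent in hard/soft irq */
		unsigned long long saved;	/* snapshot at last task update */
	};

	/* Turn a wallclock delta into the delta the task really ran. */
	static unsigned long long charged_delta(unsigned long long wall_delta,
						struct rq_irq_acct *acct)
	{
		unsigned long long irq_delta = acct->irq_time - acct->saved;

		acct->saved = acct->irq_time;
		if (irq_delta > wall_delta)	/* clamp: never go negative */
			irq_delta = wall_delta;
		return wall_delta - irq_delta;
	}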

> I am not even sure I got it totally right
> :(, but I did play with the patch a bit and noticed there were
> multiple issues. 1) A silly case of two tasks on one CPU, one
> task totally CPU bound and another task doing network recv. This is
> how task and softirq time look for this (10s samples):
>   (loop)          (nc)
>  task   si       task   si
>   503    9        502  301
>   502    8        502  303
>   502    9        501  302
>   502    8        502  302
>   503    9        501  302
> Now, when I did "not account si time to task", the loop task ended up
> getting a lot less CPU time and doing less work, as the nc task doing
> recv got more CPU share, which was not the right thing to do. IIRC, I had
> something like <300 centiseconds for loop after the change (with si
> activity increasing due to the higher runtime of the nc task).

Well, that actually makes sense and I wouldn't call it wrong.

> 2) Also, a minor problem of breaking the current userspace API:
> tasks/cgroup stats assume that irq times are included.

Is that actually specified or simply assumed because our implementation
always had that bug? I would really call not accounting irq time to
tasks a bug-fix.

> So, even though accounting irq time as "system time" seems
> the right thing to do, it can break scheduling in many ways. Maybe
> hardirq can be accounted as system time. But dealing with softirq is
> tricky, as it can be related to the task.

I haven't yet seen any scheduler breakage here; it will divide time
differently, but not in a broken way. If the system consumes 1/3rd of
the time, there's only 2/3rds left to fairly distribute between tasks, so
something like 1/3-loop, 1/3-nc, 1/3-softirq makes perfect sense.
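
Run the numbers from the table above: ~300cs of softirq per 1000cs
leaves ~700cs to split fairly, i.e. ~350cs per task as a first cut; and
since nc then runs more and generates more softirq, the loop settles a
bit below that -- which lines up with the <300cs you measured.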

You'd get exactly the same kind of thing if you replace (soft)irq with a
FIFO task.
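
E.g. something like the below (a hypothetical userspace experiment, not
anything from this thread): run it on the same cpu as the two loops
(taskset) and the CFS tasks will split whatever the FIFO hog leaves
over, exactly like the softirq case above.

	/* FIFO hog with a ~30% duty cycle, standing in for softirq load. */
	#include <sched.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	static void burn_ns(long ns)
	{
		struct timespec start, now;

		clock_gettime(CLOCK_MONOTONIC, &start);
		do {
			clock_gettime(CLOCK_MONOTONIC, &now);
		} while ((now.tv_sec - start.tv_sec) * 1000000000L +
			 (now.tv_nsec - start.tv_nsec) < ns);
	}

	int main(void)
	{
		struct sched_param sp = { .sched_priority = 1 };

		if (sched_setscheduler(0, SCHED_FIFO, &sp))	/* needs root */
			perror("sched_setscheduler");

		for (;;) {
			burn_ns(3 * 1000 * 1000);	/* 3ms busy ...   */
			usleep(7000);			/* ... 7ms asleep */
		}
	}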

The whole schizo softirq infrastructure (hardirq tails and tasks) is a
pain though; I would really love to rid the kernel of it, but I've got
no idea how to do that given that things like the whole network
subsystem are tightly woven into it.

> Figuring out si time and accounting it to the right task is a non-starter.
> There are so many different ways in which si can come into the picture;
> finding and accounting it to the right task will be almost impossible.

Agreed, hence:

> So, why not do the simple things first: do not disturb any existing
> scheduling decisions, account accurate hi and si times system-wide,
> per task, and per cgroup (with as little overhead as possible). Give this
> info to users and admin programs, and they can make higher-level
> sense of it.
>
> Having looked at both options, I feel having these exports is an
> immediate first step.

This is where I strongly disagree: providing an interface that cannot
possibly be implemented correctly, just so you can fudge something from
userspace (still not sure what), seems a very bad idea indeed.

