Re: [RFC] sched: CPU topology try

From: Arjan van de Ven
Date: Mon Jan 06 2014 - 12:13:36 EST



> AFAICT this is a chicken-egg problem, the OS never did anything useful
> with it so the hardware guys are now trying to do something with it, but
> this also means that if we cannot predict what the hardware will do
> under certain circumstances the OS really cannot do anything smart
> anymore.
>
> So yes, for certain hardware we'll just have to give up and not do
> anything.
>
> That said, some hardware still does allow us to do something, and for
> those we do need some of this.
>
> Maybe if the OS becomes smart enough the hardware guys will give us some
> control again, who knows.
>
> So yes, I'm entirely fine saying that some chips are fucked and we can't
> do anything sane with them.. Fine, they get to sort things out themselves.
>
> That is; you're entirely unhelpful and I'm tempted to stop listening
> to whatever you have to say on the subject.
>
> Most of your emails are about how stuff cannot possibly work; without
> saying how things can work.
>
> The entire point of adding P and C state information to the scheduler is
> so that we CAN do cross-cpu decisions, but if you're saying we shouldn't
> attempt because you can't say how the hardware will react anyway; fine,
> we'll ignore Intel hardware from now on.

that's not what I'm trying to say.

if we as OS want to help make such decisions, we also need to face the reality of what that means,
and see how we can get there.

let me give a simple but common example: a 2-core system where the cores share a P state.
one task (A) is high priority/high utilization/whatever
(e.g. it would cause the OS to ask for high performance from the CPU if running by itself)
the other task (B), on the 2nd core, is not that high priority/utilization/etc
(e.g. it would cause the OS to ask for max power savings from the CPU if running by itself)


time   core 0           core 1   what the combined probably should be
 0     task A           idle     max performance
 1     task A           task B   max performance
 2     idle (disk IO)   task B   least power
 3     task A           task B   max performance

e.g. a simple case of task A running, and task B coming in... but then task A blocks briefly,
on say disk IO or some mutex or whatever.

we as OS will need to figure out how to get to the combined result, in a way that's relatively race free,
with two common races to take care of:

* knowing if another core is idle at any time is inherently racy.. it may wake up or go idle the next cycle
* in hardware modes where the OS controls everything, the P state registers tend to work in a "the last one
  to write on any core controls them all" way; we need to make sure we don't fight ourselves here, and
  assign one core to do this decision/communication with the hardware on behalf of the whole domain
  (even if that assignment may move around when the assigned core goes idle), rather than the various
  cores doing it themselves asynchronously.
  This tends to be harder than it seems if you also don't want to lose efficiency (e.g. no significant
  extra wakeups from idle, and also not missing opportunities to go to "least power" in the "time 2"
  scenario above)


x86 and modern ARM (snapdragon at least) do this kind of coordination in hardware/a microcontroller (with an opt-in
for the OS to do it itself on x86, and likely on snapdragon), which means the race conditions are not really there.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/