Re: [PATCH] - Reduce overhead of calc_load

From: Nick Piggin
Date: Sat Mar 18 2006 - 00:36:30 EST


Andrew Morton wrote:
Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:

Andrew Morton wrote:

Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:

Is there a need? Do they (except calc_load) use multiple values at
the same time?


Don't know. It might happen in the future. And the additional cost is
practically zero.


Unless it happens to hit another cacheline (cachelines for all other
CPUs but our own will most likely be invalid on this cpu). In which
case the cost could double quite easily.



That would be an inefficient implementation. Let's not implement it
inefficiently.


Unconditionally adding up n fields in the runqueue versus 1 field?
It is inevitable that they will cross cacheline boundaries on some
CPU architectures and with some per-cpu implementations isn't it?


I think it might be better to leave it for the moment. If something comes
up we can always take a look at it then (it isn't particularly tricky code).


What we're seeing here is a proliferation of little functions, all of which
do the same thing, some of them in different ways.


Of course they should be made consistent where it makes sense.

Take a look at (for example) nr_iowait. We forget to spill the count out
of the departing CPU's runqueue and hence we have to sum it across all

I don't think a departing runqueue should have any iowaiters on it, should it?

possible CPUs and we don't bother accounting for the possibility of the sum
going negative because we happen to dink with the runqueue of a
now-possibly-downed CPU. It's inefficient and it's inconsistent and some
of it is, or will become incorrect. The other counters there probably have
various combinations of these problems but I can't be bothered checking
them all because they're all implemented differently.


Maybe (they're also used in different ways, and with different races to
be careful of so in some respects that is inevitable). But that doesn't
mean we should introduce this new get_sched_stats thing.

Better to do them all in the one place and do them all the same way. I'd
suggest a cacheline-aligned struct of local_t's which can be queried into a
struct of ulongs.


Now common scheduler operations have to access the runqueue cacheilnes
and this disjoint "stats" structure cacheline, so basic operations will
get slower. Not to mention that all but one are protected by the runqueue
lock, so local_t isn't needed for the rest of them.

That query should only look at online CPUs, which becomes rather necessary
if we're to allocate runqueues only for online CPUs (desirable - the thing
is huge).


Sure, those consistency and efficiency changes should be made now, with
the current structure.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com -
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/