[PATCH 0/8] sched_domain balancing via softirq V4
From: Christoph Lameter
Date: Tue Nov 14 2006 - 15:34:48 EST
This patchset moves the relatively expensive load balancing out of the
scheduler tick (where we run with interrupts disabled) into a softirq that
scheduler_tick() raises when necessary. Load balancing then runs with
interrupts enabled, which first of all reduces interrupt holdoff times.
Moving the load balancing into a softirq allows some cleanup in
scheduler_tick(). It becomes easier to read, and the determination of the
state for load balancing moves out of scheduler_tick(). Load balancing is
decoupled from scheduler_tick() and is only triggered on demand via the
softirq. On a dual core processor (SMP) system load balancing is triggered
in less than 30% of all ticks.
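
To make the on-demand trigger concrete, here is a minimal userspace sketch of
the idea, not the code from the patches. All names (fake_rq,
fake_scheduler_tick, fake_run_softirq, simulate) are hypothetical, and the
softirq is modelled as a plain pending flag. The tick path only does a cheap
due-check, and the expensive work happens in a separate handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-runqueue state; not the patch's real layout. */
struct fake_rq {
    unsigned long next_balance;  /* tick at which balancing is next due */
    unsigned long interval;      /* ticks between balancing passes */
    bool softirq_pending;        /* models a raised softirq */
    unsigned long balance_runs;  /* number of balancing passes performed */
};

/*
 * Wrap-safe "a is at or after b", in the spirit of the kernel's
 * time_after_eq(). Relies on the usual two's complement conversion
 * from unsigned long to long.
 */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Tick path: cheap check only; the expensive work is deferred. */
static void fake_scheduler_tick(struct fake_rq *rq, unsigned long now)
{
    if (tick_after_eq(now, rq->next_balance))
        rq->softirq_pending = true;
}

/* Softirq path: in the real kernel this would run with interrupts enabled. */
static void fake_run_softirq(struct fake_rq *rq, unsigned long now)
{
    if (!rq->softirq_pending)
        return;
    rq->softirq_pending = false;
    rq->balance_runs++;                     /* placeholder for the real scan */
    rq->next_balance = now + rq->interval;  /* schedule the next pass */
}

/* Drive `ticks` consecutive ticks; return how many of them ran balancing. */
static unsigned long simulate(struct fake_rq *rq, unsigned long start,
                              unsigned long ticks)
{
    unsigned long before = rq->balance_runs;

    assert(rq->interval > 0);
    for (unsigned long i = 0; i < ticks; i++) {
        unsigned long now = start + i;

        fake_scheduler_tick(rq, now);
        fake_run_softirq(rq, now);
    }
    return rq->balance_runs - before;
}
```

With an interval of 4 ticks, one tick in four raises the pending flag in this
model. The interval is an illustrative number and is not meant to reproduce
the measurement quoted above.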
The timer ticks are already staggered by arch initialization. It is not
necessary to stagger load balancing if the load balancing takes a reasonably
small time since it is part of the timer tick processing. Lower sched domains
generally fall into that category. We remove the staggering from
the scheduler.
We add a spinlock for the higher sched_domains that may require longer
scan times. A new flag SD_SERIALIZE can be set for a sched domain. We then
ensure that only one balancing pass at a time runs on the whole machine
for the sched domains that have SD_SERIALIZE set. This guarantees exclusion
even if balancing runs for a long time. The staggering was not able
to make this guarantee.
The serialization ensures that we do not run into issues where multiple
processors load balance at the same time and then attempt to pull
processes from the same remote processor. It also limits the load that
can be generated by load balancing on large and very large systems.
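
A rough userspace sketch of this kind of serialization follows. It assumes a
try-lock that skips the pass when another processor already holds the lock;
the text above only states that a spinlock is added, so that choice is an
assumption. The names fake_domain, fake_balance_domain, fake_trylock and
fake_unlock and the value of FAKE_SD_SERIALIZE are hypothetical, and a C11
atomic_flag stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FAKE_SD_SERIALIZE 0x1  /* hypothetical flag value, illustration only */

struct fake_domain {
    unsigned int flags;
    unsigned long balanced;  /* completed balancing passes */
    unsigned long skipped;   /* passes skipped because the lock was taken */
};

/* One machine-wide lock shared by all serialized domains. */
static atomic_flag fake_balancing_lock = ATOMIC_FLAG_INIT;

/* Returns true if the lock was acquired, false if someone else holds it. */
static bool fake_trylock(void)
{
    return !atomic_flag_test_and_set_explicit(&fake_balancing_lock,
                                              memory_order_acquire);
}

static void fake_unlock(void)
{
    atomic_flag_clear_explicit(&fake_balancing_lock, memory_order_release);
}

/*
 * Returns true if this call performed a balancing pass. The counters are
 * plain fields, so each fake_domain is assumed to be updated by one caller
 * at a time in this sketch.
 */
static bool fake_balance_domain(struct fake_domain *sd)
{
    bool serialize = (sd->flags & FAKE_SD_SERIALIZE) != 0;

    if (serialize && !fake_trylock()) {
        sd->skipped++;   /* another processor is balancing; try again later */
        return false;
    }

    sd->balanced++;      /* placeholder for the expensive domain scan */

    if (serialize)
        fake_unlock();
    return true;
}
```

Domains without the flag never touch the lock in this model, so only the
domains marked for serialization pay for the exclusion.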
There are some other ideas around on how to optimize scheduler
performance for high processor counts (like Suresh's approach
to only load balance for a single processor in a domain and Ken's idea
of rewriting the scheduler load balancing to be more flexible) but
none of those are ready for prime time yet. These approaches could
replace serialization in the future.
The serialization for the NUMA scheduling alone also means that
the number of times that scheduling has to be deferred drops significantly
and will only occur in case of large scale NUMA balancing.
Load balancing on a particular node is not that critical (especially
with Suresh's latest patch that places all sched_groups on the node) since
accesses are node local and generally do not require transactions on the
NUMA interlink.
Tested on
UP: x86_64
SMP: i386 dual core Pentium 940
NUMA: Altix 8p, 256p
For the earlier discussion see:
RFC: http://marc.theaimsgroup.com/?t=116119187800002&r=1&w=2
V1: http://marc.theaimsgroup.com/?l=linux-kernel&m=116171494001548&w=2
V2: http://marc.theaimsgroup.com/?l=linux-kernel&m=116200355408187&w=2
V3: http://marc.theaimsgroup.com/?l=linux-kernel&m=116258708323481&w=2
http://marc.theaimsgroup.com/?t=116259165600007&r=1&w=2
V1->V2
- Keep last_balance and calculate the next balancing from that start
point.
- Move more code into time_slice calculation and rename time_slice()
to task_running_tick().
- Separate out the wake_priority_sleeper optimization as a first patch.
V2->V3
- Rediff against 2.6.19-rc4-mm2
- Remove useless check for rq->idle in rebalance_domains()
V3->V4
- Use softirq instead of a tasklet
- Remove load staggering.
- Add lock to run some sched domains single threaded.
- Use jiffy comparison functions