This is exactly what I noticed while doing the pidhash patch. The
scheduler is, I think, the only place in the kernel that traverses all
processes, apart from three cases (killall, exit while being ptraced,
and soon 'remove capabilities on all processes/set securelevel'). It is
thus the only place where performance depends on the number of
processes. However, I was not aware of any recent sched_yield() fixes
[it does not seem to have been fixed in 2.1.90], nor that the
recalculations become less frequent as the number of processes grows.
With 2000 processes, that makes for a recalculation less often than
once every 6 minutes. Isn't that _way_ too seldom? I have a feeling
that the recent sched_yield() "fix" might have some bad side-effects on
large systems.
If you look at ftp://ftp.guardian.no/pub/free/linux/pidhash.gif , I
have a green line (2nd from the top) that shows the slowest time for a
fork+exit+waitpid. It looks very linear, with only a few exceptions
[the three spikes could be some cache anomaly??], and to me it
seemed logical that these samples were plain bad luck with the
scheduler, which wanted to do a recalculation just then. If it happens
only once every 6 minutes at 2000 processes, I must have been wrong, or
the system had some
process calling sched_yield() at the time, making the recalculations
far more common. Note that the recalculation takes 1ms at 2000
processes.
I agree that real-life efficiency of the scheduler is most important, but
nevertheless, it would be nice to remove the last "unnecessary"
O(nr_tasks) from the kernel. [I _am_ taking for granted that someone
will figure out how to do recalculations in less than O(nr_tasks) time
;-)].
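For reference, a minimal user-space sketch of why the recalculation is
O(nr_tasks). It assumes the recalculation is the usual "halve the
remaining counter and add the priority" refill applied to every task;
struct task and recalculate() are hypothetical stand-ins for
illustration, not the actual kernel definitions.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the kernel's task structure. */
struct task {
    long counter;   /* remaining timeslice, in ticks */
    long priority;  /* static priority, in ticks */
};

/*
 * Refill the timeslice of every task (assumed refill rule:
 * counter = counter / 2 + priority). The loop touches each task
 * exactly once, so the cost grows linearly with nr_tasks.
 */
static void recalculate(struct task *tasks, size_t nr_tasks)
{
    for (size_t i = 0; i < nr_tasks; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}
```

With the figure quoted above (1ms at 2000 processes), this works out to
roughly 0.5us per task, so the cost itself is small; the issue is only
that it scales with the number of processes.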
astor
-- Alexander Kjeldaas, Guardian Networks AS, Trondheim, Norway http://www.guardian.no/