In summary, the occasional variance I've measured with my test code is
consistent with the variance I've measured using lat_ctx from lmbench,
once you take into account the FPU state save/restore overheads for
lmbench, which don't apply to the default case for my test.

Both benchmarks are still left with a 30% variance that I can't explain.
I suspect cache effects and possibly a lingering bug in lazy FPU state
save/restore, but at this point I don't have anything concrete.

Furthermore, the absolute times obtained with lmbench and my code are
similar (5.25 us for lmbench and 4.8 us for my code on a PPro 180).

The other major part of my benchmark (and the original driving goal)
is the cost of extra processes on the run queue. These lengthen
context switch times (or, put another way, increase wakeup latencies).
On a Pentium 100, thread switch times go from 2.8 us to 11.7 us when
10 processes are added to the run queue. Even just having an extra 3
processes would double the thread switch time.

I get even more pessimistic results using lmbench.
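
As a rough check on the Pentium 100 figures, here is a small sketch
that derives the cost per extra process from the two data points. It
assumes the switch time grows linearly with the number of extra
runnable processes, which is an assumption of this example and not
something I measured separately. The function names are made up for
illustration.

```c
/* Cost per extra runnable process in microseconds, ASSUMING a linear
 * model between the two measured points (hypothetical helper, not
 * part of the benchmark code). */
double per_process_cost_us(double base_us, double loaded_us, int extra)
{
    return (loaded_us - base_us) / extra;
}

/* Projected switch time with `extra` additional runnable processes,
 * under the same linear assumption. */
double projected_switch_us(double base_us, double cost_us, int extra)
{
    return base_us + cost_us * extra;
}
```

Calling per_process_cost_us(2.8, 11.7, 10) gives about 0.89 us per
extra process, and projected_switch_us(2.8, 0.89, 3) gives about
5.5 us, i.e. close to double the 2.8 us baseline.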

I maintain that a separate run queue for RT processes would:

- reduce latencies on shared machines (i.e. machines running both RT
  and user jobs)

- improve the determinism of RT scheduling latencies on machines with
  moderate to high user load

- simplify and clarify the scheduling code (currently RT processes
  require a number of special cases in the mainline code, and this
  complexity has contributed to bugs in the current (2.1.122) handling
  of RT processes)

and hence is a good idea.
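
To illustrate the first two points, here is a toy model (not the
2.1.x scheduler code) which assumes the scheduler walks a run queue
to choose the next task. The struct and function names are
hypothetical. It counts how many entries have to be examined to find
the best RT task with one shared queue versus a separate RT queue.

```c
#include <stddef.h>

/* Hypothetical, simplified task record; NOT the kernel's task_struct. */
struct task {
    int is_rt;        /* non-zero for an RT process */
    int rt_priority;  /* higher value wins among RT tasks */
};

/* Single shared queue: every runnable task is examined, so the cost
 * grows with the number of ordinary user processes as well. */
const struct task *pick_rt_shared(const struct task *q, size_t n,
                                  size_t *examined)
{
    const struct task *best = NULL;

    *examined = 0;
    for (size_t i = 0; i < n; i++) {
        (*examined)++;
        if (q[i].is_rt &&
            (best == NULL || q[i].rt_priority > best->rt_priority))
            best = &q[i];
    }
    return best;
}

/* Separate RT queue: only RT tasks are examined, so user load does
 * not affect the cost of choosing an RT task. */
const struct task *pick_rt_separate(const struct task *rtq, size_t n_rt,
                                    size_t *examined)
{
    const struct task *best = NULL;

    *examined = 0;
    for (size_t i = 0; i < n_rt; i++) {
        (*examined)++;
        if (best == NULL || rtq[i].rt_priority > best->rt_priority)
            best = &rtq[i];
    }
    return best;
}
```

With 1 RT task and 10 user processes runnable, the shared queue
examines 11 entries while the separate RT queue examines 1, and the
second number stays the same however many user processes are added.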
The WWW page contains more details for those who are interested.
Regards,
Richard....