Re: 2.6.13-rt6, ktimer subsystem

From: George Anzinger
Date: Thu Sep 15 2005 - 17:39:07 EST


Thomas Gleixner wrote:
On Wed, 2005-09-14 at 12:38 -0700, George Anzinger wrote:


Well, having spent a bit of time looking at the code it appears that a lot of the ideas we looked at and discarded (see high-res-timers-discourse@xxxxxxxxxxxxxxxxxxxxx) are in this. Shame it was all done with out reference or comment to that list, anyone on it or even the lkml.


Well, I'm considering to wear sackcloth and ashes. But this seems like
the pot calling the kettle back as I don't remember a single relevant
reference/comment from you on the UTIME/KURT mailing list after your
UTIME->HRT fork.

Hm...All the work has been on HRT mailing list. In the early days the UTIME/KURT folks were on the list, still may be for all I know.

Still, as much as sackcloth and ashes might become you or me, I would rather try to learn how to be more open about what we are doing so that what results is the best possible for the linux community.


A few of the top issues:

time in nanoseconds 64-bits, requires a divide to do much of anything with it. Divides are slow and should be avoided if possible.


The divides are rare and definitely not in the hot paths. I'm sure that
they can be replaced by some intelligent scaled math algorithm if it
turns out to be necessary. The hot path instructions are simple
add/sub/cmp which are less/equal expensive on a 32bit machine to an
operation on struct timespec or an jiffies/arch_cycles pair. The non
nsec based implementation gives a burden to 64bit machines and is
provable wrong in the aspect of summing rounding errors of interval
timers.

With out arguing with your "provable wrong", I will admit that the 64-bit LOOKS good and easy. At the same time I think we can abstract all the required actions on the "nstime" thing and set it up so we can use either a timespec or a s64 depending on the machine (or some such). I have such a patch nearly ready...


~

The rbtree is a high overhead tree. I suspect performance problems here.


1. rbtree is available out of the box

True.

2. rbtree is proven to be efficient - at least there are a couple of
performance relevant users relying on it e.g. mm, ext3

3. I did insertion/removal tests with 10k entries (<2us on a 1GHz box in
user space). This is way below the experienced and reproducible trouble
of recascading. The penalty is completely on the thread which owns the
timer for non signal related functions. For signal related functions
only the removal on expiry is a penalty for the complete system (in the
softirq)

Well, I never was a cascade fan.

In the early HRT patches the whole timer list was replaced with a hashed list. It was O(N/M) on insertion where we could easily choose M (for a while it was even a config option). Removal was just an unlink, same as the cascade list.

To be clear on my take on this, as I understand it the rblist is something like O(N*somelog 2). What is left out here is the fixed overhead of F which is there even if N = 1. So we have something like (F+O(f(N)) for a list. For the most part we don't look at F, but as list complexity grows, it gets larger thus pushing out the cross over point to a higher "N" when comparing two lists. I considered the rbtree when doing the secondary list for HRT and concluded that "N" was small enough that a simple O(N/2) list would do just fine as it would only contain timers set to expire in the next jiffie.

I don't think an O(N/2) list is right for all timers (as in the ktimers), but I still wonder if there isn't something better than and rblist. Note, "wonder". I don't have such a list structure in hand.

The cascading is a penalty for the complete system all the time.

Performance is a straw man argument here. You know very well that > 90%
of the timers are inaccurate "timeout" timers related to I/O,
networking, devices. Most of those never expire (the positive feedback
removes the timer before expiry) and those timers have no constraint to
be accurate, except for the fact that they have to detect an
device/network problem at some time. In this case it is completely
irrelevant whether the timeout occurs n msecs earlier or later.

I agree, but it not accuracy that I am arguing, but cpu cycles. Those we use in the kernel are not available for the user.


If it is the right answer here, then why not use it for normal timers? A list of timers is a rather unique thing and, I think, deserves a management structure that accounts for the fact that the elements in the tree are perishable.


The first goal was to separate out the timers from the timeout API and I
believe that this separation is necessary.

The implementation of ktimers is not at all restricted to the timers we
addressed for now, it can also be utilized for the timeout API without
much effort.

The performance improvement is enormous despite the alleged 64bit math
overhead.

Testcase on a 600MHz CeleronM, 512MB RAM:

16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)
base interval = 1000 us, offset per task = 50 us tracing each latency value to disk

16 cyclic SCHED_OTHER tasks using clock_nanosleep(ABSTIME)
base interval = 1000 us, offset per task = 500 us
tracing each latency value to disk

while true; do hackbench 50; done
cat /dev/hda | nc atlas 4711

This injects a (insane) load avg of >600 permantely

2.6.13-rt4 (cascade based implementation)
16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)

task loops min max average sigma prio
0 999999 5 869 19 20 80
1 999999 9 883 18 22 79
2 999999 9 927 19 23 78
3 999999 5 908 21 28 77
4 999999 0 1056 22 33 76
5 999999 0 973 23 33 75
6 999999 0 926 23 33 74
7 999999 1 893 24 33 73
8 999999 2 942 23 34 72
9 999999 1 868 24 34 71
10 999999 0 912 23 34 70
11 999999 0 911 28 46 69
12 999999 0 912 28 46 68
13 999999 9 967 28 46 67
14 999999 9 954 28 46 66
15 999999 9 946 28 46 65


2.6.13-rt11 (ktimers based implementation)
16 cyclic SCHED_FIFO tasks using clock_nanosleep(ABSTIME)

task loops min max average sigma prio
0 999999 9 76 20 4 80
1 999999 8 84 20 4 79
2 999999 8 98 21 4 78
3 999999 9 121 20 4 77
4 999999 8 124 20 4 76
5 999999 9 140 20 4 75
6 999999 10 103 22 5 74
7 999999 9 99 21 5 73
8 999999 8 95 21 5 72
9 999999 9 148 21 5 71
10 999999 9 141 22 6 70
11 999999 9 143 22 5 69
12 999999 8 129 20 4 68
13 999999 9 149 21 5 67
14 999999 9 135 21 5 66
15 999999 8 230 22 5 65

I confess I don't understand the above numbers. What are min and max and in what units? Are you saying the large max numbers are caused by the cascade?



It appears that the "monotonic_clock" is being used to drive ktimers. The "monotonic_clock" was NEVER meant to poke outside of the kernel. It is a raw kernel clock that is only required to be monotonic with nothing said about accuracy. It should NOT be confused with CLOCK_MONOTONIC which is directly tied to xtime and therefor is ntp corrected.


I'm well aware of that and this is a completely different playground.

The ktimers implementation is _independent_ of the clock sources. The
current clock source implementation may suck, but thats not a problem of
ktimers at all. Its a problem of arch/XXX/timeYY and kernel/time.c. I
did not spend too much time on that as John Stultz is working on the
timesource and timeofday stuff and I really hope that this gets
accepted.

The relation ship of clock sources and their accuracy is also a
worthwhile field of discussion.


These are only the concerns I have from having a rather quick look at the code. I am sure that there are other issues...


It would be hubristic to deny that :)

OTH,

- The posix timer tests run all successful, except the broken 2timertest
which fails on any other HRT kernel too and the sleep to long for real
timers when the clock is set backwards, which is easily solvable
(working on that).

Your mileage seems to differ from mine. Here is what I get from ./do_test:
The following tests failed:
clock_nanosleeptest
abs_timer_test
4-1
clock_settimetest
clock_gettimetest2
2timer_test



Then, on the second run, it crashed in an attempt to get the monotonic clock (a divide error). System is a dual PIII, 800Mhz. This from the rt11 patch.

- The performance improvements in combination with simpler code is an
argument of itself

--
George Anzinger george@xxxxxxxxxx
HRT (High-res-timers): http://sourceforge.net/projects/high-res-timers/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/