On 03.05.2005 [16:15:04 -0700], George Anzinger wrote:
john stultz wrote:
On Tue, 2005-05-03 at 14:36 -0700, George Anzinger wrote:
At the moment I am trying to understand and recast the work John Stultz is doing. I have some very real concerns here and there is a patch by nacc@xxxxxxxxxx<Nishanth Aravamudan> that, to me, looks like a royal pain. I think we need to understand it better and voice our concerns in a way that is constructive. Nish's patch is here: http://www.ussg.iu.edu/hypermail/linux/kernel/0504.3/1635.html
I'm not on the HRT list, so I didn't get the original e-mail...keep me
on the CC, please.
Hey George, Hopefully our work isn't causing you too much trouble. I do understand
the pain of having large conflicting works that try to achieve similar
goals.
To summarize my concerns, the HRT code depends on correct and exact registration of the timer interrupt with time of day. (We don't worry about latency here.) In the latest x86 HRT patch we fine tune the system clock to insure this alignment. This allows us to set up two stage timers that use the run_timers code for the coarse time (to the time base resolution) followed by a short high res hardware timer to get to the higher resolution we want. If the coarse timer (i.e. add_timer, run_timer and friends) do not register correctly, we will have the case of a coarse timer expiring after the needed and requested time. Now,
from a coarse timer point of view, that may be fine, but it means we can
not depend on it for the high res first stage.
The conceptual idea Nish is going after, is once we have a solid method
to get a high-res timeofday (via the new monotonic_clock() function), we
can use that instead of jiffies to expire soft-timers.
This requires re-defining each timer-bucket entry length from a jiffy
instead to a unit of time (Nish calls this a timerinterval unit). This
timer-interval unit's length is flexible can be defined at compile time.
Currently Nish is using ~1ms (1<<20 nanoseconds) as the interval length,
but it could be redefined to be as desired to give higher-res soft-
timers.
The issue is not its length, but can we interrupt at its end. The key to high res is being on time. If we want a 100usec resolution timer, we need to have the low res timer hit its mark with something real close to this.
Are you asking: Can we interrupt at the end of the timerinterval? That
doesn't matter to me, so much, since I made the timerinerval length
independent of the interrupt source. If you wanted to make sure they
corresponded completely, you simply would need to change the
nsecs_to_timerintervals() function appropriately. I was trying to stay
somewhat efficient, especially since we're dealing with a 64-bit
quantity, and thus used the shifting John mentioned above.
So, currently, my code does not provide for high-res timers,
admittedly. It does enable some of the infrastructure that should make
implementing high-res timing in mainline (turned on by tuning
TIMERINTERVAL_BITS or setting a .config option, perhaps) more feasible,
in my eyes. I think we have the same goal in mind -- make timing
flexible and useful.
To get 100 usec resolution with my system, you would need to make
TIMERINTERVAL_BITS=16. Basically, this makes a timerinterval = 1
nanosecond << 16, or 131.07 (131) microseconds. (Of course, you could
not use shifting and just divide by 10. Might be slower, but is
certainly feasible.) Admittedly, this is not perfect, but it's close.
Now, this has nothing to do with the hardware. But, it defines the
lowest interval, *regardless* of hardware, which the system will
support. If you request a 100 microsecond delay, we will make it 131
microseconds.
Now, the trick is then, to get the hardware to tick every 100
microseconds. You have a lot more expertise in this area, and I would be
interested to see what the actual options are. Presuming we are able to
get a periodic interrupt at this rate, we could then do what I think HRT
pretty much does, maintain the normal buckets for normal timers and then
have a special HRT bucket. We could then be aware of when, in
nanoseconds, the next HRT interrupt should be, set the hardware
appropriately and be ok. If there are no HRT timers, then we'll never
check that bucket, really (we could also just not configure in that
option). I think it's feasible, regardless. There may be performance
penalties, I agree. I'm working on figuring out how bad things get if we
set TIMERINTERVAL_BITS=10, for instance (about every microsecond). If
the system stays up, I'll be happy :)
Now when the new code expires timers it does so against the timeofday
subsystem's notion of time instead of jiffies. It simply goes through
all the timer bucket entries between now and the last time we expired
timers.
This is not unlike what happens now. I would hope that the number of buckets visited averages to to something real close to 1 per run_timers entry.
I agree, this is what we might be able to achieve by keeping high-res
timers in their own bucket. That way, we could conceivably know when the
next periodic interrupt is and when the next HRT interrupt is in
absolute nanoseconds (based on do_monotonic_clock()).
Now an interesting point you bring up above is when to schedule timer
interrupts. One could just have a fine-grained timer-interval unit and
crank up HZ, but clearly we don't want to schedule interrupts to go off
too frequently or the overhead won't be worth it. Instead to get high-
res timers, the idea is to look at the timer list and schedule
interrupts appropriately. Then we only take an interrupt when there is
work to do, rather then at regular periodic intervals when there may or
may not be anything to expire.
This would suggest that all timers are high res. I don't think this makes sense:
1) because most users just don't care that much about resolution and high-res carries some overhead.
2) Not all platform hardware is able to handle high res.
I agree, this is an issue which my TIMERINTERVAL_BITS=10 test should
also reveal. But under the normal case where TIMERINTERVAL_BITS are 20,
then there will never be an interrupt scheduled to go off quicker than
2^20 nanoseconds in the future (about a millisecond). Everything hinges
around these bits :) (mostly because they define how we convert
nanoseconds to timerintervals).
I think high res timers should only be used when the user asks for them. This keeps the overhead under control.
I agree -- the user asking for the, currently, would mean they set
TIMERINTERVAL_BITS lower than normal and also set up the hardware to
interrupt appropriately.
Now from my understanding of your framework, this sounds similar to how
your high-res hardware timer interrupt is used. The idea being that we
use that for all soft-timers, instead of having two levels of high-res
and coarse soft-timers.
The interesting problem that Darren will be looking at is how to
schedule interrupts accurately. This appears to be something you've
already attacked, so your experience would be helpful. My idea being
that the timer_interrupt code checks when it intended to fire and when
it actually did fire, and that information could then be used to tune
the interrupt scheduling code.
A small, measurable latency is ok and need not be backed out by the software. If you go this route you risk expiring a timer early and the standard says bad things about this.
I'm not sure what you mean about timers going off early. We round up on
addition and round down on expiration to make sure that we don't get
timers early. As long as do_monotonic_clock() is accurate, we should be
ok. (Darren's part uses the timerintervals value to figure out when the
next interrupt should be, similar to the existing
next_timer_interrupt())
What is missing in this is that the flag ship arch (x86) has piss poor capability to schedule timer interrupts. I.e. there really is no commonly available timer to generate interrupts at some arbitrary time. There is, however, the PIT, which, if you don't mess with it, is very low overhead but periodic.
And generally, we are going to stay periodic, for now. Darren's patch is
a further extension on mine, which requires a lot of testing.
Then there is the issue of what the time standard is. In the x86, the PIT is it. The pm_timer is close, but the read problem adds so much overhead as to make it hard to justify. Possibly the HPET does a better job, but it is still an I/O access (read SLOW) and is not commonly available as yet.
You also have to do something reasonable for the accounting subsystems. Currently this is done by sampling the system at a periodic rate. What ever is running gets charged with the last period. If you go non-periodic, either you have to charge a variable period when the sample is taken or you have to set up timers as part of context switching. This ladder is not wise as the context switch overhead then gets larger. It is rather easy to show that the accounting overhead in such a system with a modest load is higher than in a periodic sampled system. This system is also charged with doing the time slicing. With out a periodic tick, a timer needs to be set up each context switch.
We will always (at least in our current concept) have an idea of when,
worst/best case, the next timer interrupt will be (a maximum, predefined
interval). This would also define how long the system could go
completely idle without expiring any timers.
In addition to all of this, there is the issue of how to organize the timer list so that you can find the next timer. With the current organization this is something you just don't want to do. On the other hand, I am convinced that, for periodic timers, it is about as good as it gets.
I think currently, we do something similar to next_timer_interrupt(),
making sure we update the stored value whenever we add a new timer, if
necessary. Except now, we will determine the next time to set the hw
timer to go off every time we expire timers.
I know my proposal is a big change and does make things more difficult
in the short-term for the HRT patch. But I think it is possible to work
together and get a very reasonable system in place, which allows for
both HRT and normal timers to work effectively. I may have made things
worse in my response, but I think the discussion has to be had :)