Re: [patch] Real-Time Preemption, -VP-2.6.9-rc4-mm1-U0
From: Ingo Molnar
Date: Fri Oct 15 2004 - 08:11:35 EST
* Robert Wisniewski <bob@xxxxxxxxxxxxxx> wrote:
> > > cmpxchg (basically: try reserve; if fail retry; else write), with
> > > per-cpu buffers.
> >
> > this still does not solve all problems related to irq entries: if the
> > IRQ interrupts the tracing code after a 'successful reserve' but before
> > the 'else write' point, and the trace is printed/saved from an
> > interrupt, then there will be an incomplete entry in the trace.
>
> That is incorrect. The system behavior needed to generate an
> incomplete entry is far more complicated and unlikely than what you
> describe.
ah, but i'm talking about actual first-hand experience, not supposition.
It happens quite easily with latency traces (which are saved/printed
from IRQ entries) and it can be very annoying to analyze. My first
tracers tried to do things without the IRQ flag, so i've seen both
methods.
and let's not forget this other issue:
> > also, there is the problem of timestamp atomicity: if an IRQ interrupts
> > the tracing code and the trace timestamp is taken in the 'else' branch
> > then a time-reversal situation can occur: the entry will have a
> > timestamp _larger_ than the IRQ trace-entries. With cli/sti all tracing
> > entries occur atomically: either fully or not at all.
> > > >Also, cli/sti makes it obviously SMP-safe and is pretty
^^^^^^^^^^
> > > >cheap on all x86 CPUs. (Also, i didnt want to use preempt_disable/enable
^^^^^^^^^^^^^^^^^^^^^^
> > > >because the tracer interacts with that code quite heavily.)
> > >
> > > No preempt_disable/enable found in the lockless logging in relayfs.
> >
> > it would have to do that on PREEMPT_REALTIME. The irq flag solves the
> > races, the predictability problem and the preemption problem nicely.
> >
> > Ingo
>
> If you do not care about performance then you're probably right, this
> is fine. If you are concerned about the time it takes to go through
> the sequence of code, then probably not.
see the portion i highlighted above. CPU makers are busy making cli/sti
as fast as possible. To make sure, i tested it on a typical x86 box:
# ./cli-latency
CLI+STI latency: 8 cycles
Since the trace entry can be filled in a constant amount of time there's
no reason not to make use of that extra silicon that makes fast cli/sti
possible! How many trace entries can you generate per second via
relayfs, on a typical PC? Have you ever measured this?
in fact on a modern CPU cli/sti is very likely faster than a cmpxchgl,
for the following reason: the cmpxchgl generates a read dependency on
the cacheline, which must be fetched in. A single cachemiss can cost
_a lot_ in comparison, 200 cycles easily. In the cli/sti case, by
contrast, we stream out to a new cacheline in a linear fashion, which is
nicely optimized by the write-allocate cache policies of modern CPUs. No
cachemisses on the trace buffer! Just simple streaming out of data.
i challenge you to change your code to use cli/sti and compare it with
the cmpxchgl variant doing some heavy tracing on a modern PC. Please
just _do_ it and come back with numbers. I have done my own
measurements.
Ingo