Re: [PATCH] clocksource, prevent overflow in clocksource_cyc2ns
From: John Stultz
Date: Wed Apr 04 2012 - 14:01:18 EST
On 04/04/2012 08:11 AM, Prarit Bhargava wrote:
The clocksource code has a watchdog which runs once a minute. The
clocksource watchdog calculates the number of nanoseconds since the last
time the watchdog ran and compares that value to the number of nanoseconds
that have passed on another clocksource. If these values do not match (to
within .0625s) then the watchdog marks the current clocksource as unstable
and switches to another clocksource.
This works so long as the delta between calls of the watchdog is small
(maximum delta is ~18 seconds with the tsc). After that, the
clocksource_cyc2ns() calculation will overflow and the calculated number
of ns returned is incorrect.
This causes the watchdog to erroneously mark the tsc as unstable and
switch to the hpet.
A long delay on the system is not usual, however, it can be reproduced
simply by doing
echo 1> /proc/sys/kernel/sysrq
for i in `seq 10000`; do sleep 1000& done
echo t> /proc/sysrq-trigger
on a 32-cpu system (which is not an unusual number of processes for this
size of system). This floods the printk buffer and results in a
pause of approximately 600 seconds which prevents the clocksource watchdog
from running during that time. On the next call, the watchdog erroneously
marks the tsc as unstable and switches to the hpet because the checked
values for the tsc overflow.
Hmm. This is a pretty good mix of existing problems. :)
We've seen clocksource watchdog false-positives w/ virtualization
migrations before, and I think that recently got resolved . Also the
huge dumps to printk's which runs with irqs off (usually over slow
serial ports) causing lost time also is an outstanding issue.
Fixing this is simple -- use mult_frac() in clocksource_cyc2ns().
Signed-off-by: Prarit Bhargava<prarit@xxxxxxxxxx>
Cc: John Stultz<johnstul@xxxxxxxxxx>
Cc: Thomas Gleixner<tglx@xxxxxxxxxxxxx>
Cc: Salman Qazi<sqazi@xxxxxxxxxx>
Cc: stable@xxxxxxxxxx
---
include/linux/clocksource.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index fbe89e1..1625ddb 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -273,7 +273,7 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
*/
static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
{
- return ((u64) cycles * mult)>> shift;
+ return mult_frac(cycles, mult, (1UL<< shift));
}
So I'm not sure this is actually the right fix.
The first issue is that the overflow is expected at a specific
frequency, and the clocksource code and clockevent code work together to
make sure the timekeeping core can accumulate time within that
frequency. This is important not just for multiplication overflows, but
also because some clocksource hardware wraps after just a few seconds,
requiring we accumulate frequently.
The rest of the timekeeping code is designed with the requirement that
the overflow you're solving with the above doesn't happen. It doesn't
mean it can't happen (as you've experience), but that you're running
outside the expected bounds of the timekeeping code. So just fixing the
multiply in clocksource_cyc2ns doesn't also change the assumptions
elsewhere in the timekeeping code.
Additionally, gettimeofday/clock_gettime is a hotpath, and this adds
extra overhead to the calculation. So that's not great either.
So as I said above, I think there's really two issues at play here:
1) If you starve the timekeeping code from being able to periodically
accumulate time bad things happen.
2) The clocksource watchdog in particular cannot suffer much delay at
all, and falsely triggers on occasion.
Item #1 is really an issue of degree. We can try to be more flexible by
stretching the expected length of the period out further. But some
hardware that wraps quickly just cannot accommodate long intervals. On
the TSC we have quite a bit of time till it wraps, but by keeping the
periods shorter, we allow for finer grained clock adjustments for NTP.
So its somewhat of a balancing act between allowing for really precise
timekeeping and robustness in the face of interrupt starvation (and we
have only been getting better here, the periodic accumulation was only a
tick back in the day, so SMIs or any long period with irqs off would
cause lost time). Currently the TSC period should be 10 minutes before
we overflow (which aligns to your 600 seconds above). I believe this to
be a reasonable length of time where we should expect interrupts to be
able to occur. For example, other common clock hardware, like the
acpi_pm can really only last ~5 seconds before it wraps.
Item #2 is where I think the real fix is needed. The watchdog has been
useful in being able to detect flakey TSCs, but because running with a
bad TSC can be very problematic for the kernel, its is quick on the
trigger to disqualify it. Also, over the years, the expected use cases
have grown for systems, and that has run into problems as it hits the
bounds of the watchdogs' expectations. In fact, I suspect its not that
the TSC wrapping but the HPET or acpi_pm that is being used as the
"stable" watchdog clocksource. One likely way to trigger a false
positive is if the watchdog is the acpi_pm, but the clocksoruce is the
TSC, we can delay/starve interrupts for quite some time, since the TSC
won't wrap. But if the watchdog arrives late, and the acpi_pm clock has
wrapped, then it assumes the TSC is broken. The problem with these
issues is from the kernel's perspective you have to trust one of the two
clocks. A similar issue can happen w/ the HPET but not as quickly (since
it won't wrap as fast, although maybe you're hitting the multiplication
overflow there... honestly your 18 second interval time in the above
isn't lining up for me right now, have to think more on that).
The hard problem is that the TSC can go wrong in *many* ways. Its a
terrible architecture feature, but there's nothing close to it
performance wise, so we do what we can to try to make use of it when its
safe. Its variation of problems makes it hard to detect false
positives when the "trusted" watchdog fails.
One idea I've had is to replace the watchdog clocksource with the
read_persistent_clock() RTC interface. This would allow us to avoid
wrapping issues with watchdog clocksources, however it would also mean
we couldn't really check the TSC accuracy for a number of seconds to
minutes (due to the RTC granularity being just seconds itself). Or maybe
it would be good to still use the watchdog, but double check against the
RTC if we get a false positive long after boot?
Another idea I had tried to work on a few years ago was to try to catch
watchdog overflows was checking if the TSC delta == watchdog delta +
some multiple of watchdog intervals, but I didn't manage to get the math
right there, and with small enough wrapping intervals and a long enough
delay, a % error bound becomes impossible to trigger.
One idea might be to replace the cyc2ns w/ mult_frac in only the
watchdog code. I need to think on that some more (and maybe have you
provide some debug output) to really understand how that's solving the
issue for you, but it would be able to be done w/o affecting the other
assumptions of the timekeeping core.
Thomas: Any other ideas? Looking at the watchdog code, since its not an
hrtimer, could it also possibly be delayed past its .0625s threshold
even just due to long NOHZ deep idle? I suspect quite a few assumptions
have changed since that code was written. :)
thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/