[RFC v8 00/28] adapt clockevents frequencies to mono clock

From: Nicolai Stange
Date: Sat Nov 19 2016 - 11:01:40 EST


Goal: avoid programming ced devices too early for large deltas, for
details, c.f. the description of [23/28].

Previous v7 can be found at [1].

While the runtimes looked Ok for x86, you were concerned about
outliers on ARM:

Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:
> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>
>> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
>> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.
>
> So why is the variance that high?
> You have an outlier on that intel as well which might be caused by
> NMI, but it might also be a systematic issue depending on the input
> parameters. 11 us on that ARM worries me.

I did those benchmarks on a defconfig'd linux-next with a Raspbian
running in userspace. This turned out to be the worst choice I could
have made.
- The USB host controller driven by dwc2 fired 8000 interrupts per
second.
- Missing cpufreq support in upstream linux caused the core to always
run at its minimum speed.
- The Raspbian userspace was a mostly idle but quite uncontrolled
workload.

So, for the next round of measurements, I
- disabled almost all drivers, USB in particular,
- told the firmware to always run the core at its nominal max of 900MHz
- and used Busybox for the userspace.

While the adjustment process, i.e. the input parameters, had formerly
been driven by a NTP daemon, I switched to injecting deterministic
offsets via adjtimex(2) from time to time. Due to their exponential
decay, the adjustments span a larger range now.

On top of that, I generated various artificial workloads by means of
stress(1) and it turned out that clockevents_adjust_all_freqs()'
runtime is
- insensitive to input parameters,
- insensitive to CPU load (as it should),
- very sensitive to memory load.

Actual numbers can be found in the description of
[23/28] ("timekeeping: inform clockevents about freq adjustments"):
- CPU stressed system
Mean: 1733.75+-81.10
Quantiles:
0% 25% 50% 75% 100%
1458 1671 1717 1800 2083 (ns)

- idle system gives numbers very similar to the CPU stressed case

- Memory stressed system
Mean: 8750.73+-1034.77
Quantiles:
0% 25% 50% 75% 100%
3854 8171 8901 9452 13593 (ns)

Note that the "memory stressed system" is the ultimate worst case
scenario: all TLB, data and instruction caches are presumably cold and
memory is busy.

I did my best to improve on the memory loaded situation, c.f. the new
[27/28] ("clockevents: optimize struct clock_event_device layout")
and
[28/28] ("clockevents: move non-adjustable devices to the tail of the list")

The numbers finally became
Mean: 6505.18+-740.85
Quantiles:
0% 25% 50% 75% 100%
2708 6054 6523 6980 10885 (ns)

These are all >12h runs which accumulated >100000 samples and I think
that I may safely claim these 10.9us being the upper bound under a
worst case workload.

I played around with some other things like
- prefetchw()ing the clock_event_device structs,
- moving the clockevents_lock into .data, namely into the
clockevent_devices list head's cacheline
with no further improvement.

A remaining possibility would be to always_inline
__clockevents_calc_adjust_freq(): for some reason, gcc emits it into a
page different from clockevents_adjust_all_freqs()' one. However, I
didn't do that for now.

Final note: clockevents_adjust_all_freqs()' runtime is
O(#adjustable clockevent devices). That ARM has 5 of them. Due to lack
of hardware access, I can't tell how the performance might look
like on NUMA or machines with hundreds of cores.

So, I think there are three options to proceed from here:
- leave this series (modulo other review comments) as is,
i.e. continue to call clockevents_adjust_all_freqs() from
update_wall_time -> ... -> timekeeping_freqadjust()
in interrupt context,
- move the call to clockevents_adjust_all_freqs() out of interrupt
context into some scheduled work
- or abandon this series because it's too intrusive/complex
when weighed against the improvements it attempts to achieve.


Thanks,

Nicolai

[1] http://lkml.kernel.org/r/20160916200851.9273-1-nicstange@xxxxxxxxx



Changes to v7:
Rebased against next-20161117.

In order to make some room in struct clock_event_device's first
cacheline,
[20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int")
[21/28] ("clockevents: pack ->state_use_accessors and ->features together")
have been added.

This freed space is used by the new
[27/28] ("clockevents: optimize struct clock_event_device layout")

The also new
[26/28] ("clockevents: degrade ->retries to unsigned int")
allows the just mentioned [27/28] to keep struct clock_event_device
from crossing an extra multiple of the cacheline size on x86_64.

The relative order of former
[14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
and
[15-20/23], the ->min_delta_ns => ->min_delta_ticks transformation,
has been reversed because this allows for a more natural evolution
of struct clock_event_device w.r.t. to its final layout.
The latter subseries can now be found at [13-18/28] while
the "decouple ->max_delta_ns from ->max_delta_ticks" patch is at [22/28].

[28/28] ("clockevents: move non-adjustable devices to the tail of the list")
is new.

Finally, it's worth mentioning that in former
[15/23] ("clockevents: do comparison of delta against minimum in terms of cycles"),
now being [13/28], the superfluous
dev->min_delta_ticks_adjusted = max(dev->min_delta_ticks_adjusted,
dev->min_delta_ticks);
has been purged: cev_delta2ns() always rounds up and this change avoids
an access to the cache-cold ->min_delta_ticks from
clockevents_increase_min_delta().

Per-patch details:
[1-10/28]: unchanged

[11/28] ("clockevents: always initialize ->min_delta_ns and ->max_delta_ns")
- Skip CLOCK_EVT_FEAT_DUMMY devices in __clockevents_update_bounds().

[12/28] ("many clockevent drivers: don't set ->min_delta_ns and ->max_delta_ns")
- Unchanged.

[13/28] ("clockevents: do comparison of delta against minimum in terms of cycles")
- Former [15/23].
- As mentioned above, purge the
dev->min_delta_ticks_adjusted = max(dev->min_delta_ticks_adjusted,
dev->min_delta_ticks);
- Rebase in order to apply before former
[14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
which is now at [22/28]

[14/28] ("clockevents: clockevents_program_min_delta(): don't set ->next_event")
- Former [16/23], otherwise unchanged.

[15/28] ("clockevents: use ->min_delta_ticks_adjusted to program minimum delta")
- Former [17/23], otherwise unchanged.

[16/28] ("clockevents: min delta increment: calculate min_delta_ns from ticks")
- Former [18/23].
- Rebase in order to apply with the max(...) removal in [13/28].

[17/28] ("timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically")
- Former [19/23].
- Rebase in order to apply before former
[14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
which is now at [22/28].

[18/28] ("clockevents: purge ->min_delta_ns")
- Former [20/23].
- Rebase in order to apply before former
[14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
which is now at [22/28].
- Rebase in order to apply to apply with the max(...) removal in [13/28].

[19/28] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag")
- Former [13/23], otherwise unchanged.

[20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int")
- New.

[21/28] ("clockevents: pack ->state_use_accessors and ->features together")
- New.

[22/28] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
- Former [14/23].
- Rebase to apply at this position in the series.

[23/28] ("clockevents: initial support for mono to raw time conversion")
- Former [21/23].
- Changes to clockchips.h rebased in order to apply with the new
layout of struct clock_event_device.
- Rebase in order to apply with the max(...) removal in [13/28].
- Rebase in order to apply after the new
[20/28] ("clockevents: degrade ->min_delta_ticks to unsigned int")
- Skip CLOCK_EVT_FEAT_DUMMY devices in __clockevents_adjust_freq().

[24/28] ("clockevents: make setting of ->mult and ->mult_adjusted atomic")
- Former [22/23], otherwise unchanged.

[25/28] ("timekeeping: inform clockevents about freq adjustments")
- Former [23/23].
- Add benchmark results to description.
- Skip CLOCK_EVT_FEAT_DUMMY devices in clockevents_adjust_all_freqs().

[26/28] ("clockevents: degrade ->retries to unsigned int")
- New.

[27/28] ("clockevents: optimize struct clock_event_device layout")
- New.

[28/28] ("clockevents: move non-adjustable devices to the tail of the list")
- New.


Changes to v6:
Rebased against next-20160916.

[1/23] ("clocksource: sh_cmt: compute rate before registration again")
Do not remove braces at if statement.


Changes to v5:
[21/23] ("clockevents: initial support for mono to raw time conversion")
Replace the max_t() in
adj = max_t(u64, adj, mult_ce_raw / 8);
by min_t(): mult_ce_raw / 8 actually sets an upper bound on the
mult adjustments.

[23/23] ("timekeeping: inform clockevents about freq adjustments")
Move the clockevents_adjust_all_freqs() invocation from
timekeeping_apply_adjustment() to timekeeping_freqadjust().
Reason is given in the patch description.


Changes to v4:
[1-12/23] Unchanged

[13/23] ("clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag")
New.

[14/23] ("clockevents: decouple ->max_delta_ns from ->max_delta_ticks")
New. Solves the overflow problem the former
[13/22] ("clockevents: check a programmed delta's bounds in terms of cycles")
from v4 introduced.

(Note that the former
[14/22] ("clockevents: clockevents_program_event(): turn clc into unsigned long")
from v4 has been purged.)

[15/23] ("clockevents: do comparison of delta against minimum in terms of cycles")
This is the former
[13/22] ("clockevents: check a programmed delta's bounds in terms of cycles"),
but only for the ->min_delta_* -- the ->max_delta_* are handled by [14/23] now.

[16/23] ("clockevents: clockevents_program_min_delta(): don't set ->next_event")
Former [15/22] unchanged.

[17/23] ("clockevents: use ->min_delta_ticks_adjusted to program minimum delta")
Former [16/22]. Trivially fix compilation error with
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n.

[18/22] ("clockevents: min delta increment: calculate min_delta_ns from ticks")
Former [17/22] unchanged.

[19/23] ("timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically")
Corresponds to former
[18/22] ("timer_list: print_tickdevice(): calculate ->*_delta_ns dynamically")
from v4, but only for ->min_delta_ns. The changes required for the display of
->max_delta_ns are now being made in [14/23] already.

[20/23] ("clockevents: purge ->min_delta_ns")
Corresponds to former
[19/22] ("clockevents: purge ->min_delta_ns and ->max_delta_ns"),
but with ->max_delta_ns being kept.

[21/23] ("clockevents: initial support for mono to raw time conversion")
Former [20/22] with the following changes:
- Don't adjust mult for those ced's that have CLOCK_EVT_FEAT_NO_ADJUST set.
- Don't meld __clockevents_update_bounds() into __clockevents_adjust_freq()
anymore: the bounds for those devices having CLOCK_EVT_FEAT_NO_ADJUST set
must have got their bounds set as well.
- In __clockevents_calc_adjust_freq(), make sure that the adjusted mult
doesn't exceed the original by more than 12.5%. C.f. [14/23].
- In timekeeping, define timekeeping_get_mono_mult() only for
CONFIG_GENERIC_CLOCKEVENTS=y.

[22/23] ("clockevents: make setting of ->mult and ->mult_adjusted atomic")
Former [12/22], but with the description updated: previously, it said that
this patch would introduce a new locking dependency. This is not true.

[23/23] ("timekeeping: inform clockevents about freq adjustments")
Former [22/22] with the following changes:
- Don't adjust mult for those ced's that have CLOCK_EVT_FEAT_NO_ADJUST set.
- In clockevents_adjust_all_freqs(), reuse the adjusted cached mult only
if the associated ->shift also matches.
- Introduce noop clockevents_adjust_all_freqs() in order to fix a
compilation error with CONFIG_GENERIC_CLOCKEVENTS=n.


Nicolai Stange (28):
clocksource: sh_cmt: compute rate before registration again
clocksource: sh_tmu: compute rate before registration again
clocksource: em_sti: split clock prepare and enable steps
clocksource: em_sti: compute rate before registration
clocksource: h8300_timer8: don't reset rate in ->set_state_oneshot()
clockevents: make clockevents_config() static
many clockevent drivers: set ->min_delta_ticks and ->max_delta_ticks
arch/s390/kernel/time: set ->min_delta_ticks and ->max_delta_ticks
arch/x86/platform/uv/uv_time: set ->min_delta_ticks and
->max_delta_ticks
arch/tile/kernel/time: set ->min_delta_ticks and ->max_delta_ticks
clockevents: always initialize ->min_delta_ns and ->max_delta_ns
many clockevent drivers: don't set ->min_delta_ns and ->max_delta_ns
clockevents: do comparison of delta against minimum in terms of cycles
clockevents: clockevents_program_min_delta(): don't set ->next_event
clockevents: use ->min_delta_ticks_adjusted to program minimum delta
clockevents: min delta increment: calculate min_delta_ns from ticks
timer_list: print_tickdevice(): calculate ->min_delta_ns dynamically
clockevents: purge ->min_delta_ns
clockevents: introduce CLOCK_EVT_FEAT_NO_ADJUST flag
clockevents: degrade ->min_delta_ticks to unsigned int
clockevents: pack ->state_use_accessors and ->features together
clockevents: decouple ->max_delta_ns from ->max_delta_ticks
clockevents: initial support for mono to raw time conversion
clockevents: make setting of ->mult and ->mult_adjusted atomic
timekeeping: inform clockevents about freq adjustments
clockevents: degrade ->retries to unsigned int
clockevents: optimize struct clock_event_device layout
clockevents: move non-adjustable devices to the tail of the list

arch/avr32/kernel/time.c | 4 +-
arch/blackfin/kernel/time-ts.c | 8 +-
arch/c6x/platforms/timer64.c | 4 +-
arch/hexagon/kernel/time.c | 4 +-
arch/m68k/coldfire/pit.c | 6 +-
arch/microblaze/kernel/timer.c | 6 +-
arch/mips/alchemy/common/time.c | 4 +-
arch/mips/jz4740/time.c | 4 +-
arch/mips/kernel/cevt-bcm1480.c | 4 +-
arch/mips/kernel/cevt-ds1287.c | 4 +-
arch/mips/kernel/cevt-gt641xx.c | 4 +-
arch/mips/kernel/cevt-sb1250.c | 4 +-
arch/mips/kernel/cevt-txx9.c | 5 +-
arch/mips/loongson32/common/time.c | 4 +-
arch/mips/loongson64/common/cs5536/cs5536_mfgpt.c | 4 +-
arch/mips/loongson64/loongson-3/hpet.c | 4 +-
arch/mips/ralink/cevt-rt3352.c | 4 +-
arch/mips/sgi-ip27/ip27-timer.c | 4 +-
arch/mn10300/kernel/cevt-mn10300.c | 4 +-
arch/powerpc/kernel/time.c | 6 +-
arch/s390/kernel/time.c | 4 +-
arch/score/kernel/time.c | 6 +-
arch/sparc/kernel/time_32.c | 4 +-
arch/sparc/kernel/time_64.c | 6 +-
arch/tile/kernel/time.c | 4 +-
arch/um/kernel/time.c | 4 +-
arch/unicore32/kernel/time.c | 6 +-
arch/x86/kernel/apic/apic.c | 12 +-
arch/x86/lguest/boot.c | 4 +-
arch/x86/platform/uv/uv_time.c | 6 +-
arch/x86/xen/time.c | 8 +-
drivers/clocksource/dw_apb_timer.c | 5 +-
drivers/clocksource/em_sti.c | 49 +++--
drivers/clocksource/h8300_timer8.c | 8 -
drivers/clocksource/metag_generic.c | 4 +-
drivers/clocksource/numachip.c | 4 +-
drivers/clocksource/sh_cmt.c | 45 ++--
drivers/clocksource/sh_tmu.c | 26 +--
drivers/clocksource/timer-atlas7.c | 4 +-
include/linux/clockchips.h | 49 +++--
kernel/time/clockevents.c | 244 ++++++++++++++++++----
kernel/time/tick-broadcast-hrtimer.c | 2 -
kernel/time/tick-internal.h | 6 +
kernel/time/timekeeping.c | 16 ++
kernel/time/timer_list.c | 11 +-
45 files changed, 409 insertions(+), 219 deletions(-)

--
2.10.2