[RFC v7 16/23] clockevents: clockevents_program_min_delta(): don't set ->next_event

From: Nicolai Stange
Date: Fri Sep 16 2016 - 16:16:23 EST

Currently, clockevents_program_min_delta() sets a clockevent device's
->next_event to the point in time where the minimum delta would actually
elapse:

  delta = dev->min_delta_ns;
  dev->next_event = ktime_add_ns(ktime_get(), delta);

This has been the case ever since clockevents_program_min_delta() was
introduced with
commit d1748302f70b ("clockevents: Make minimum delay adjustments

clockevents_program_min_delta() is called from clockevents_program_event()
only. More specifically, it is called if the latter's force argument is set
and, neglecting the case of device programming failure for the moment, if
the requested expiry is in the past.

In contrast, if the expiry requested from clockevents_program_event()
is in the future, but less than ->min_delta_ns away, then
- ->next_event gets set to that expiry verbatim,
- but the clockevent device gets silently programmed to fire after
  ->min_delta_ns only.

Thus, in the extreme cases of expires == ktime_get() and
expires == ktime_get() + 1, the respective values of ->next_event would
differ by ->min_delta_ns while the clockevent device would actually get
programmed to fire at (almost) the same times (with force being set,
of course).
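
The discontinuity can be reproduced with a small userspace model. This is
only a sketch under stated assumptions: struct model_dev and
model_program_event_old() are hypothetical names, time is a plain int64_t
nanosecond count passed in by the caller, force is assumed to be set, and
device programming failures as well as ->max_delta_ns are ignored.

```c
#include <stdint.h>

/* Hypothetical userspace model; not the kernel's struct clock_event_device. */
struct model_dev {
	int64_t min_delta_ns;	/* smallest programmable delta */
	int64_t next_event;	/* models dev->next_event */
	int64_t fires_at;	/* when the modelled hardware would fire */
};

/* Models the pre-patch behaviour with force set. */
static void model_program_event_old(struct model_dev *dev, int64_t now,
				    int64_t expires)
{
	int64_t delta = expires - now;

	if (delta <= 0) {
		/* Expiry not in the future: min-delta path overwrites next_event. */
		dev->next_event = now + dev->min_delta_ns;
		dev->fires_at = now + dev->min_delta_ns;
		return;
	}

	/* Expiry in the future: recorded verbatim, delta clamped upwards. */
	dev->next_event = expires;
	if (delta < dev->min_delta_ns)
		delta = dev->min_delta_ns;
	dev->fires_at = now + delta;
}
```

With min_delta_ns = 1000 and now = 5000, an expiry of 5000 yields
next_event = 6000 while an expiry of 5001 yields next_event = 5001;
fires_at is 6000 in both cases.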

While this discontinuity of ->next_event at expires == ktime_get() is not
a problem by itself, the mere use of ->min_delta_ns in the event
programming path hinders upcoming changes making the clockevent core
NTP correction aware: both ->mult and ->min_delta_ns would need to get
updated as well as consumed atomically, and we'd rather avoid any
locking here.

Thus, let clockevents_program_event() unconditionally set ->next_event to
the expiry time actually requested by its caller, i.e. don't set
->next_event from clockevents_program_min_delta().

A few notes on why this change is safe with the current consumers of
->next_event:

Note that a clockevents_program_event() with a requested expiry in the
past and force being set basically means: "fire ASAP". Now, consider this
so programmed event getting handed once again to
clockevents_program_event(), i.e. that a

  clockevents_program_event(dev, dev->next_event, false)

as in __clockevents_update_freq() is done.
With this change applied, clockevents_program_event() would now properly
detect the expiry being in the past and, due to the force argument being
unset, wouldn't actually do anything.
Before this change OTOH, there would be the (very unlikely) possibility
that the requested event is still somewhere in the future and
clockevents_program_event() would silently delay the event expiration by
another ->min_delta_ns.
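
As a rough illustration of the two outcomes, the non-forced reprogramming
can be modelled in userspace. model_reprogram_noforce() is a hypothetical
helper, its -1 return merely stands in for the kernel's error return, and
time is again a caller-supplied int64_t nanosecond count.

```c
#include <stdint.h>

/*
 * Hypothetical model of clockevents_program_event(dev, next_event, false).
 * Returns 0 and updates *fires_at if the device gets reprogrammed,
 * -1 (standing in for an error return) if the expiry is not in the future.
 */
static int model_reprogram_noforce(int64_t now, int64_t next_event,
				   int64_t min_delta_ns, int64_t *fires_at)
{
	int64_t delta = next_event - now;

	if (delta <= 0)
		return -1;	/* force unset: nothing gets programmed */

	if (delta < min_delta_ns)
		delta = min_delta_ns;	/* silently clamped upwards */

	*fires_at = now + delta;
	return 0;
}
```

Take min_delta_ns = 1000 and a forced "fire ASAP" event programmed at
now = 5000, so that the device fires at 6000. A non-forced reprogramming
at now = 5400 with the pre-patch next_event of 6000 moves the expiry to
6400. With a next_event that lies in the past, e.g. 4000, the call fails
and the device is left alone.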

The periodic tick handlers on oneshot-only devices use ->next_event
to calculate the followup expiry time.
tick_handle_periodic() spins on reprogramming the clockevent device
until some expiry in the future has been reached:

  ktime_t next = dev->next_event;
  for (;;) {
    next = ktime_add(next, tick_period);
    if (!clockevents_program_event(dev, next, false))
      return;
    [...]
  }

Thus, tick_handle_periodic() isn't affected by this change.
For tick_handle_periodic_broadcast(), the situation is different since

commit 2951d5c031a3 ("tick: broadcast: Prevent livelock from event

though: a loop similar to the one from tick_handle_periodic() above got
replaced by a single

  ktime_t next = ktime_add(dev->next_event, tick_period);
  clockevents_program_event(dev, next, true);

In the case that dev->next_event + tick_period happens to be less than
ktime_get() + ->min_delta_ns, without this change applied, ->next_event
would get recovered to some point in the future after a single
tick_handle_periodic_broadcast() event.
In contrast, with this patch applied, it could potentially take some
number of tick_handle_periodic_broadcast() events, each separated by
->min_delta_ns only, until ->next_event is able to catch up with the
current ktime_get(). However, if this turns out to become a problem,
the reprogramming loop in tick_handle_periodic_broadcast() can probably
be restored easily.
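
To get a feeling for the catch-up cost, the following userspace sketch
counts handler invocations until the newly computed expiry lies in the
future. model_broadcast_catch_up() and its parameters are hypothetical;
the model assumes that the device fires exactly ->min_delta_ns after each
forced programming and that tick_period exceeds ->min_delta_ns.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: number of tick_handle_periodic_broadcast()-like
 * events needed until next_event is in the future again.
 * min_delta_sets_next_event selects the pre-patch (true) or the
 * post-patch (false) behaviour. Returns -1 if the model's assumption
 * tick_period > min_delta_ns does not hold.
 */
static int model_broadcast_catch_up(int64_t now, int64_t next_event,
				    int64_t tick_period, int64_t min_delta_ns,
				    bool min_delta_sets_next_event)
{
	int events = 0;

	if (tick_period <= min_delta_ns)
		return -1;

	for (;;) {
		int64_t next = next_event + tick_period;

		events++;
		if (next > now)
			return events;	/* new expiry is in the future */
		if (min_delta_sets_next_event)
			return events;	/* pre-patch: next_event = now + min_delta_ns */

		/* Post-patch: stale expiry is kept, device fires after min_delta_ns. */
		next_event = next;
		now += min_delta_ns;
	}
}
```

For now = 10000, next_event = 0, tick_period = 4000 and
min_delta_ns = 1000, the pre-patch variant needs a single event whereas
the post-patch variant needs four.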

In kernel/time/tick-broadcast.c, the broadcast receiving clockevent
devices' ->next_event is read multiple times in order to determine who's
next or who must be pinged. These uses all continue to work. Moreover,
clockevent devices getting programmed to something less than
ktime_get() + ->min_delta_ns
might not be the best candidates for a transition into C3 anyway.

Finally, a "sleep length" is calculated at the very end of
tick_nohz_stop_sched_tick() as follows:

  ts->sleep_length = ktime_sub(dev->next_event, now);

AFAICS, this can happen to be negative w/o this change applied already: in
NOHZ_MODE_HIGHRES mode there can be some overdue hrtimers whose removal is
blocked because tick_nohz_stop_sched_tick() gets called with interrupts
disabled. Unfortunately, the only user, the menu cpuidle governor,
can't cope with negative sleep lengths as it casts the return value
of the tick_nohz_get_sleep_length() getter to an unsigned int.
This change can very well make things worse here. A followup patch
will force this ->sleep_length to >= 0.

Signed-off-by: Nicolai Stange <nicstange@xxxxxxxxx>
---
 kernel/time/clockevents.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index f41f584..8983fee 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -252,7 +252,6 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)

 	for (i = 0;;) {
 		delta = dev->min_delta_ns;
-		dev->next_event = ktime_add_ns(ktime_get(), delta);
 
 		if (clockevent_state_shutdown(dev))
 			return 0;
@@ -289,7 +288,6 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)
 	int64_t delta;
 
 	delta = dev->min_delta_ns;
-	dev->next_event = ktime_add_ns(ktime_get(), delta);
 
 	if (clockevent_state_shutdown(dev))
 		return 0;