Re: [PATCH 3/4] sched: introduce synchronized idle injection

From: Jacob Pan
Date: Mon Jan 11 2016 - 16:51:25 EST


On Fri, 20 Nov 2015 19:53:45 +0100 (CET)
Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> > This all makes the idle-injection muck hard depend on hres_active,
> > but I think that's a sane constraint anyway.
>
> It makes it actually depend on NO_HZ || HRES. It comes with a few
> other restrictions as well: I'm not going to support that for TSCs
> which stop in C3. It's ugly enough already and I really don't want to
> muck with the broadcast device.
>
> One other thing is that the caller has to ensure that the unblock()
> happens in a timely manner. I really would recommend to just disable
> preemption around that machine on fire wait loop and screw the RT
> tasks. We screw them anyway when they depend on a timer.
>
> Untested patch below.
>
Sorry it took a while, but I tested the patch below with some small
tweaks. Here are the testing results and issues.

1. Block tick works well to improve idle efficiency, e.g. on an Ivy
Bridge 4 core 8 thread desktop system. Performance per watt is
better than without blocking the ticks. It is also more
consistent and predictable. For CPU bound floating point spin workload,
compare idle percentage from 0-50%. The test below is limited to
package C7. The improvement is likely better on newer CPUs or system
tuned with deeper package C-states.

W/O patch
==========
IdlePct Perf RAPL WallPower PPW_RAPL
0 467.16 29.58 0.0 15.79
5 441.68 27.94 0.0 15.81
10 406.71 26.50 0.0 15.35
15 372.16 24.94 0.0 14.92
20 340.89 23.41 0.0 14.56
25 307.87 21.97 0.0 14.01
30 310.69 21.39 0.0 14.53
35 293.29 21.00 0.0 13.97
40 268.78 19.90 0.0 13.51
45 240.66 18.73 0.0 12.85
50 238.56 18.74 0.0 12.73

W/ patch
IdlePct Perf RAPL WallPower PPW_RAPL
0 467.19 29.71 0.0 15.73
5 445.07 28.53 0.0 15.6
10 420.34 27.43 0.0 15.32
15 398.47 26.25 0.0 15.18
20 376.47 25.05 0.0 15.03
25 354.13 23.81 0.0 14.87
30 329.50 22.57 0.0 14.6
35 304.38 21.27 0.0 14.31
40 290.42 20.21 0.0 14.37
45 254.82 16.70 0.0 15.26
50 236.10 17.75 0.0 13.3

Note: PPW_RAPL= performance per watt measured by RAPL package power
meter.

If we just look at 20% forced idle, I can see significantly more deep
package C-states(PC7) residency.
Pkg%pc2 Pkg%pc6 Pkg%pc7
w/o patch 3.17 4.39 13.21
w/ patch 1.15 1.43 17.24

2. run into warning
[ 109.035699] WARNING: CPU: 0 PID: 2748 at
kernel/time/clockevents.c:325 clockevents_program_event+0x102/0x110()
[ 109.035700] Current state: 0
I think it is caused by tick_throttler in CLOCK_EVT_STATE_DETACHED
state. perhaps fix it with below?
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1141,6 +1141,7 @@ static struct clock_event_device tick_throttler =
{
.features = CLOCK_EVT_FEAT_ONESHOT,
.event_handler = tick_throttling_handler,
.set_next_event = tick_throttling_noop,
+ .state_use_accessors = CLOCK_EVT_STATE_ONESHOT,
.mult = 1,
.shift = 0,
.irq = -1,

3. typo? since we use ns as delay unit
@@ -1175,7 +1176,7 @@ int tick_nohz_block_tick(u64 delay)
if (WARN_ON_ONCE(td->evtdev == &tick_throttler))
goto out;

- if (delay > td->evtdev->max_delta_ticks) {
+ if (delay > td->evtdev->max_delta_ns)


4. I see random message about "sched: RT throttling activated", not
sure accounting is being affected. I never got close to 95% RT.

5. Random soft lockup, stall, detected by NMI watchdog, or RCU preempt,
still debugging but each idle duration is only 5 ticks. I also disabled
preemption as suggested.

Thanks,

Jacob

> Thanks,
>
> tglx
>
> 8<----------------
> Subject: tick/nohz: Allow blocking the tick to prevent fire
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Date: Fri, 20 Nov 2015 14:28:17 +0100
>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
> include/linux/tick.h | 5 ++
> kernel/time/Kconfig | 5 ++
> kernel/time/tick-sched.c | 99
> +++++++++++++++++++++++++++++++++++++++++++++++
> kernel/time/tick-sched.h | 7 ++- 4 files changed, 114
> insertions(+), 2 deletions(-)
>
> Index: tip/include/linux/tick.h
> ===================================================================
> --- tip.orig/include/linux/tick.h
> +++ tip/include/linux/tick.h
> @@ -203,4 +203,9 @@ static inline void tick_nohz_task_switch
> __tick_nohz_task_switch();
> }
>
> +#ifdef CONFIG_TICK_THROTTLING
> +int tick_nohz_block_tick(u64 delay);
> +void tick_nohz_unblock_tick(void);
> +#endif
> +
> #endif
> Index: tip/kernel/time/Kconfig
> ===================================================================
> --- tip.orig/kernel/time/Kconfig
> +++ tip/kernel/time/Kconfig
> @@ -60,6 +60,11 @@ menu "Timers subsystem"
> config TICK_ONESHOT
> bool
>
> +# Special functions for tick throttling to avoid fire extinguishers
> +config TICK_THROTTLING
> + bool
> + depends on TICK_ONESHOT
> +
> config NO_HZ_COMMON
> bool
> depends on !ARCH_USES_GETTIMEOFFSET && GENERIC_CLOCKEVENTS
> Index: tip/kernel/time/tick-sched.c
> ===================================================================
> --- tip.orig/kernel/time/tick-sched.c
> +++ tip/kernel/time/tick-sched.c
> @@ -1119,6 +1119,105 @@ void tick_setup_sched_timer(void)
> }
> #endif /* HIGH_RES_TIMERS */
>
> +#ifdef CONFIG_TICK_THROTTLING
> +/*
> + * An interrupt might have been pending already, when we programmed
> + * the throttler time. Nothing to do here. The device is armed and
> the
> + * interrupt will fire again. If this is the real wakeup event then
> + * the unblock call will retrigger it.
> + */
> +static void tick_throttling_handler(struct clock_event_device *dev)
> +{
> +}
> +
> +static int tick_throttling_noop(unsigned long evt,
> + struct clock_event_device *d)
> +{
> + return 0;
> +}
> +
> +static struct clock_event_device tick_throttler = {
> + .name = "throttler",
> + .features = CLOCK_EVT_FEAT_ONESHOT,
> + .event_handler = tick_throttling_handler,
> + .set_next_event = tick_throttling_noop,
> + .mult = 1,
> + .shift = 0,
> + .irq = -1,
> +};
> +
> +/**
> + * tick_nohz_block_tick - Force block the tick to prevent fire
> + * @delay: Defer the tick for X nano seconds
> + *
> + * This is a special interface for thermal emergencies. Do not use
> + * otherwise! Make sure to call tick_nohz_block_tick() right after
> + * the delay ends to undo the damage.
> + */
> +int tick_nohz_block_tick(u64 delay)
> +{
> + struct tick_device *td;
> + unsigned long flags;
> + int ret = -EBUSY;
> + ktime_t until;
> +
> + if (!tick_nohz_active)
> + return -EBUSY;
> +
> + local_irq_save(flags);
> + td = this_cpu_ptr(&tick_cpu_device);
> +
> + /* No way to do that with broadcast */
> + if (td->evtdev->features & CLOCK_EVT_FEAT_C3STOP)
> + goto out;
> +
> + /* Yes, I do not trust people! */
> + if (WARN_ON_ONCE(td->evtdev == &tick_throttler))
> + goto out;
> +
> + if (delay > td->evtdev->max_delta_ticks) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + until = ktime_add_ns(ktime_get(), delay);
> + if (!tick_program_event(until, 0)) {
> + td->real_evtdev = td->evtdev;
> + td->evtdev = &tick_throttler;
> + ret = 0;
> + }
> +out:
> + local_irq_restore(flags);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(tick_nohz_block_tick);
> +
> +/**
> + * tick_nohz_force_unblock_tick - Undo the force blocking of the tick
> + *
> + * Pairs with tick_nohz_block_tick(). Can be called unconditionally
> + * even if the tick was not blocked by tick_nohz_block_tick().
> + */
> +void tick_nohz_unblock_tick(void)
> +{
> + struct tick_device *td;
> + unsigned long flags;
> +
> + if (!tick_nohz_active)
> + return;
> +
> + local_irq_save(flags);
> + td = this_cpu_ptr(&tick_cpu_device);
> + if (td->evtdev == &tick_throttler) {
> + td->evtdev = td->real_evtdev;
> + /* Force a timer interrupt now */
> + tick_program_event(ktime_get(), 1);
> + }
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(tick_nohz_unblock_tick);
> +#endif /* TICK_THROTTLING */
> +
> #if defined CONFIG_NO_HZ_COMMON || defined CONFIG_HIGH_RES_TIMERS
> void tick_cancel_sched_timer(int cpu)
> {
> Index: tip/kernel/time/tick-sched.h
> ===================================================================
> --- tip.orig/kernel/time/tick-sched.h
> +++ tip/kernel/time/tick-sched.h
> @@ -9,8 +9,11 @@ enum tick_device_mode {
> };
>
> struct tick_device {
> - struct clock_event_device *evtdev;
> - enum tick_device_mode mode;
> + struct clock_event_device *evtdev;
> + enum tick_device_mode mode;
> +#ifdef CONFIG_TICK_THROTTLING
> + struct clock_event_device *real_evtdev;
> +#endif
> };
>
> enum tick_nohz_mode {

[Jacob Pan]