[PATCH] timer: make deferrable cpu unbound timers really not bound to a cpu
From: Joonwoo Park
Date: Thu Sep 11 2014 - 18:34:25 EST
When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
queue_delayed_work() it's probably intended to run the work item on any
CPU that isn't idle. However, we queue the work to run at a later time
by starting a deferrable timer that binds to whatever CPU the work is
queued on which is same with queue_delayed_work_on(smp_processor_id())
effectively.
As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
In fact this is perfectly fine with UP kernel and also won't affect much a
system without dyntick with SMP kernel too as every cpus run timers
periodically. But on SMP systems with dyntick current implementation leads
deferrable timers not very scalable because the timer's base which has
queued the deferrable timer won't wake up till next non-deferrable timer
expires even though there are possible other non idle cpus are running
which are able to run expired deferrable timers.
The deferrable work is a good example of the current implementation's
victim like below.
INIT_DEFERRABLE_WORK(&dwork, fn);
CPU 0 CPU 1
queue_delayed_work(wq, &dwork, HZ);
queue_delayed_work_on(WORK_CPU_UNBOUND);
...
__mod_timer() -> queues timer to the
current cpu's timer
base.
...
tick_nohz_idle_enter() -> cpu enters idle.
A second later
cpu 0 is now in idle. cpu 1 exits idle or wasn't in idle so
now it's in active but won't
cpu 0 won't wake up till next handle cpu unbound deferrable timer
non-deferrable timer expires. as it's in cpu 0's timer base.
To make all cpu unbound deferrable timers are scalable, introduce a common
timer base which is only for cpu unbound deferrable timers to make those
are indeed cpu unbound so that can be scheduled by tick_do_timer_cpu.
This common timer fixes scalability issue of delayed work and all other cpu
unbound deferrable timer using implementations.
CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
CC: John Stultz <john.stultz@xxxxxxxxxx>
CC: Tejun Heo <tj@xxxxxxxxxx>
Signed-off-by: Joonwoo Park <joonwoop@xxxxxxxxxxxxxx>
---
Changes in v3:
* Make only tick_do_timer_cpu to run deferral timer wheel to reduce cache bouncing.
kernel/time/timer.c | 94 ++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 71 insertions(+), 23 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c5..59306fe 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -49,6 +49,8 @@
#include <asm/timex.h>
#include <asm/io.h>
+#include "tick-internal.h"
+
#define CREATE_TRACE_POINTS
#include <trace/events/timer.h>
@@ -93,6 +95,9 @@ struct tvec_base {
struct tvec_base boot_tvec_bases;
EXPORT_SYMBOL(boot_tvec_bases);
static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
+#ifdef CONFIG_SMP
+static struct tvec_base *tvec_base_deferral = &boot_tvec_bases;
+#endif
/* Functions below help us manage 'deferrable' flag */
static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
@@ -655,7 +660,14 @@ static inline void debug_assert_init(struct timer_list *timer)
static void do_init_timer(struct timer_list *timer, unsigned int flags,
const char *name, struct lock_class_key *key)
{
- struct tvec_base *base = raw_cpu_read(tvec_bases);
+ struct tvec_base *base;
+
+#ifdef CONFIG_SMP
+ if (flags & TIMER_DEFERRABLE)
+ base = tvec_base_deferral;
+ else
+#endif
+ base = raw_cpu_read(tvec_bases);
timer->entry.next = NULL;
timer->base = (void *)((unsigned long)base | flags);
@@ -777,26 +789,32 @@ __mod_timer(struct timer_list *timer, unsigned long expires,
debug_activate(timer, expires);
- cpu = get_nohz_timer_target(pinned);
- new_base = per_cpu(tvec_bases, cpu);
+#ifdef CONFIG_SMP
+ if (base != tvec_base_deferral) {
+#endif
+ cpu = get_nohz_timer_target(pinned);
+ new_base = per_cpu(tvec_bases, cpu);
- if (base != new_base) {
- /*
- * We are trying to schedule the timer on the local CPU.
- * However we can't change timer's base while it is running,
- * otherwise del_timer_sync() can't detect that the timer's
- * handler yet has not finished. This also guarantees that
- * the timer is serialized wrt itself.
- */
- if (likely(base->running_timer != timer)) {
- /* See the comment in lock_timer_base() */
- timer_set_base(timer, NULL);
- spin_unlock(&base->lock);
- base = new_base;
- spin_lock(&base->lock);
- timer_set_base(timer, base);
+ if (base != new_base) {
+ /*
+ * We are trying to schedule the timer on the local CPU.
+ * However we can't change timer's base while it is
+ * running, otherwise del_timer_sync() can't detect that
+ * the timer's handler yet has not finished. This also
+ * guarantees that the timer is serialized wrt itself.
+ */
+ if (likely(base->running_timer != timer)) {
+ /* See the comment in lock_timer_base() */
+ timer_set_base(timer, NULL);
+ spin_unlock(&base->lock);
+ base = new_base;
+ spin_lock(&base->lock);
+ timer_set_base(timer, base);
+ }
}
+#ifdef CONFIG_SMP
}
+#endif
timer->expires = expires;
internal_add_timer(base, timer);
@@ -1399,6 +1417,12 @@ static void run_timer_softirq(struct softirq_action *h)
hrtimer_run_pending();
+#ifdef CONFIG_SMP
+ if (smp_processor_id() == tick_do_timer_cpu &&
+ time_after_eq(jiffies, tvec_base_deferral->timer_jiffies))
+ __run_timers(tvec_base_deferral);
+#endif
+
if (time_after_eq(jiffies, base->timer_jiffies))
__run_timers(base);
}
@@ -1536,7 +1560,7 @@ static int init_timers_cpu(int cpu)
{
int j;
struct tvec_base *base;
- static char tvec_base_done[NR_CPUS];
+ static char tvec_base_done[NR_CPUS + 1];
if (!tvec_base_done[cpu]) {
static char boot_done;
@@ -1545,8 +1569,12 @@ static int init_timers_cpu(int cpu)
/*
* The APs use this path later in boot
*/
- base = kzalloc_node(sizeof(*base), GFP_KERNEL,
- cpu_to_node(cpu));
+ if (cpu != NR_CPUS)
+ base = kzalloc_node(sizeof(*base), GFP_KERNEL,
+ cpu_to_node(cpu));
+ else
+ base = kzalloc(sizeof(*base), GFP_KERNEL);
+
if (!base)
return -ENOMEM;
@@ -1555,7 +1583,13 @@ static int init_timers_cpu(int cpu)
kfree(base);
return -ENOMEM;
}
- per_cpu(tvec_bases, cpu) = base;
+
+ if (cpu != NR_CPUS)
+ per_cpu(tvec_bases, cpu) = base;
+#ifdef CONFIG_SMP
+ else
+ tvec_base_deferral = base;
+#endif
} else {
/*
* This is for the boot CPU - we use compile-time
@@ -1570,7 +1604,12 @@ static int init_timers_cpu(int cpu)
tvec_base_done[cpu] = 1;
base->cpu = cpu;
} else {
- base = per_cpu(tvec_bases, cpu);
+ if (cpu != NR_CPUS)
+ base = per_cpu(tvec_bases, cpu);
+#ifdef CONFIG_SMP
+ else
+ base = tvec_base_deferral;
+#endif
}
@@ -1678,6 +1717,15 @@ void __init init_timers(void)
(void *)(long)smp_processor_id());
BUG_ON(err != NOTIFY_OK);
+#ifdef CONFIG_SMP
+ /*
+ * initialize cpu unbound deferrable timer base only when CONFIG_SMP.
+ * UP kernel handles the timers with cpu 0 timer base.
+ */
+ err = init_timers_cpu(NR_CPUS);
+ BUG_ON(err);
+#endif
+
init_timer_stats();
register_cpu_notifier(&timers_nb);
open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
--
On Tue, Mar 24, 2015 at 12:36:59PM -0700, Saravana Kannan wrote:
> On 03/19/2015 05:10 PM, Saravana Kannan wrote:
> >On 09/23/2014 11:33 AM, Thomas Gleixner wrote:
> >>On Mon, 15 Sep 2014, Joonwoo Park wrote:
> >>>+#ifdef CONFIG_SMP
> >>>+static struct tvec_base *tvec_base_deferral = &boot_tvec_bases;
> >>>+#endif
> >>
> >>In principle I like the idea of a deferrable wheel, but this
> >>implementation is going to go nowhere.
> >
> >Hi Thomas,
> >
> >To give some more context:
> >
> >This bug is a serious pain in the a** for anyone using deferrable timers
> >or deferrable workqueues today for some periodic work and don't care for
> >which CPU the code runs in.
> >
> >Couple of examples of such issues in existing code:
> >
> >1) In a SMP system, CPUfreq governors (ondemand and conservative) end up
> >queueing a deferrable work on every single CPU and the first one to run
> >the deferrable workqueue cancels the work on all other CPUS, runs the
> >work and then sets up a workqueue on all the CPUs again for the next
> >sampling point.
> >
> >2) Devfreq actually doesn't work well today because it doesn't do this
> >nasty hack like CPUfreq. So, if the devfreq work happens to run on a CPU
> >that's typically idle, then it doesn't run for a long time. I've
> >actually seen logs where the devfreq polling interval is set to 20ms,
> >but ends up running 800ms later.
> >
> >Don't know how many other drivers are suffering from this bug. Yes,
> >IMHO, this is a bug. When the timer is CPU unbound, anyone who hasn't
> >looked at the actual timer code would assume that it'll run as long as
> >any CPU is busy.
> >
> >>First of all making it SMP only is silly. The deferrable stuff is a
> >>pain in other places as well.
> >>
> >>But whats way worse is:
> >>
> >>>+static inline void __run_timers(struct tvec_base *base, bool try)
> >>> {
> >>> struct timer_list *timer;
> >>>
> >>>- spin_lock_irq(&base->lock);
> >>>+ if (!try)
> >>>+ spin_lock_irq(&base->lock);
> >>>+ else if (!spin_trylock_irq(&base->lock))
> >>>+ return;
> >>
> >>Yuck. All cpus fighting about a single spinlock roughly at the same
> >>time? You just created a proper thundering herd cacheline bouncing
> >>issue.
> >
> >I agree, This is not good. I think a simple way to fix this is to have
> >only the CPU that's the jiffy owner to run through this deferrable timer
> >list.
> >
> >That should address any unnecessary cache bouncing issues. Would that be
> >an acceptable solution to this?
> >
> >>
> >>No way. We have already mechanisms in place to deal with such
> >>problems, you just have to use them.
> >
> >I'm not sure which problem you are referring to here. Or what the
> >already existing solutions are.
> >
> >I don't think you were referring to the "deferrable timer getting
> >delayed for long periods despite CPUs being active" problem, because I
> >don't think we have a better solution than this patch (with the jiffy
> >owner CPU fix). Asking every driver that doesn't care which CPU the work
> >runs on to queue a work or set up a timer on every CPU is definitely not
> >a good solution -- that's spreading the complexity to every other driver
> >instead of fixing it in one place. And that causes unnecessary cache
> >pollution too.
> >
> >Thoughts? I would really like to see a fix for this go in soon so that
> >we can simplify cpufreq and have devfreq and other drivers work correctly.
> >
> >Thanks,
> >Saravana
> >
>
> Thomas,
>
> Bump.
>
> -Saravana
>
> --
> The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/