Re: [RFC] time: xtime_lock is held too long

From: Eric Dumazet
Date: Fri May 06 2011 - 19:00:47 EST


On Friday, May 6, 2011 at 15:46 -0700, john stultz wrote:
> On Sat, 2011-05-07 at 00:30 +0200, Eric Dumazet wrote:
> > On Friday, May 6, 2011 at 13:24 -0700, john stultz wrote:
> >
> > > So would the easier solution be to just break out timekeeper locking
> > > from the xtime_lock?
> > >
> > > So basically we would just add a timekeeper.lock seqlock and use it to
> > > protect only the timekeeping code? We can still keep xtime_lock around
> > > for the tick/jiffies protection (well, until tglx kills jiffies :), but
> > > gettimeofday and friends wouldn't be blocked for so long.
> > >
> > > That should be pretty straight forward now that the timekeeper data is
> > > completely static to timekeeping.c.
> > >
> >
> > Yes :)
> >
> > I can see many cpus entering tick_do_update_jiffies64() and all are
> > calling write_seqlock(&xtime_lock);
> >
> > Only the first one can perform the work, but all the others wait on the
> > spinlock, acquire it, bump the seqcount, and then realize they have
> > nothing to do...
>
> Huh. So who is calling tick_do_update_jiffies64 in your case? I know the
> sched_tick_timer and tick_nohz_handler check to make sure
> tick_do_timer_cpu == cpu to avoid exactly this thundering herd problem
> on the jiffies update.
>
> There are other spots that call tick_do_update_jiffies64, but I thought
> those were rarer. So something else may be going wrong here.
>


That I can answer:

echo function > /sys/kernel/debug/tracing/current_tracer

echo "do_timer calc_global_load second_overflow ktime_divns do_timestamp1 do_timestamp2 tick_do_update_jiffies64" >/sys/kernel/debug/tracing/set_ftrace_filter

echo 1 > /sys/kernel/debug/tracing/tracing_enabled
usleep 2000000;
echo 0 > /sys/kernel/debug/tracing/tracing_enabled
cat /sys/kernel/debug/tracing/trace >TRACE


(I added do_timestamp1() just after write_seqlock() and do_timestamp2()
just before write_sequnlock().)
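
A sketch of that instrumentation, for reference (only the
do_timestamp1/do_timestamp2 names come from the trace filter above; the
empty noinline bodies are an assumption, there purely so that ftrace
has symbols to hook):

static noinline void do_timestamp1(void) { }	/* write-side entry marker */
static noinline void do_timestamp2(void) { }	/* write-side exit marker */

	write_seqlock(&xtime_lock);
	do_timestamp1();
	/* ... delta check, do_timer(), calc_global_load() ... */
	do_timestamp2();
	write_sequnlock(&xtime_lock);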

<idle>-0 [003] 920.355377: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [006] 920.355377: tick_do_update_jiffies64 <-tick_sched_timer
<idle>-0 [003] 920.355378: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [000] 920.355657: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.355660: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [006] 920.355663: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [006] 920.355666: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [003] 920.355781: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [003] 920.355783: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [004] 920.355796: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.355797: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [000] 920.355808: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [004] 920.355933: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.355934: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.355935: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [000] 920.355935: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [006] 920.356093: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [006] 920.356095: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [006] 920.356198: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.356202: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.356211: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [002] 920.356319: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [002] 920.356321: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [000] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [006] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [002] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [003] 920.356372: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.356373: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [004] 920.356373: do_timer <-tick_do_update_jiffies64
<idle>-0 [004] 920.356374: calc_global_load <-do_timer
<idle>-0 [004] 920.356374: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [006] 920.356375: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [006] 920.356376: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [000] 920.356377: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [000] 920.356377: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [002] 920.356379: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [002] 920.356379: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [003] 920.356380: do_timestamp1 <-tick_do_update_jiffies64
<idle>-0 [003] 920.356380: do_timestamp2 <-tick_do_update_jiffies64
<idle>-0 [002] 920.356515: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [002] 920.356517: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [002] 920.356607: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [002] 920.356608: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [000] 920.356716: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [000] 920.356718: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [006] 920.356720: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [006] 920.356723: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
<idle>-0 [004] 920.356808: tick_do_update_jiffies64 <-tick_check_idle
<idle>-0 [004] 920.356810: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick


>
> > Meanwhile, a reader must wait until all writers are finished, because
> > of the storm of seqcount changes.
> >
> > The following patch helps. Of course, we might want to find out why so
> > many cpus (on my 8-cpu machine!) are calling tick_do_update_jiffies64()
> > at the same time...
> >
> >
> > This is basically what I said in my first mail:
> >
> > Separate logical sections to reduce windows where readers are blocked/spinning.
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index d5097c4..251b2fe 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -56,7 +56,7 @@ static void tick_do_update_jiffies64(ktime_t now)
> > return;
> >
> > /* Reevalute with xtime_lock held */
> > - write_seqlock(&xtime_lock);
> > + spin_lock(&xtime_lock.lock);
>
> Oof.. No, this is too ugly and really just abuses the seqlock structure.
>

That was a hack/POC, of course, but we can clean up seqlock.h to provide
a clean primitive: a seqlock_t should contain a seqcount_t plus a
spinlock_t.


diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index e98cd2e..afc00f4 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -29,8 +29,12 @@
#include <linux/spinlock.h>
#include <linux/preempt.h>

-typedef struct {
+typedef struct seqcount {
unsigned sequence;
+} seqcount_t;
+
+typedef struct {
+ seqcount_t seq;
spinlock_t lock;
} seqlock_t;

@@ -53,6 +57,22 @@ typedef struct {
#define DEFINE_SEQLOCK(x) \
seqlock_t x = __SEQLOCK_UNLOCKED(x)

+/*
+ * Sequence counter only version assumes that callers are using their
+ * own mutexing.
+ */
+static inline void write_seqcount_begin(seqcount_t *s)
+{
+ s->sequence++;
+ smp_wmb();
+}
+
+static inline void write_seqcount_end(seqcount_t *s)
+{
+ smp_wmb();
+ s->sequence++;
+}
+
/* Lock out other writers and update the count.
* Acts like a normal spin_lock/unlock.
* Don't need preempt_disable() because that is in the spin_lock already.
@@ -60,14 +80,12 @@ typedef struct {
static inline void write_seqlock(seqlock_t *sl)
{
spin_lock(&sl->lock);
- ++sl->sequence;
- smp_wmb();
+ write_seqcount_begin(&sl->seq);
}

static inline void write_sequnlock(seqlock_t *sl)
{
- smp_wmb();
- sl->sequence++;
+ write_seqcount_end(&sl->seq);
spin_unlock(&sl->lock);
}

@@ -75,10 +93,9 @@ static inline int write_tryseqlock(seqlock_t *sl)
{
int ret = spin_trylock(&sl->lock);

- if (ret) {
- ++sl->sequence;
- smp_wmb();
- }
+ if (ret)
+ write_seqcount_begin(&sl->seq);
+
return ret;
}

@@ -88,7 +105,7 @@ static __always_inline unsigned read_seqbegin(const seqlock_t *sl)
unsigned ret;

repeat:
- ret = sl->sequence;
+ ret = sl->seq.sequence;
smp_rmb();
if (unlikely(ret & 1)) {
cpu_relax();
@@ -107,7 +124,7 @@ static __always_inline int read_seqretry(const seqlock_t *sl, unsigned start)
{
smp_rmb();

- return unlikely(sl->sequence != start);
+ return unlikely(sl->seq.sequence != start);
}


@@ -118,10 +135,6 @@ static __always_inline int read_seqretry(const seqlock_t *sl, unsigned start)
* after the write_seqcount_end().
*/

-typedef struct seqcount {
- unsigned sequence;
-} seqcount_t;
-
#define SEQCNT_ZERO { 0 }
#define seqcount_init(x) do { *(x) = (seqcount_t) SEQCNT_ZERO; } while (0)

@@ -204,22 +217,6 @@ static inline int read_seqcount_retry(const seqcount_t *s, unsigned start)
}


-/*
- * Sequence counter only version assumes that callers are using their
- * own mutexing.
- */
-static inline void write_seqcount_begin(seqcount_t *s)
-{
- s->sequence++;
- smp_wmb();
-}
-
-static inline void write_seqcount_end(seqcount_t *s)
-{
- smp_wmb();
- s->sequence++;
-}
-
/**
* write_seqcount_barrier - invalidate in-progress read-side seq operations
* @s: pointer to seqcount_t
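
With that split in place, the timekeeping code could pair a private lock
with a private seqcount, along the lines john suggests below. A minimal
sketch, assuming hypothetical names (timekeeper_lock, timekeeper_seq)
that appear in no patch here:

static DEFINE_SPINLOCK(timekeeper_lock);	/* hypothetical */
static seqcount_t timekeeper_seq = SEQCNT_ZERO;	/* hypothetical */

/* writer: updaters serialize on the private lock only */
static void timekeeping_update_sketch(void)
{
	spin_lock(&timekeeper_lock);
	write_seqcount_begin(&timekeeper_seq);
	/* ... update timekeeper state ... */
	write_seqcount_end(&timekeeper_seq);
	spin_unlock(&timekeeper_lock);
}

/*
 * reader: gettimeofday() and friends would then spin only across real
 * timekeeping updates, never across the jiffies/tick work that stays
 * under xtime_lock
 */
static u64 timekeeping_read_sketch(void)
{
	unsigned seq;
	u64 val;

	do {
		seq = read_seqcount_begin(&timekeeper_seq);
		val = 0;	/* ... read timekeeper state ... */
	} while (read_seqcount_retry(&timekeeper_seq, seq));

	return val;
}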



> If you really want to untangle what xtime_lock protects, you need to
> introduce a new lock (I suggest in the timekeeper structure) to protect
> the timekeeping data.
>
> Then we can refine xtime_lock to also just protect the jiffies/tick
> management bits as well if needed.

For the moment I am trying to understand the code. It's quite complex and
uses a monolithic seqlock, defeating the whole point of seqlocks.

That was really my initial point, if you remember my mail...


