Re: [GIT pull] timer updates for 4.9

From: Thomas Gleixner
Date: Mon Oct 24 2016 - 05:42:30 EST

Next message: Jiri Olsa: "Re: [GIT PULL 00/52] New Tool: perf c2c"
Previous message: Hannes Frederic Sowa: "Re: [PATCH net-next 4/5] ipv6: Compute multipath hash for sent ICMP errors from offending packet"
Next in thread: Thomas Gleixner: "Re: [GIT pull] timer updates for 4.9"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, 23 Oct 2016, Linus Torvalds wrote:
> So I found what looks like a bug in lock_timer_base() wrt migration.
>
> This code:
>
> for (;;) {
> struct timer_base *base;
> u32 tf = timer->flags;
>
> if (!(tf & TIMER_MIGRATING)) {
> base = get_timer_base(tf);
> spin_lock_irqsave(&base->lock, *flags);
> if (timer->flags == tf)
> return base;
> spin_unlock_irqrestore(&base->lock, *flags);
> }
> cpu_relax();
> }
>
> looks subtly buggy. I think that load of "tf" needs a READ_ONCE() to
> make sure that gcc doesn't simply reload the valid of "timer->flags"
> at random points.

You are right, that needs a READ_ONCE(). Stupid me.

> Yes, the spin_lock_irqsave() is a barrier, but that's the only one.
> Afaik, gcc could decide that "I need to spill tf, so I'll just reload
> it" after looking up get_timer_base().
>
> And no, I don't think this is the cause of my problem, but I suspect
> that something _like_ fragility in lock_timer_base() could cause this.

It might explain it, when this really ends up with the wrong base.

> I dunno. That whole thing looks very fragile to begin with: is it
> really ok to change the expiry time of a timer without holding any
> locks what-so-ever? The timer may just be firing on another CPU, and
> you may be setting the expiry time on a timer that isn't ever going to
> fire again.

A timer firing on the other CPU is not an issue, but yes, we should not do
that unlocked. This certainly is a naive over optimization.

I'll go through the locking once more with a fine comb and send out fixes
later today along with another NOHZ issue which I decoded over the weekend.

> And there may well be valid reasons why I'm full of crap, and there's
> some reason why this is all safe. Maybe the GP fault I saw was my
> fault after all, in some way that I can't for the life of me figure
> out right now..

Is your pr_cont thing serialized or does it rely on the timer locking (or
the lack of it)?

Thanks,

tglx

Next message: Jiri Olsa: "Re: [GIT PULL 00/52] New Tool: perf c2c"
Previous message: Hannes Frederic Sowa: "Re: [PATCH net-next 4/5] ipv6: Compute multipath hash for sent ICMP errors from offending packet"
Next in thread: Thomas Gleixner: "Re: [GIT pull] timer updates for 4.9"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]