Re: [GIT pull] timer updates for 4.9
From: Thomas Gleixner
Date: Mon Oct 24 2016 - 05:42:30 EST
On Sun, 23 Oct 2016, Linus Torvalds wrote:
> So I found what looks like a bug in lock_timer_base() wrt migration.
>
> This code:
>
> for (;;) {
> struct timer_base *base;
> u32 tf = timer->flags;
>
> if (!(tf & TIMER_MIGRATING)) {
> base = get_timer_base(tf);
> spin_lock_irqsave(&base->lock, *flags);
> if (timer->flags == tf)
> return base;
> spin_unlock_irqrestore(&base->lock, *flags);
> }
> cpu_relax();
> }
>
> looks subtly buggy. I think that load of "tf" needs a READ_ONCE() to
> make sure that gcc doesn't simply reload the valid of "timer->flags"
> at random points.
You are right, that needs a READ_ONCE(). Stupid me.
> Yes, the spin_lock_irqsave() is a barrier, but that's the only one.
> Afaik, gcc could decide that "I need to spill tf, so I'll just reload
> it" after looking up get_timer_base().
>
> And no, I don't think this is the cause of my problem, but I suspect
> that something _like_ fragility in lock_timer_base() could cause this.
It might explain it, when this really ends up with the wrong base.
> I dunno. That whole thing looks very fragile to begin with: is it
> really ok to change the expiry time of a timer without holding any
> locks what-so-ever? The timer may just be firing on another CPU, and
> you may be setting the expiry time on a timer that isn't ever going to
> fire again.
A timer firing on the other CPU is not an issue, but yes, we should not do
that unlocked. This certainly is a naive over optimization.
I'll go through the locking once more with a fine comb and send out fixes
later today along with another NOHZ issue which I decoded over the weekend.
> And there may well be valid reasons why I'm full of crap, and there's
> some reason why this is all safe. Maybe the GP fault I saw was my
> fault after all, in some way that I can't for the life of me figure
> out right now..
Is your pr_cont thing serialized or does it rely on the timer locking (or
the lack of it)?
Thanks,
tglx