Re: [BUG, bisect] hrtimer: severe lag after suspend & resume

From: John Stultz
Date: Fri Jun 05 2015 - 14:52:16 EST


On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
>> On Thu, 4 Jun 2015, John Stultz wrote:
>> > On Wed, Jun 3, 2015 at 5:56 PM, Jeremiah Mahler <jmmahler@xxxxxxxxx> wrote:
>> > So I suspect the problem is the change to clock_was_set_seq in
>> > timekeeping_update is done prior to mirroring the time state to the
>> > shadow-timekeeper. Thus the next time we do update_wall_time() the
>> > updated sequence is overwritten by what's in the shadow copy. The
>> > attached patch moving the modification up seems to avoid the issue for
>> > me.
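
(For readers following along: the gist of the attached patch, sketched
from memory and eliding unrelated context, is just to bump the sequence
before the mirror:)

static void timekeeping_update(struct timekeeper *tk, unsigned int action)
{
	...
	if (action & TK_CLOCK_WAS_SET)
		tk->clock_was_set_seq++;	/* moved before the mirror */

	/*
	 * Mirror last, so the shadow copy carries the updated
	 * clock_was_set_seq instead of the next update_wall_time()
	 * clobbering it with the stale value.
	 */
	if (action & TK_MIRROR)
		memcpy(&shadow_timekeeper, &tk_core.timekeeper,
		       sizeof(tk_core.timekeeper));
}
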
>>
>> Duh, yes.
>>
>> > Thomas: Looking at the problematic change, I'm not a big fan of it. Caching
>> > timekeeping state here in the hrtimer code has been a source of bugs in the
>> > past, and I'm not sure I see how avoiding copying 24 bytes is that big of a
>> > win. Especially since it adds more state to the timekeeper and hrtimer base
>> > that we have to read and manage.
>>
>> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
>> In situations where we run high frequency periodic timers on clock monotonic and
>> nothing is going on in the other clock domains, which is a pretty common
>> situation, this is measurable in terms of cache utilization. [...]
>
> It's not just about 'touching': it's about _dirtying_ cachelines from a globally
> executed function (timekeeping), which is then accessed by per-CPU functionality
> (hrtimers).

Right, but part of that issue is that we're caching, in the hrtimer cpu
bases, data that *should not be cached*. That was the core problem that
caused the 2012 leapsecond issue, and I'd prefer not to reintroduce it.
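
To make that concrete: the pattern in question (sketched from memory,
simplified) only refreshes the per-cpu copies of the offsets when the
sequence count has changed, so everything between refreshes runs on
cached state:

	do {
		seq = read_seqcount_begin(&tk_core.seq);
		base = tk->tkr_mono.base;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);

		/* Offsets are only copied out when the sequence moved;
		 * the cpu_base copies are stale the rest of the time. */
		if (*cwsseq != tk->clock_was_set_seq) {
			*cwsseq = tk->clock_was_set_seq;
			*offs_real = tk->offs_real;
			*offs_boot = tk->offs_boot;
			*offs_tai = tk->offs_tai;
		}
	} while (read_seqcount_retry(&tk_core.seq, seq));

	return ktime_add_ns(base, nsecs);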

The offset data is only valid for the monotonic time it's read for. So
dirtying three cache lines is really just a consequence of storing the
data in the cpu_base structure, where I'd argue it doesn't provide real
value (other than the convenience of indexing it cleanly).

Reading the offset data into three values on the stack would be fine
too, and (I think) would avoid dirtying much extra (we have to store
the now value anyway).
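
Something like the below (hypothetical sketch, not a patch) in the
hrtimer interrupt is what I have in mind; nothing global gets dirtied
beyond the timekeeper read itself:

	ktime_t offs_real, offs_boot, offs_tai;
	ktime_t now;

	/* Read now plus the matching offsets into locals on the stack. */
	now = ktime_get_update_offsets_now(&offs_real, &offs_boot, &offs_tai);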

BTW: Thomas, what are you using to do measurements here? I hesitate to
argue in these sorts of performance discussions, since I have only an
embarrassingly theoretical understanding of the issues and suspect I'm
a bit naive here. Additionally, these sorts of constraints aren't
always clearly documented, so being able to measure and compare would
help ensure future changes don't regress things here.

> That makes it far more expensive, it has similar scalability limiting effects as a
> global lock - while if we do it smart it can perform as essentially lockless code
> in most cases.

Another reason I don't like this approach of caching the data is that
it also prevents fixing the leap-second adjustment to happen on the
second edge: we would need an async update to the seqcounter in order
to refresh the cached real_offset, or we would have to export more ntp
state data so we can duplicate the adjustment to the cached data in
the hrtimer code, which is more of the complexity you've objected to.

thanks
-john