Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

From: Jeff Merkey
Date: Wed Jan 20 2016 - 13:03:55 EST


On 1/20/16, John Stultz <john.stultz@xxxxxxxxxx> wrote:
> On Wed, Jan 20, 2016 at 9:42 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> wrote:
>> On Wed, 20 Jan 2016, John Stultz wrote:
>>> Ehrm. A more productive route in solving this might be to cap the
>>> cycle delta we return from timekeeping_get_delta().
>>>
>>> We already do this in the CONFIG_DEBUG_TIMEKEEPING, but adding a
>>> simple check to the non-debug case should be doable w/o adding too
>>> much overhead to this very hot path.
>>>
>>> Something like:
>>>     if (delta > tkr->clock->max_cycles)
>>>             delta = tkr->clock->max_cycles;
>>>
>>>     return delta;
>>
>> Well, you can make CONFIG_KDB select CONFIG_DEBUG_TIMEKEEPING.
>
> True. And turning on DEBUG_TIMEKEEPING is probably the easiest thing
> for Jeff to try.
>
> Though, there's still the same issue w/ paused VMs. Most of the design
> for the timekeeping code has been that it can't properly function if
> you block update_wall_time() calls, but it shouldn't kill the box.
> With most clocksources, the issue is the counter wraps and we lose
> time. But in this case with the TSC it's the *very* large cycle delta
> turning into an unexpectedly large nanosecond value.
>
> Hrm.. I do also wonder: the logarithmic accumulation chews through
> large cycle deltas efficiently, but it does have some design limits,
> so it might also hit the rails and take a while to spin accumulating
> time with such large offsets.
>
> Jeff: Can you try the config option above and let me know if that
> avoids the issue? And if not, can you provide some analysis of what
> else is going on?
>
> thanks
> -john
>

Yes sir. I am changing the code and preparing to test this right now.
It will be about 4 hours before I have the results.
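
For reference, the cap John describes can be sketched in userspace C as
below. This is only an illustration under stated assumptions, not the
kernel code: struct demo_clock, demo_get_delta() and demo_cyc_to_ns()
are hypothetical names, and the struct merely stands in for the mask,
max_cycles, mult and shift values the timekeeping code keeps per
clocksource.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-clocksource conversion parameters. */
struct demo_clock {
	uint64_t mask;		/* bitmask matching the counter width */
	uint64_t max_cycles;	/* largest delta considered safe to convert */
	uint32_t mult;		/* cycles -> ns multiplier */
	uint32_t shift;		/* cycles -> ns right shift */
};

/*
 * Masked cycle delta since the last update, clamped to max_cycles so a
 * long stall (debugger stop, paused VM) cannot yield a huge delta.
 */
uint64_t demo_get_delta(const struct demo_clock *c, uint64_t now,
			uint64_t last)
{
	uint64_t delta = (now - last) & c->mask;

	if (delta > c->max_cycles)
		delta = c->max_cycles;

	return delta;
}

/*
 * Scaled conversion: ns = (delta * mult) >> shift. The 64-bit product
 * can wrap if delta is too large, which is why the clamp above matters.
 */
uint64_t demo_cyc_to_ns(const struct demo_clock *c, uint64_t delta)
{
	assert(c->shift < 64);
	return (delta * c->mult) >> c->shift;
}
```

With the clamp in place, time is lost across a long stall, but the
value handed to the conversion stays within the range the multiplier
was chosen for.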

Thanks to you and Thomas for the help. I appreciate it.

Jeff