Re: [BUG] Boot hangs at clocksource_done_booting on large configs

From: Alex Thorlton
Date: Tue Sep 01 2015 - 13:33:11 EST


On Mon, Aug 31, 2015 at 11:12:12PM +0200, Thomas Gleixner wrote:
> On Mon, 31 Aug 2015, Alex Thorlton wrote:
> > I was able to hit this issue on 4.2-rc1 with our RTC disabled, to rule
> > out any scaling issues related to multiple concurrent reads to our
> > RTC's MMR.
>
> And to rule out scaling issues you replaced the RTC MMR with HPET. Not
> a very good choice:
>
> HPET does not scale either. It's uncached memory mapped I/O. See
> below.

Ahhh. I wondered if there might be an issue there. I can say that I've
seen a very similar hang with some of the other possible solutions
we've tried, but this was the most recent, mostly unmodified kernel
I'd played with, so I went with it.

> > I'm hoping to get some input from the experts in this area, first of
> > all, on whether the problem I'm seeing is actually what I think it is,
> > and, if so, if I've solved it in the correct way.
>
> I fear both the analysis and the solution are wrong.
>
> Up to the point where the actual clocksource change happens there is
> no reason why timer interrupts should not happen. And the code which
> actually changes the clocksource is definitely called with interrupts
> disabled. When that function returns the new clocksource is fully
> functional and interrupts can happen again.
>
> Now looking at your backtraces. Most CPUs are in the migration thread
> and a few (3073,3078,3079,3082) are in the idle task.
>
> From the trace artifacts (? read_hpet) it looks like the clock source
> change has been done and the cpus are on the way back from stop
> machine.

That's good to note. I'll keep that in mind as I continue gathering
info.

> But they are obviously held off by something. And that something looks
> like the timekeeper sequence lock.

Ok. IIRC, that lock (tk_core.seq) forces readers to retry until all
in-flight writes to the timekeeper state have finished, correct?

I looked at that loop in ktime_get() as a possible place to get stuck:

	do {
		seq = read_seqcount_begin(&tk_core.seq);
		base = tk->tkr_mono.base;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);

	} while (read_seqcount_retry(&tk_core.seq, seq));

But I don't have enough evidence to point there for sure.

> Too bad, that we don't have a backtrace for CPU0 in the log.

I think that may have gotten swallowed up with all the lost printks that
you'll notice in there. Does the hung task backtrace not cover that
though?

> I really wonder how a machine that large works with HPET as
> clocksource at all. hpet_read() is uncached memory mapped IO which
> takes thousands of CPU cycles. Last time I looked it was around
> 1us. Let's take that number to do some math.
>
> If all CPUs do that access at the same time, then it takes NCPUS
> microseconds to complete if the memory mapped I/O scheduling is
> completely fair, which I doubt. So with 4k CPUs thats whopping 4.096ms
> and it gets worse if you go larger. That's more than a tick with
> HZ=250.
>
> I'm quite sure that you are staring at the HPET scalability bottleneck
> and not at some actual kernel bug.

That could well be the case.

> Your patch shifts some timing around so the issue does not happen, but
> that's certainly not a solution.

I was fairly sure of that, but it's good to know that you agree.

I'll try and get a backtrace from a completely unmodified kernel here in
the next few days. Unfortunately, as mentioned, we only hit the issue
intermittently. I was able to reproduce pretty regularly on kernels
from 4.0 and before, but it seems to be less common on newer kernels,
though we definitely still manage to hit it from time to time.

Thanks for all the help!

- Alex