Re: frequent lockups in 3.18rc4

From: Chris Mason
Date: Thu Dec 04 2014 - 09:58:19 EST

On Thu, Dec 4, 2014 at 12:49 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Dec 3, 2014 at 7:15 PM, Chris Mason <clm@xxxxxx> wrote:
> >
> > One guess is that trinity is generating a huge number of tlb
> > invalidations over sparse and horrible ranges. Perhaps the old code was
> > falling back to full tlb flushes before Dave Hansen's string of fixes?
>
> Hmm. I agree that we've had some of the backtraces look like TLB
> flushing might be involved. Not all, though. And I'm not seeing where
> a loop over up to 33 pages should matter over doing a full TLB flush.
>
> What *might* matter is if we somehow get that number wrong, and the loops like
>
>     addr = f->flush_start;
>     while (addr < f->flush_end) {
>         addr += PAGE_SIZE;
>
> ends up looping a *lot* due to some bug, and then the IPI itself would
> take so long that the watchdog could trigger.
>
> But I do not see how that could actually happen. As far as I can tell,
> either the number of pages is limited to less than 33, or we have that
>
> Do you see something I don't?

Sadly not. Looking harder, I'm pretty sure all of the flushes coming through from this path are single page flushes anyway. So the most likely explanation is that we're waiting on the remote CPU, who is stuck somewhere secret.
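
For anyone following along, here is a rough userspace sketch of the arithmetic behind the quoted loop. It is not the kernel code: struct flush_info, flush_loop_iterations(), looks_runaway(), the 4 KiB PAGE_SIZE and the FLUSH_CEILING constant are all hypothetical names and values chosen for illustration. Only flush_start, flush_end and the figure of 33 come from the thread. It computes how many times the loop body would run for a given range, so a bogus range can be inspected without actually spinning on it.

```c
#include <stdbool.h>

#define PAGE_SIZE     4096UL  /* assumption: 4 KiB pages */
#define FLUSH_CEILING 33UL    /* the figure of 33 from the thread; exact comparison is an assumption */

/* Hypothetical stand-in for the flush request; only the two quoted fields. */
struct flush_info {
    unsigned long flush_start;
    unsigned long flush_end;
};

/*
 * Number of iterations of
 *     addr = f->flush_start;
 *     while (addr < f->flush_end)
 *         addr += PAGE_SIZE;
 * computed arithmetically (assumes addr does not wrap around).
 */
static unsigned long flush_loop_iterations(const struct flush_info *f)
{
    unsigned long span;

    if (f->flush_end <= f->flush_start)
        return 0;   /* empty or inverted range: the loop body never runs */

    span = f->flush_end - f->flush_start;
    /* Round up: a partial trailing page still costs one iteration. */
    return span / PAGE_SIZE + (span % PAGE_SIZE != 0);
}

/* True if the range would loop more often than the assumed ceiling allows. */
static bool looks_runaway(const struct flush_info *f)
{
    return flush_loop_iterations(f) > FLUSH_CEILING;
}
```

A single page flush (flush_end == flush_start + PAGE_SIZE) gives one iteration, which is the cheap case described above. Only a range that slipped past the ceiling check would make looks_runaway() return true, and that is the kind of bug being ruled out here.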


