Re: Current mainline git (24e700e291d52bd2) hangs when building e.g. perf

From: Andy Lutomirski
Date: Fri Sep 08 2017 - 20:00:32 EST

On Fri, Sep 8, 2017 at 4:23 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Sep 8, 2017 at 4:07 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> I *think* this is impossible because CPU A's mm_cpumask manipulations
>> are atomic and should therefore force out the streaming write buffers,
>> but maybe there's some other scenario where this matters.
> I don't think atomic memops do that.
> They enforce globally visible ordering, but since they happen in the
> cache and is not actually visible to outside, that doesn't actually
> affect any streaming write buffers.
> Then, if somebody else requests a cacheline that we have exclusive
> ownership to, the write buffers just need to flush before we give up
> that cacheline.
> So a locked memory op is *not* serializing, it only enforces memory
> ordering. Big difference.
> Only fully serializing instructions will serialize with the write
> buffers, and they are expensive as hell (partly exactly _due_ to these
> kinds of issues).

I'm not convinced. The SDM says (Vol 3, 11.3, under WC):

If the WC buffer is partially filled, the writes may be delayed until
the next occurrence of a serializing event; such as, an SFENCE or
MFENCE instruction, CPUID execution, a read or write to uncached
memory, an interrupt occurrence, or a LOCK instruction execution.

Thanks, Intel, for definiing "serializing event" differently here than
anywhere else in the whole manual.

Anyhow, I can think of two cases where this is relevant.

1. The kernel wants to reclaim a page of normal memory, so it unmaps
it and flushes. Another CPU has an entry for that page in its WC
buffer. I don't think we care whether the flush causes the WC write
to really hit RAM because it's unobservable -- we just need to make
sure it is ordered, as seen by software, before the flush operation
completes. From the quote above, I think we're okay here.

2. The kernel is unmapping some IO memory (e.g. a GPU command buffer).
It wants a guarantee that, when flush_tlb_mm_range returns, all CPUs
are really done writing to it. Here I'm less convinced. The SDM
quote certainly suggests to me that we have a promise that the WC
write has *started* before flush_tlb_mm_range returns, but I'm not
sure I believe that it's guaranteed to have retired. That being said,
I'm not sure that this is observable either -- anything the kernel
does that depends on the writes being done presumably has to involve
further IO to the same device, and I suspect that WC write; lock write
to memory; observe that write on another CPU; IO on other CPU really
does guarantee that everything hits the bus in order.

FWIW, I'm not sure that we ever had a guarantee that IO writes were
all fully done before flush_tlb_mm_range would return. Can't they
still be hanging out in the PCI bridge or whatever until someone
*reads* that device?

> So this change to delay invalidation does sound fairly scary..

With PCID, we're fundamentally delaying invalidation. I think the
worry is more that we're not guaranteeing that every CPU that could
have accessed the page being flushed has executed a real serializing

If we want to force invalidation/serialization, I can see two
reasonable solutions:

1. Revert this behavior change: continue sending IPIs to lazy CPUs.
The problem is that this will totally wipe out the performance gain,
and that gain seemed to be substantial in some microbenchmarks at

2. Get rid of lazy mode. With PCID at least, switching to init_mm isn't so bad.

I'd prefer to leave it as is except on the buggy AMD CPUs, though,
since the current code is nice and fast.