Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Andy Lutomirski
Date: Sat Dec 05 2020 - 19:37:51 EST
On Sat, Dec 5, 2020 at 3:15 PM Nicholas Piggin <npiggin@xxxxxxxxx> wrote:
>
> Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am:
> >
> If an mm was lazy tlb for a kernel thread and then it becomes unlazy,
> and if switch_mm is serialising but return to user is not, then you
> need a serialising instruction somewhere before return to user. unlazy
> is the logical place to add that, because the lazy tlb mm (i.e.,
> switching to a kernel thread and back without switching mm) is what
> opens the hole.
The issue here is that unlazying on x86 sometimes serializes and
sometimes doesn't. It's straightforward to add logic to the x86 code
to serialize specifically in the case where membarrier core sync is
registered and unlazying would otherwise not serialize, but trying to
define sensible semantics for this in a call to core code seems
complicated. (Specifically, the x86 code only sometimes sends IPIs to
lazy CPUs for TLB flushes. If an IPI is skipped, then unlazying will
flush the TLB, and that operation is serializing.
The whole lazy thing is IMO a red herring for membarrier(). The
membarrier() logic requires that switching *logical* mms
(rq->curr->mm) serializes before user mode if the new mm is registered
for core sync. AFAICT the only architecture on which this isn't
automatic is x86, and somehow the logic turned into "actually changing
rq->curr->mm serializes, but unlazying only sometimes serializes, so
we need to do an extra serialization step on unlazying operations"
instead of "tell x86 to make sure it always serializes when it
switches logical mms". The latter is easy to specify and easy to
implement.
>
> How do you mean? exit_lazy_tlb is the opposite, core scheduler notifying
> arch code about when an mm becomes not-lazy, and nothing to do with
> membarrier at all even. It's a convenient hook to do your un-lazying.
> I guess you can do it also checking things in switch_mm and keeping state
> in arch code, I don't think that's necessarily the best place to put it.
I'm confused. I just re-read your patches, and it looks like you have
arch code calling exit_lazy_tlb(). On x86, if we do a TLB shootdown
IPI to a lazy CPU, the IPI handler will unlazy that CPU (by switching
to init_mm for real), and we have no way to notify the core scheduler
about this, so we don't. The result is that the core scheduler state
and the x86 state gets out of sync. If the core scheduler
subsequently switches us back to the mm that it thinks we were still
using lazily them, from the x86 code's perspective, we're not
unlazying -- we're just doing a regular switch from init_mm to some
other mm. This is why x86's switch_mm_irqs_off() totally ignores its
'prev' argument.
I'm honestly a bit surprised that other architectures don't do the
same thing. I suppose that some architectures don't use shootdown
IPIs at all, in which case there doesn't seem to be any good reason to
aggressively unlazy.
(Oddly, despite the fact that, since Ivy Bridge, x86 has a "just flush
the TLB" instruction, that instruction is considerably slower than the
old "switch mm and flush" operation. So the operation "switch to
init_mm" is only ever any slower than "flush and stay lazy" if we get
lucky and unlazy to the same mm before we get a second TLB shootdown
*and* if unlazying to the same mm would not have needed to flush. I
spend quite a bit of time tuning this stuff and being quite surprised
at the bizarre performance properties of Intel's TLB management
instructions.)
>
> So membarrier code is unchanged (it cares that the serialise is done at
> un-lazy time), core code is simpler (no knowledge of this membarrier
> quirk and it already knows about lazy-tlb so the calls actually improve
> the documentation), and x86 code I would argue becomes nicer (or no real
> difference at worst) because you can move some exit lazy tlb handling to
> that specific call rather than decipher it from switch_mm.
As above, I can't move the exit-lazy handling because the scheduler
doesn't know when I'm unlazying.
>
> >
> > I’m currently trying to document how membarrier actually works, and
> > hopefully this will result in untangling membarrier from mmdrop() and
> > such.
>
> That would be nice.
It's still a work in progress. I haven't actually convinced myself
that the non-IPI case in membarrier() is correct, nor have I convinced
myself that it's incorrect.
Anyway, I think that my patch is a bit incorrect and I either need a
barrier somewhere (which may already exist) or a store-release to
lazy_mm to make sure that all accesses to the lazy mm are done before
lazy_mm is freed. On x86, even aside from the fact that all stores
are releases, this isn't needed -- stopping using an mm is itself a
full barrier. Will this be a performance issue on power?
>
> >
> > A silly part of this is that x86 already has a high quality
> > implementation of most of membarrier(): flush_tlb_mm(). If you flush
> > an mm’s TLB, we carefully propagate the flush to all threads, with
> > attention to memory ordering. We can’t use this directly as an
> > arch-specific implementation of membarrier because it has the annoying
> > side affect of flushing the TLB and because upcoming hardware might be
> > able to flush without guaranteeing a core sync. (Upcoming means Zen
> > 3, but the Zen 3 implementation is sadly not usable by Linux.)
> >
>
> A hardware broadcast TLB flush, you mean? What makes it unusable by
> Linux out of curiosity?
The new instruction is INVLPGB. Unfortunately, x86's ASID field is
very narrow, and there's no way we can give each mm the same ASID
across all CPUs, which means we can't accurately target the flush at
the correct set of TLB entries. I've asked engineers at both Intel
and AMD to widen the ASID field, but that will end up being
complicated -- x86 has run out of bits in its absurdly overloaded CR3
encoding, and widening the ASID to any reasonable size would require
adding a new way to switch mms. There are lots of reasons that x86
should do that anyway [0], but it would be a big project and I'm not
sure that either company is interested in big projects like that.
[0] On x86, you can't switch between (64-bit execution, 48-bit virtual
address space) and (64-bit execution, 57-bit address space) without
exiting 64-bit mode in the middle. This is because the way that the
addressing mode is split among multiple registers prevents a single
instruction from switching between the two states. This is absolutely
delightful for anyone trying to boot an OS on a system with a very,
very large amount of memory.