Re: Prevent inconsistent CPU state after sequence of dlclose/dlopen

From: Mark Rutland
Date: Fri Jan 10 2025 - 13:57:21 EST


Hi Peter,

[adding Andy and Will, since we've discussed related cases in the past]

On Fri, Jan 10, 2025 at 06:11:12PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 10, 2025 at 12:02:27PM -0500, Mathieu Desnoyers wrote:
> > On 2025-01-10 11:54, Peter Zijlstra wrote:
> > > On Fri, Jan 10, 2025 at 10:55:36AM -0500, Mathieu Desnoyers wrote:
> > > > Hi,
> > > >
> > > > I was discussing with Mark Rutland recently, and he pointed out that a
> > > > sequence of dlclose/dlopen mapping new code at the same addresses in
> > > > multithreaded environments is an issue on ARM, and possibly on Intel/AMD
> > > > with the newer TLB broadcast maintenance.
> > >
> > > What is the exact race? Should not munmap() invalidate the TLBs before
> > > it allows overlapping mmap() to complete?
> >
> > The race Mark mentioned (on ARM) is AFAIU the following scenario:
> >
> > CPU 0                     CPU 1
> >
> > - dlopen()
> >   - mmap PROT_EXEC @addr
> >                           - fetch insn @addr, CPU state expects unchanged insn.
> >                           - execute unrelated code
> > - dlclose(addr)
> >   - munmap @addr
> > - dlopen()
> >   - mmap PROT_EXEC @addr
> >                           - fetch new insn @addr. Incoherent CPU state.

For the benefit of others, what I specifically said was:

| There's a fun (latent, been around forever) issue whereby reusing the
| same VA for different code (e.g. dlopen() ... dlclose() ... dlopen())
| could blow up in a multi-threaded environment

I hadn't reported this on a list yet because there are many subtleties,
this is vanishingly unlikely to occur in practice today, and I didn't
want to get people excited/confused/angry over an incomplete or
misleading description.

> Urgh.. Mark, is this because of non-coherent i-cache or somesuch misery?

Sort-of.

The key detail is that while instructions are being executed (including
speculative execution), the CPU pipeline/OoO-engine/whatever effectively
caches a copy of an instruction while it is "in-flight" (e.g.
potentially broken down into micro-ops).

On the ARM architecture, those in-flight copies are only guaranteed to
be discarded by a context-synchronization-event, and are not guaranteed
to be discarded due to TLB maintenance, data cache maintenance, or
instruction cache maintenance. Instruction cache maintenance will
guarantee that *subsequent* fetches from any instruction cache observe
the new value.

The first time a page of executable code is mapped in at a VA, this
isn't a problem because there was nothing previously at that VA which
could have been fetched from (since entering userspace, as exception
return from kernel to user provides a context-synchronization-event).

However, if some code A is mapped at a VA, then unmapped, then some
distinct code B is mapped at that VA, then some CPUs might still have
code A in-flight, regardless of TLB and cache maintenance, unless a
context-synchronization-event occurs.

Imagine you have a CPU microarchitecture with a long speculative
execution window, and you have two threads running pinned on two CPUs
sharing an address space, with some shared function pointer P which is
initially NULL. Then you have something like:

	Thread 0                                Thread 1

	// Speculating some long-to-resolve
	// sequence of instructions.
	                                        - mmap() code A at VA X
	                                          - enters kernel
	                                          - kernel loads data into page
	                                          - kernel performs D$ + I$ maintenance
	                                          - kernel updates page tables
	                                          - returns to userspace

	// Begins speculating if (P) { P() };

	// Begins speculating P(), predicted as
	// VA X.

	// Fetches code A from X into pipeline
	// Code A now in-flight
	                                        - munmap() VA X
	                                          - enters kernel
	                                          - updates page tables
	                                          - performs TLB maintenance (broadcast)
	                                          - returns to userspace

	// Code A still in-flight

	                                        - mmap() code B at VA X
	                                          - enters kernel
	                                          - kernel loads data into page
	                                          - kernel performs D$ + I$ maintenance
	// Code A no longer in I$
	// Code A still in-flight
	                                          - kernel updates page tables
	                                          - returns to userspace

	// Code A still in-flight

	                                        - Publishes P as pointer in code B
	                                          (e.g. WRITE_ONCE(P, X))

	// Completes speculation of
	// long-to-resolve sequence.

	// Resolves P is VA X, and
	// commits speculation of code A


... and BANG, stale instructions executed.
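
For anyone who wants to see the userspace half of that sequence in
concrete terms, here is a rough sketch of what the mapping thread does
(map A, unmap, map B at the same VA, publish P). Everything in it is
illustrative: the names map_page(), reuse_va_and_publish() and
published_pointer() are made up, the pages are plain anonymous
read/write memory filled with a byte pattern rather than real
PROT_EXEC code, and MAP_FIXED is used only to force the VA reuse that a
loader would normally get by chance. It shows the ordering of the
syscalls and the publish, and cannot reproduce the stale-instruction
problem itself.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical shared pointer, standing in for P in the diagram. */
static _Atomic(void *) P;

/*
 * Map one anonymous page and fill it with 'fill' as a stand-in for code.
 * With hint == NULL the kernel picks the address; otherwise MAP_FIXED
 * forces the mapping to land exactly at 'hint'.
 */
void *map_page(void *hint, unsigned char fill, size_t len)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;
	void *p;

	if (hint)
		flags |= MAP_FIXED;

	p = mmap(hint, len, PROT_READ | PROT_WRITE, flags, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	memset(p, fill, len);
	return p;
}

/*
 * Map "code A", unmap it, map "code B" at the same VA, then publish P.
 * Returns 1 if B reused A's address, 0 if it did not, -1 on error.
 */
int reuse_va_and_publish(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *a, *b;

	a = map_page(NULL, 0xAA, len);		/* "code A" at VA X */
	if (!a)
		return -1;

	if (munmap(a, len) != 0)		/* kernel does TLB maintenance */
		return -1;

	b = map_page(a, 0xBB, len);		/* "code B" at the same VA X */
	if (!b)
		return -1;

	/* Publish P; the userspace analogue of WRITE_ONCE(P, X). */
	atomic_store_explicit(&P, b, memory_order_release);

	return b == a;
}

/* What a reader thread would load before calling through P. */
void *published_pointer(void)
{
	return atomic_load_explicit(&P, memory_order_acquire);
}
```

Note that nothing in this sequence gives the *other* thread a
context-synchronization-event between the two mappings; the release
store only orders the data side.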

Note that on architectures that use IPIs for TLB invalidation (e.g. x86
today), the munmap() is likely to provide the serialization to discard
the in-flight copy by virtue of the IPI.

Practically speaking, actually hitting this is very unlikely because you
need to get very unlucky with prediction, the predicted instructions
need to remain in-flight for a very long time without being discarded
for other reasons (e.g. a mis-prediction, IRQ, etc), and the VA needs to
be reused within that time window.

> But shouldn't flush_{,i}cache_range() or something along those lines not
> handle this?

Unfortunately not, those only affect the explicit data/instruction
caches, and not the in-flight copies of the instructions.

Mark.