Re: PROBLEM: Probabilistic segfault on AMD hardware with INVLPGB

From: Rik van Riel

Date: Mon Jun 22 2026 - 15:03:01 EST


On Fri, 2026-06-19 at 15:17 +0200, Henrik Böving wrote:
> [1.] One line summary of the problem
> Probabilistic segfault on AMD hardware with INVLPGB
>
> [2.] Full description of the problem/report:
>
> Since upgrading from kernel 6.14 to 7.1, we've been observing
> probabilistic segfaults on our build machines. We were able to
> reproduce these segfaults on two copies of the same hardware, equipped
> with an AMD EPYC 9455 48-Core Processor. Subsequently, we managed to
> bisect the segfault down to the introduction of INVLPGB-based page
> invalidation by enabling/disabling BROADCAST_TLB_FLUSH on kernel 6.15.
> The segfaults also do not reproduce on hardware that does not have the
> INVLPGB instruction.
>
We also saw increased segfault rates when we
first deployed Turin CPUs. This turned out to be
AMD erratum AMD-SB-3029.

Upgrading the firmware on the host solved those
segfaults for us.

A bulk data analysis across the Meta fleet
shows that Turin segfault rates are in the same
range as those of other CPU microarchitectures
from several vendors.

https://www.amd.com/en/resources/product-security/bulletin/amd-sb-3029.html

Now, it is possible that INVLPGB has increased
the likelihood of hitting some race condition, and
it is also possible that the kernel has a TLB
flushing bug somewhere, but it may make sense
to start with a firmware upgrade.
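
Before and after a firmware upgrade, it can help to record
whether the CPU advertises INVLPGB and which microcode revision
is loaded. Below is a minimal sketch, assuming /proc/cpuinfo
lists the feature as "invlpgb" in the "flags" line and reports a
"microcode" field; parse_cpuinfo is a hypothetical helper name.
It does not know which revision contains the fix; compare the
reported value against the AMD bulletin linked above.

```python
def parse_cpuinfo(text):
    """Return (has_invlpgb, microcode) for the first CPU listed in text.

    Assumes the /proc/cpuinfo layout of "key : value" lines, with a
    blank line separating one logical CPU from the next.
    """
    has_invlpgb = False
    microcode = None
    seen_content = False
    for line in text.splitlines():
        if not line.strip():
            if seen_content:
                break  # end of the first CPU's block
            continue
        seen_content = True
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "flags":
            # Match whole flag names so "invlpg" does not count.
            has_invlpgb = "invlpgb" in value.split()
        elif key == "microcode":
            microcode = value
    return has_invlpgb, microcode


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            has_invlpgb, microcode = parse_cpuinfo(f.read())
        print("INVLPGB advertised:", has_invlpgb)
        print("microcode revision:", microcode)
    except OSError as err:
        print("could not read /proc/cpuinfo:", err)
```

This only inspects the first logical CPU, on the assumption that
all CPUs in the machine run the same microcode revision.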

--
All Rights Reversed.