Re: [PATCH v2 0/4] x86/mtrr: Allow MTRR updates on multiple CPUs in parallel
From: H. Peter Anvin
Date: Thu Feb 12 2026 - 11:55:00 EST
On February 12, 2026 8:24:18 AM PST, "Jürgen Groß" <jgross@xxxxxxxx> wrote:
>On 10.02.26 08:28, H. Peter Anvin wrote:
>> On February 9, 2026 10:51:04 PM PST, "Jürgen Groß" <jgross@xxxxxxxx> wrote:
>>> On 09.02.26 19:37, H. Peter Anvin wrote:
>>>> On February 9, 2026 1:12:59 AM PST, Juergen Gross <jgross@xxxxxxxx> wrote:
>>>>> Ping?
>>>>>
>>>>> I'd really like to have this in 7.0, as it is fixing a real issue on
>>>>> some machines ...
>>>>>
>>>>>
>>>>> Juergen
>>>>>
>>>>> On 30.01.26 12:36, Juergen Gross wrote:
>>>>>> Today, MTRR updates are serialized so they do not happen on multiple
>>>>>> CPUs at the same time, as the related code uses global variables.
>>>>>>
>>>>>> On huge machines with lots of CPUs this can cause problems, as such
>>>>>> updates happen through stop_machine(), which calls the MTRR update
>>>>>> function with interrupts off on all CPUs at the same time. Interrupts
>>>>>> are switched back on only after the last CPU has finished its MTRR
>>>>>> update. As the update is required to run in uncached mode, it can
>>>>>> easily take several milliseconds on each CPU, resulting in the whole
>>>>>> process taking several seconds. This in turn can cause the watchdog
>>>>>> to trigger and report a hard system lockup.
>>>>>>
>>>>>> This series changes the behavior by allowing the MTRR update to
>>>>>> happen on all CPUs in parallel.
>>>>>>
>>>>>> Changes in V2:
>>>>>> - fix a function comment header in patch 2
>>>>>>
>>>>>> Juergen Gross (4):
>>>>>> x86/mtrr: Move cache_enable() and cache_disable() to mtrr/generic.c
>>>>>> x86/mtrr: Introduce MTRR work state structure
>>>>>> x86/mtrr: Add a prepare_set hook to mtrr_ops
>>>>>> x86/mtrr: Drop cache_disable_lock
>>>>>>
>>>>>> arch/x86/include/asm/cacheinfo.h | 2 -
>>>>>> arch/x86/include/asm/mtrr.h | 2 -
>>>>>> arch/x86/kernel/cpu/cacheinfo.c | 80 +----------------
>>>>>> arch/x86/kernel/cpu/mtrr/generic.c | 139 ++++++++++++++++++++++++-----
>>>>>> arch/x86/kernel/cpu/mtrr/mtrr.c | 3 +
>>>>>> arch/x86/kernel/cpu/mtrr/mtrr.h | 2 +
>>>>>> 6 files changed, 122 insertions(+), 106 deletions(-)
>>>>>>
>>>>>
>>>>
>>>> First of all, what machines are even needing MTRR updates these days?
>>>
>>> I'm not aware that this machine actually needed an update.
>>>
>>>> This isn't a rhetorical question. It is important to understand what the underlying problem is.
>>>
>>> It just took several seconds for all CPUs to check whether an update
>>> was needed. It might be an issue with firmware, topology, whatever. It
>>> happened in a test doing 300 cold boots in a row, after roughly 70 loop
>>> iterations, always on one of the last CPUs.
>>>
>>> The issue shows that there IS a potential problem with doing the MTRR
>>> update one CPU after the other instead of just doing it in parallel
>>> (which is the "official" recommendation anyway). See the comment in
>>> cache_disable(). And it isn't as if the fix were very complicated.
>>>
>>>
>>> Juergen
>>
>> You are assuming that it won't break any fragile systems. I'm much more concerned about why this is happening at all.
>
>I'm having a hard time seeing why my series would break fragile systems.
>It's not as if I were changing anything regarding the handling on each
>CPU.
>
>My main suspect for why this is happening is the topology of the system
>(an 8-socket NUMA machine), causing the uncached memory accesses to have
>rather high latency (multiple hops for accessing some memory), so each
>CPU needs some time to check all MTRRs.
>
>
>Juergen
Please stop avoiding the issue, which is WHY this is happening AT ALL on a recent production system.
The fastest way to do anything is to not do it at all.
What do the logs look like, with sufficient verbosity, for one thing?