Re: [PATCH 1/1] x86/mm/pae: Align up pteval_t, pmdval_t and pudval_t to avoid split locks

From: Javier Pello
Date: Wed Apr 03 2024 - 11:54:51 EST


On Tue, 2 Apr 2024 10:43:26 -0700, Dave Hansen wrote:

> Fair enough. I can totally understand wanting the convenience.
> But you're leaving _so_ much performance on the floor that split
> locks are the least of your problems.

Split locks may not be a problem, but tasks stuck in kernel mode
are. :-)

> > I see. So just annotating the variable on the stack with
> > __aligned(8) should do it? But the code is under mm/, so it
> > should be arch-agnostic, right? What would the correct fix be,
> > then? I take from your message that using atomics through
> > pmd_populate() here is not needed, but what accessors should
> > be used instead? I am not familiar at all with this part of the
> > kernel.
>
> I don't think there's a better accessor.

That is what I thought, too, after looking through the code.

My assessment, then, is the following: The code in mm/huge_memory.c
uses a variable of type pmd_t on the stack, and uses pmd_populate()
to initialise it. On x86-32 with PAE, that variable is 8 bytes in
size but has an alignment of 4 bytes, and pmd_populate(), through
native_set_pmd(), uses an 8-byte atomic instruction on it. If the
variable is unlucky enough to span two cache lines, this triggers
an exception in kernel mode, and the kernel oopses. This is not
merely a false positive: a task gets stuck (repeatedly, in my
case) with user-visible consequences, and that is not nice.

I can see three ways forward:

- The first way to handle this (and the one requiring most work)
is to get rid of the lock altogether. There are two places in
mm/huge_memory.c (in __split_huge_zero_page_pmd, around line 2078,
and in __split_huge_pmd_locked, around line 2237) where a
temporary pmd_t is passed to pmd_populate() just to get a pte_t
for an address:

pmd_populate(mm, &_pmd, pgtable);
pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte);

Variable _pmd is never used afterwards in either of the cases.
Such a temporary variable is not subject to concurrent access,
but (correct me if I am wrong) pmd_populate() is designed for the
general case and uses atomics, at least on x86. Avoiding atomics
here would benefit everyone, since they are not needed, as you
pointed out, but pmd_populate() is an arch hook, so the new
"pmd_populate+pte_offset_map without an intermediate pmd_t" would
also have to be. A possible course of action for this would be to
first add a generic pmd_populate+pte_offset_map implementation in
mm/pgtable-generic.c inside a __HAVE_ARCH_... guard and then add
per-arch specific variants. This would remove the need for
alignment because the new hook could dispense with atomics
entirely, knowing that it is dealing with a local variable.

- The second way would be to keep the use of pmd_populate() as it
is, but make the variable aligned so that it does not trip over a
split lock. This is what my original patch intended to do, because
I had not realised that the lock is not needed, as you pointed out.
On my system, this increases the size of the kernel code by 7K, but
increasing the alignment of a variable should not have any other
impact.

- The third way would be to disable split lock detection on x86-32.
This can be as simple as setting the default to "none" in
sld_state_setup(). Not the most elegant of solutions, but beats
having unresponsive tasks.

Would going for the first one be worth the trouble?

Regards,
Javier