Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
From: David Hildenbrand
Date: Wed Oct 29 2025 - 08:15:51 EST
On 29.10.25 11:44, Heiko Carstens wrote:
On Wed, Oct 29, 2025 at 10:57:15AM +0100, David Hildenbrand wrote:
On 28.10.25 22:15, Luiz Capitulino wrote:...
A reproducible crash occurs when enabling HugeTLB vmemmap optimization (HVO)
on s390. The crash and the proposed fix were worked on an s390 KVM guest
running on an older hypervisor, as I don't have access to an LPAR. However,
the same issue should occur on bare-metal.
This commit fixes this by implementing flush_tlb_all() on s390 as an
alias to __tlb_flush_global(). This should cause a flush on all TLB
entries on all CPUs as expected by the flush_tlb_all() semantics.
Fixes: f13b83fdd996 ("hugetlb: batch TLB flushes when freeing vmemmap")
Signed-off-by: Luiz Capitulino <luizcap@xxxxxxxxxx>
---
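For readers without the patch hunk at hand, here is a minimal user-space sketch of what "flush_tlb_all() as an alias to __tlb_flush_global()" amounts to. It is not the actual s390 code: the per-CPU TLB array, NR_CPUS, TLB_ENTRIES and tlb_cached_entries() are made up for illustration, and __tlb_flush_global() is only a stand-in for the real helper named in the commit message.

```c
#include <assert.h>
#include <string.h>

#define NR_CPUS     4	/* hypothetical CPU count for the toy model */
#define TLB_ENTRIES 8	/* hypothetical TLB size per CPU */

/* Toy model: one small TLB per CPU; nonzero means a cached translation. */
static unsigned long tlb[NR_CPUS][TLB_ENTRIES];

/* Stand-in for the s390 helper: drop every entry on every CPU. */
static void __tlb_flush_global(void)
{
	memset(tlb, 0, sizeof(tlb));
}

/* The alias: flush_tlb_all() simply forwards to the global flush. */
static inline void flush_tlb_all(void)
{
	__tlb_flush_global();
}

/* Count translations that are still cached on any CPU. */
int tlb_cached_entries(void)
{
	int cpu, i, n = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		for (i = 0; i < TLB_ENTRIES; i++)
			if (tlb[cpu][i])
				n++;
	return n;
}
```

The point of the model is only the contract: after flush_tlb_all() returns, no CPU holds a stale translation, which a no-op implementation cannot guarantee.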
Nice finding!
Makes me wonder whether the default flush_tlb_all() should actually map to a
BUILD_BUG(), such that we don't silently not-flush on archs that don't
implement it.
Which default flush_tlb_all()? :)
What I meant is: all such functions that an architecture doesn't expect to be called because they are effectively unimplemented.
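To make the idea concrete, a rough user-space sketch, assuming GCC or a recent Clang: the "error" function attribute turns any call that survives into a build failure, which is roughly the mechanism the kernel's BUILD_BUG() relies on. ARCH_IMPLEMENTS_FLUSH_TLB_ALL, global_flushes and flush_and_count() are hypothetical names for this example only, not existing kernel symbols.

```c
#include <assert.h>

/* Hypothetical switch: an arch that really implements the flush sets this to 1. */
#define ARCH_IMPLEMENTS_FLUSH_TLB_ALL 1

static int global_flushes;	/* counts flushes in this toy model */

#if ARCH_IMPLEMENTS_FLUSH_TLB_ALL
static inline void flush_tlb_all(void)
{
	global_flushes++;	/* stand-in for the real arch-specific flush */
}
#else
/*
 * No dummy body: with the "error" attribute, any call the compiler
 * actually emits fails the build instead of silently not flushing.
 */
void flush_tlb_all(void)
	__attribute__((error("flush_tlb_all() is not implemented on this arch")));
#endif

/* A caller in "common code": flush, then report how many flushes happened. */
int flush_and_count(void)
{
	flush_tlb_all();
	return global_flushes;
}
```

Setting the switch to 0 should make flush_and_count() fail to compile, so a new common-code caller would be caught at build time rather than at runtime.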
Taking a look at flush_tlb_all(), there is really only a dummy implementation on s390x and on riscv without MMU.
So yeah, there is no "default" fallback one :)
BTW, I'm staring at s390x's flush_tlb() function and wonder why that one is defined. I'm sure there is a good reason ;)
There was a no-op implementation for s390, and besides drivers/xen/balloon.c
there is only mm/hugetlb_vmemmap.c in common code which makes use of this. To
me it looks like both call sites only need to flush TLB entries of the kernel
address space. So I'd rather prefer if flush_tlb_all() would die instead.
I'd assume that we only modify the kernel virtual address space, so I agree.
But I'm also wondering about the correctness of the whole thing even with this
patch. If I'm not mistaken then vmemmap_split_pmd() changes an active pmd
entry of the kernel mapping. That is: an active leaf entry (aka large page) is
changed to an active entry pointing to a page table.
That's my understanding as well.
Changing active entries without the detour over an invalid entry or using
proper instructions like crdte or cspg is not allowed on s390. This was solved
for other parts that change active entries of the kernel mapping in an
architecture compliant way for s390 (see arch/s390/mm/pageattr.c).
Good point. I recall ARM64 has similar break-before-make requirements because they cannot tolerate two different TLB entries (small vs. large) for the same virtual address.
And if I remember correctly, that's the reason why arm64 does not enable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP just yet.
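For illustration, the "detour over an invalid entry" ordering can be sketched in user space as below. This is a toy model under stated assumptions, not kernel code: ENTRY_INVALID, flush_tlb_kernel(), replace_entry_bbm() and the trace array are all hypothetical, and the sketch models neither concurrent page table walkers nor the crdte/cspg alternatives used in arch/s390/mm/pageattr.c.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_INVALID 0ULL	/* toy encoding of an invalid entry */
#define MAX_STEPS     8

enum step { STEP_INVALIDATE, STEP_FLUSH, STEP_INSTALL };

static enum step trace[MAX_STEPS];	/* records the order of operations */
static int nsteps;
static uint64_t cached_translation;	/* what the toy TLB currently holds */

static void record(enum step s)
{
	if (nsteps < MAX_STEPS)
		trace[nsteps++] = s;
}

/* Hypothetical flush of kernel translations in the toy TLB. */
static void flush_tlb_kernel(void)
{
	cached_translation = ENTRY_INVALID;
	record(STEP_FLUSH);
}

/* Replace an active entry via break-before-make. */
void replace_entry_bbm(uint64_t *entry, uint64_t new_val)
{
	*entry = ENTRY_INVALID;		/* break: no valid entry is visible */
	record(STEP_INVALIDATE);
	flush_tlb_kernel();		/* drop the old cached translation */
	*entry = new_val;		/* make: install the new entry */
	record(STEP_INSTALL);
}
```

At no point in this sequence do the old large-page translation and the new page-table entry coexist as valid, which is the property a direct overwrite of the active pmd does not give you.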
--
Cheers
David / dhildenb