Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
From: Barry Song
Date: Thu Jul 25 2024 - 02:21:57 EST
On Thu, Jul 25, 2024 at 3:53 PM <hailong.liu@xxxxxxxx> wrote:
>
> From: "Hailong.Liu" <hailong.liu@xxxxxxxx>
>
> The scenario where the issue occurs is as follows:
> CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
> __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
> vm_area_alloc_pages(order=9) ---> order-9 allocation fails, falls back to
> order-0, and phys_addr happens to be aligned with PMD_SIZE
> vmap_pages_range
> vmap_pages_range_noflush
> __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here
>
> In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> might be issues with the __vmap_pages_range_noflush().
>
> The patch also removes VM_ALLOW_HUGE_VMAP in kvmalloc_node(). There
> are several reasons for this:
> - It increases the memory footprint because of alignment.
> - It increases the likelihood of kvmalloc allocation failures.
> - Removing it fixes the original issue of kvmalloc with __GFP_NOFAIL
>   possibly returning NULL.
> Besides, if drivers want huge vmap mappings, they should use
> vmalloc_huge instead.
>
> Fix it by disabling the fallback and removing VM_ALLOW_HUGE_VMAP in
> kvmalloc_node().
> Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
>
> CC: Barry Song <21cnbao@xxxxxxxxx>
> CC: Baoquan He <bhe@xxxxxxxxxx>
> CC: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Reported-by: Tangquan.Zheng <zhengtangquan@xxxxxxxx>
> Signed-off-by: Hailong.Liu <hailong.liu@xxxxxxxx>
The implementation of HUGE_VMAP appears to be quite disorganized.
A major change is needed.
1. When allocating 2.1MB with kvmalloc, we shouldn't allocate 4MB of memory,
   which is what HUGE_VMAP does now. This is even worse than PMD-mapped THP
   for userspace; we don't even do this for THP. The vmap could instead be
   done with one PMD mapping plus 0.1MB of PTE mappings.
2. We need to allow fallback to order-0 pages if we're unable to allocate 2MB.
   In this case, we should perform PMD/PTE mapping based on how the pages
   were actually acquired, rather than assuming they always form contiguous
   2MB blocks.
3. Memory is entirely corrupted after Michal's "mm, vmalloc: fix high order
   __GFP_NOFAIL allocations", but without that commit, forcing 2MB
   allocations was causing OOM.
> ---
> mm/util.c | 2 +-
> mm/vmalloc.c | 9 ---------
> 2 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 669397235787..b23133b738cf 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -657,7 +657,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
> * protection games.
> */
> return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> - flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> + flags, PAGE_KERNEL, 0,
> node, __builtin_return_address(0));
I'd vote +1 for this. We don't want to waste memory, for example wasting
1.9MB while allocating 2.1MB with kvmalloc. But this should be a separate
patch.
> }
> EXPORT_SYMBOL(kvmalloc_node);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 03c78fae06f3..1914768f473e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3577,15 +3577,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> page = alloc_pages(alloc_gfp, order);
> else
> page = alloc_pages_node(nid, alloc_gfp, order);
> - if (unlikely(!page)) {
> - if (!nofail)
> - break;
> -
> - /* fall back to the zero order allocations */
> - alloc_gfp |= __GFP_NOFAIL;
> - order = 0;
> - continue;
> - }
>
> /*
> * Higher order allocations must be able to be treated as
> --
> After [1] I checked the code and I can't find a reasonable band-aid to fix
> this, so the v2 patch works but is ugly. Glad to hear a better solution :)
This is still incorrect because it undoes Michal's work. We also need to break
the loop if (!nofail), which you're currently omitting.
To avoid reverting Michal's work, the simplest "fix" would be:
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index caf032f0bd69..0011ca30df1c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3775,7 +3775,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
 		return NULL;
 	}

-	if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
+	if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) && !(gfp_mask & __GFP_NOFAIL)) {
 		unsigned long size_per_node;
 		/*
>
> [1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@xxxxxxxx/
> 2.34.1
Thanks
Barry