Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
From: Barry Song
Date: Thu Jul 25 2024 - 06:22:42 EST
On Thu, Jul 25, 2024 at 5:58 PM Hailong Liu <hailong.liu@xxxxxxxx> wrote:
>
> On Thu, 25. Jul 21:34, Barry Song wrote:
> > On Thu, Jul 25, 2024 at 9:17 PM Hailong Liu <hailong.liu@xxxxxxxx> wrote:
> > >
> > > On Thu, 25. Jul 18:21, Barry Song wrote:
> > > > On Thu, Jul 25, 2024 at 3:53 PM <hailong.liu@xxxxxxxx> wrote:
> > > [snip]
> > > >
> > > > This is still incorrect because it undoes Michal's work. We also need to break
> > > > the loop if (!nofail), which you're currently omitting.
> > >
> > > IIUC, the original issue was kvcalloc() with __GFP_NOFAIL returning NULL.
> > > https://lore.kernel.org/all/ZAXynvdNqcI0f6Us@xxxxxxxxxxxxxx/T/#u
> > > If we disable the huge flag in kvmalloc_node(), that issue will be fixed.
> >
> > No, this just bypasses kvmalloc and doesn't solve the underlying issue. Problems
> > can still be triggered by vmalloc_huge() even after the bypass. Once we
> > reorganize vmap_huge to support the combination of PMD and PTE
> > mapping, we should re-enable HUGE_VMAP for kvmalloc.
> Totally agree. This will take some time to support. As in [1], I plan to fix
> it with an offset in page_private that indicates where the fallback happened.
>
> >
> > I would consider dropping VM_ALLOW_HUGE_VMAP for kvmalloc as
> > a short-term "optimization" to save memory rather than a long-term fix. This
> > "optimization" is only valid until we reorganize HUGE_VMAP in a way
> > similar to THP. I mean, for a 2.1MB kvmalloc, we can map 2MB as PMD
> > and 0.1MB as PTE.
> However, this only fixes kvmalloc_node(); for others who call
> vmalloc_huge(), the issue still exists. That is why I removed Michal's code.
> Sorry for this.
My proposal was to fall back to order-0 for __GFP_NOFAIL even before
calling vm_area_alloc_pages(), as a short-term quick "fix".
We need to meet three conditions to do HUGE_VMAP:
1. vmap_allow_huge
2. vm_flags & VM_ALLOW_HUGE_VMAP
3. !(gfp_flags & __GFP_NOFAIL)
This is because if we fall back within vm_area_alloc_pages(), the
caller still expects vm_area_alloc_pages() to return contiguous 2MB
memory. By removing this assumption from the caller, the caller knows
that vm_area_alloc_pages() is returning small pages. That means the
vm_area gets 0 as page_order from the very beginning if we have
__GFP_NOFAIL in gfp_flags.
Other fixes appear to require significant changes to the source code
and can't be done quickly.
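
To make the three conditions above concrete, here is a minimal userspace
sketch, not kernel code. The DEMO_* bit values and the demo_* helper names
are assumptions made up for illustration; only the shape of the check
mirrors what is proposed.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit values for this sketch only; not taken from kernel headers. */
#define DEMO_VM_ALLOW_HUGE_VMAP 0x400UL
#define DEMO_GFP_NOFAIL         0x8000U

/* Hypothetical helper: true only when all three conditions hold. */
bool demo_want_huge_vmap(bool vmap_allow_huge, unsigned long vm_flags,
                         unsigned int gfp_mask)
{
    if (!vmap_allow_huge)
        return false;                 /* 1. huge vmap disabled globally */
    if (!(vm_flags & DEMO_VM_ALLOW_HUGE_VMAP))
        return false;                 /* 2. caller did not opt in */
    if (gfp_mask & DEMO_GFP_NOFAIL)
        return false;                 /* 3. NOFAIL: stay on order-0 */
    return true;
}

/*
 * Hypothetical helper: the page order the caller assumes for the area.
 * With NOFAIL set it is 0 from the start, so nothing downstream expects
 * physically contiguous huge-order chunks.
 */
unsigned int demo_page_order(bool vmap_allow_huge, unsigned long vm_flags,
                             unsigned int gfp_mask, unsigned int huge_order)
{
    return demo_want_huge_vmap(vmap_allow_huge, vm_flags, gfp_mask)
               ? huge_order : 0;
}

/* Small self-check of the two cases discussed in this thread. */
void demo_self_check(void)
{
    assert(demo_page_order(true, DEMO_VM_ALLOW_HUGE_VMAP, 0, 9) == 9);
    assert(demo_page_order(true, DEMO_VM_ALLOW_HUGE_VMAP,
                           DEMO_GFP_NOFAIL, 9) == 0);
}
```

The point of deciding this in the caller is that the order is fixed before
any allocation is attempted, so a later fallback inside the allocator can
never disagree with what the mapping code assumes.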
>
> >
> > > >
> > > > To avoid reverting Michal's work, the simplest "fix" would be,
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index caf032f0bd69..0011ca30df1c 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -3775,7 +3775,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
> > > > return NULL;
> > > > }
> > > >
> > > > - if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
> > > > + if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) && !(gfp_mask & __GFP_NOFAIL)) {
> > > > unsigned long size_per_node;
> > > >
> > > > /*
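
One detail worth spelling out for the condition in this hunk: the operator
in front of !(gfp_mask & __GFP_NOFAIL) has to be logical &&. A bitwise &
there would AND the masked flag bit with the 0/1 result of the negation,
which is 0 whenever the flag is not bit 0. A minimal userspace sketch,
assuming made-up DEMO_* bit values and hypothetical demo_* names:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit values for this sketch only; not taken from kernel headers. */
#define DEMO_VM_ALLOW_HUGE_VMAP 0x400UL
#define DEMO_GFP_NOFAIL         0x8000U

/*
 * Bitwise variant: (vm_flags & 0x400) is 0 or 0x400, and !(...) is 0 or 1.
 * 0x400 & 1 == 0, so with these bit values the result is always false.
 */
bool demo_cond_bitwise(bool allow, unsigned long vm_flags, unsigned int gfp)
{
    return allow &&
           ((vm_flags & DEMO_VM_ALLOW_HUGE_VMAP) & !(gfp & DEMO_GFP_NOFAIL));
}

/* Logical variant: each term is tested for non-zero on its own. */
bool demo_cond_logical(bool allow, unsigned long vm_flags, unsigned int gfp)
{
    return allow &&
           (vm_flags & DEMO_VM_ALLOW_HUGE_VMAP) &&
           !(gfp & DEMO_GFP_NOFAIL);
}
```

With the bitwise form the huge path would never be taken at all, which
hides the problem rather than fixing it only for the NOFAIL case.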
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@xxxxxxxx/
> > > > > 2.34.1
> > > >
> > > > Thanks
> > > > Barry
> > >
> > > --
> > > help you, help me,
> > > Hailong.
>
> --
> help you, help me,
> Hailong.