Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
From: Will Deacon
Date: Tue Mar 26 2024 - 08:54:23 EST
On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> On 2024/3/14 7:32, David Rientjes wrote:
> > On Thu, 8 Feb 2024, Will Deacon wrote:
> > > > How about take a new lock with irq disabled during BBM, like:
> > > >
> > > > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> > > > +{
> > > > + (NEW_LOCK);
> > > > + pte_clear(&init_mm, addr, ptep);
> > > > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > > > + set_pte_at(&init_mm, addr, ptep, pte);
> > > > + spin_unlock_irq(NEW_LOCK);
> > > > +}
> > > I really think the only maintainable way to achieve this is to avoid the
> > > possibility of a fault altogether.
> > >
> > Nanyong, are you still actively working on making HVO possible on arm64?
> >
> > This would yield a substantial memory savings on hosts that are largely
> > configured with hugetlbfs. In our case, the size of this hugetlbfs pool
> > is actually never changed after boot, but it sounds from the thread that
> > there was an idea to make HVO conditional on FEAT_BBM. Is this being
> > pursued?
> >
> > If so, any testing help needed?
> I'm afraid that FEAT_BBM may not solve the problem here, because from Arm
> ARM,
> I see that FEAT_BBM is only used for changing block size. Therefore, in this
> HVO feature,
> it can work in the split PMD stage, that is, BBM can be avoided in
> vmemmap_split_pmd,
> but in the subsequent vmemmap_remap_pte, the Output address of PTE still
> needs to be
> changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my
> understanding
> of ARM FEAT_BBM is wrong, and I hope someone can correct me.
> Actually, the solution I first considered was to use the stop_machine
> method, but we have
> products that rely on /proc/sys/vm/nr_overcommit_hugepages to dynamically
> use hugepages,
> so I have to consider performance issues. If your product does not change
> the amount of huge
> pages after booting, using stop_machine() may be a feasible way.
> So far, I still haven't come up with a good solution.
Oh, I hadn't appreciated that you needed to remap the memmap live. How
do you synchronise the two copies in that case? I think we (i.e. the arch
folks) probably need some more explanation on exactly who can race with
what here, otherwise I don't grok how this can work.
Thanks,
Will