Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

From: Mark Rutland
Date: Wed Feb 07 2024 - 07:24:38 EST


On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > >
> > > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > > HVO was previously disabled on arm64 [1] due to the lack of necessary
> > > > > > BBM(break-before-make) logic when changing page tables.
> > > > > > This set of patches fix this by adding necessary BBM sequence when
> > > > > > changing page table, and supporting vmemmap page fault handling to
> > > > > > fixup kernel address translation fault if vmemmap is concurrently accessed.
> > > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > > are we sure this is unlocked when the fault was taken?
> > > > I think this situation is impossible. In the implementation of the second
> > > > patch, when the page table is being corrupted
> > > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > > holds the init_mm.page_table_lock,
> > > > and unlock it until page table update is done.Another thread could not hold
> > > > the init_mm.page_table_lock and
> > > > also trigger a page fault at the same time.
> > > > If I have missed any points in my thinking, please correct me. Thank you.
> > >
> > > It still strikes me as incredibly fragile to handle the fault and trying
> > > to reason about all the users of 'struct page' is impossible. For example,
> > > can the fault happen from irq context?
> >
> > The pte lock cannot be taken in irq context (which I think is what
> > you're asking?) While it is not possible to reason about all users of
> > struct page, we are somewhat relieved of that work by noting that this is
> > only for hugetlbfs, so we don't need to reason about slab, page tables,
> > netmem or zsmalloc.
>
> My concern is that an interrupt handler tries to access a 'struct page'
> which faults due to another core splitting a pmd mapping for the vmemmap.
> In this case, I think we'll end up trying to resolve the fault from irq
> context, which will try to take the spinlock.

I think that (as per my comments on patch 2), a similar deadlock can happen on
RT even if the vmemmap is only accessed in regular process context, and at
minimum this needs better comentary and/or lockdep assertions.

I'd also prefer that we dropped this for now.

> Avoiding the fault would make this considerably more robust and the
> architecture has introduced features to avoid break-before-make in some
> circumstances (see FEAT_BBM and its levels), so having this optimisation
> conditional on that would seem to be a better approach in my opinion.

FWIW, that's my position too.

Mark.