Re: [PATCH] kho: add support for deferred struct page init
From: Pasha Tatashin
Date: Fri Dec 19 2025 - 11:29:27 EST
On Fri, Dec 19, 2025 at 4:19 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
> On Tue, Dec 16, 2025 at 10:36:01AM -0500, Pasha Tatashin wrote:
> > On Tue, Dec 16, 2025 at 10:19 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Dec 16, 2025 at 10:05:27AM -0500, Pasha Tatashin wrote:
> > > > > > +static struct page *__init kho_get_preserved_page(phys_addr_t phys,
> > > > > > + unsigned int order)
> > > > > > +{
> > > > > > + unsigned long pfn = PHYS_PFN(phys);
> > > > > > + int nid = early_pfn_to_nid(pfn);
> > > > > > +
> > > > > > + for (int i = 0; i < (1 << order); i++)
> > > > > > + init_deferred_page(pfn + i, nid);
> > > > >
> > > > > This will skip pages below node->first_deferred_pfn, we need to use
> > > > > __init_page_from_nid() here.
> > > >
> > > > Mike, but those struct pages should be initialized early anyway. If
> > > > they are not yet initialized we have a problem, as they are going to
> > > > be re-initialized later.
> > >
> > > Can't say I understand your point. Which pages should be initialized early?
> >
> > All pages below node->first_deferred_pfn.
> >
> > > And which pages will be reinitialized?
> >
> > kho_memory_init() is called after free_area_init() (which calls
> > memmap_init_range to initialize low memory struct pages). So, if we
> > use __init_page_from_nid() as suggested, we would be blindly running
> > __init_single_page() again on those low-memory pages that
> > memmap_init_range() already set up. This would cause double
> > initialization and corruptions due to losing the order information.
> >
> > > > > > +
> > > > > > + return pfn_to_page(pfn);
> > > > > > +}
> > > > > > +
> > > > > > static void __init deserialize_bitmap(unsigned int order,
> > > > > > struct khoser_mem_bitmap_ptr *elm)
> > > > > > {
> > > > > > @@ -449,7 +466,7 @@ static void __init deserialize_bitmap(unsigned int order,
> > > > > > int sz = 1 << (order + PAGE_SHIFT);
> > > > > > phys_addr_t phys =
> > > > > > elm->phys_start + (bit << (order + PAGE_SHIFT));
> > > > > > - struct page *page = phys_to_page(phys);
> > > > > > + struct page *page = kho_get_preserved_page(phys, order);
> > > > >
> > > > > I think it's better to initialize deferred struct pages later in
> > > > > kho_restore_page. deserialize_bitmap() runs before SMP and it already does
> > > >
> > > > The KHO memory should still be accessible early in boot, right?
> > >
> > > The memory is accessible. And we anyway should not use struct page for
> > > preserved memory before kho_restore_{folio,pages}.
> >
> > This makes sense. What happens if someone calls kho_restore_folio()
> > before deferred pages are initialized?
>
> That's fine, because this memory is still memblock_reserve()ed and deferred
> init skips reserved ranges.
> There is a problem however with the calls to kho_restore_{pages,folio}()
> after memblock is gone because we can't use early_pfn_to_nid() then.
I agree with you regarding memblock and early_pfn_to_nid(), but I
don't think we need to rely on early_pfn_to_nid() during the restore
phase.
> I think we can start with Evangelos' approach that initializes struct pages
> at deserialize time and then we'll see how to optimize it.
Let's do the lazy tail initialization that I proposed to you in a
chat: we initialize only the head struct page during
deserialize_bitmap(). Since this happens while memblock is still
active, we can safely use early_pfn_to_nid() to set the nid in the
head page's flags, and also preserve order as we do today.
Then, we can defer the initialization of all tail pages to
kho_restore_folio(). At that stage, we no longer need memblock or
early_pfn_to_nid(); we can simply inherit the nid from the head page
using page_to_nid(head).
This approach seems to give us the best of both worlds: it avoids the
memblock dependency during restoration, keeps the serial work in
deserialize_bitmap() to a minimum (O(1) per region), and allows the
heavy lifting of tail page initialization to be done later in the boot
process, potentially in parallel, as you suggested.
Thanks,
Pasha