Re: [PATCH v2 2/2] mm: fix initialization of struct page for holes in memory layout

From: Qian Cai
Date: Mon Jan 04 2021 - 14:04:40 EST


On Wed, 2020-12-09 at 23:43 +0200, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@xxxxxxxxxxxxx>
>
> There could be struct pages that are not backed by actual physical memory.
> This can happen when the actual memory bank is not a multiple of
> SECTION_SIZE or when an architecture does not register memory holes
> reserved by the firmware as memblock.memory.
>
> Such pages are currently initialized using init_unavailable_mem() function
> that iterates through PFNs in holes in memblock.memory and if there is a
> struct page corresponding to a PFN, the fields of this page are set to
> default values and it is marked as Reserved.
>
> init_unavailable_mem() does not take into account zone and node the page
> belongs to and sets both zone and node links in struct page to zero.
>
> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> instance in a configuration below:
>
> # grep -A1 E820 /proc/iomem
> 7a17b000-7a216fff : Unknown E820 type
> 7a217000-7bffffff : System RAM
>
> unset zone link in struct page will trigger
>
> VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
>
> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link in
> struct page) in the same pageblock.
>
> Interleave initialization of pages that correspond to holes with the
> initialization of memory map, so that zone and node information will be
> properly set on such pages.
>
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
> Reported-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxx>

Reverting this commit on top of today's linux-next fixes a crash when
reading /proc/kpagecount on a NUMA server:

[ 8858.006726][T99897] BUG: unable to handle page fault for address: fffffffffffffffe
[ 8858.014814][T99897] #PF: supervisor read access in kernel mode
[ 8858.020686][T99897] #PF: error_code(0x0000) - not-present page
[ 8858.026557][T99897] PGD 1371417067 P4D 1371417067 PUD 1371419067 PMD 0
[ 8858.033224][T99897] Oops: 0000 [#1] SMP KASAN NOPTI
[ 8858.038710][T99897] CPU: 28 PID: 99897 Comm: proc01 Tainted: G O 5.11.0-rc1-next-20210104 #1
[ 8858.048515][T99897] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 03/09/2018
[ 8858.057794][T99897] RIP: 0010:kpagecount_read+0x1be/0x5e0
PageSlab at include/linux/page-flags.h:342
(inlined by) kpagecount_read at fs/proc/page.c:69
[ 8858.063717][T99897] Code: 3c 30 00 0f 85 29 03 00 00 48 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 89 c2 48 c1 ea 03 42 80 3c 32 00 0f 85 e7 02 00 00 <48> 83 38 ff 0f 84 f3 01 00 00 48 89 c8 48 c1 e8 03 42 80 3c 30 00
[ 8858.083303][T99897] RSP: 0018:ffffc9002159fdd0 EFLAGS: 00010246
[ 8858.089637][T99897] RAX: fffffffffffffffe RBX: ffffea0011fce000 RCX: ffffea0011fce008
[ 8858.097518][T99897] RDX: 1fffffffffffffff RSI: 000000000064d7c0 RDI: ffffffff951f91c8
[ 8858.105396][T99897] RBP: 000000000064d7c0 R08: ffffed129063f402 R09: ffffed129063f402
[ 8858.113760][T99897] R10: ffff8894831fa00b R11: ffffed129063f401 R12: 000000000047f380
[ 8858.121639][T99897] R13: 0000000000000400 R14: dffffc0000000000 R15: 000000000064d7c0
[ 8858.129517][T99897] FS: 00007fd18849d040(0000) GS:ffff88a02fc00000(0000) knlGS:0000000000000000
[ 8858.138886][T99897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8858.145369][T99897] CR2: fffffffffffffffe CR3: 0000001c8b5d0000 CR4: 00000000003506e0
[ 8858.153247][T99897] Call Trace:
[ 8858.156415][T99897] proc_reg_read+0x1a6/0x240
[ 8858.161345][T99897] vfs_read+0x175/0x440
[ 8858.165383][T99897] ksys_read+0xf1/0x1c0
[ 8858.169420][T99897] ? vfs_write+0x870/0x870
[ 8858.173719][T99897] ? task_work_run+0xeb/0x170
[ 8858.178284][T99897] ? syscall_enter_from_user_mode+0x1c/0x40
[ 8858.184073][T99897] do_syscall_64+0x33/0x40
[ 8858.188863][T99897] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 8858.194652][T99897] RIP: 0033:0x7fd187da1d5d
[ 8858.198952][T99897] Code: 31 11 2b 00 31 c9 64 83 3e 0b 75 ca eb b8 e8 ca fb ff ff 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 39 ca 77 2b 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 0b c3 66 2e 0f 1f 84 00 00 00 00 00 48 8b 15
[ 8858.218978][T99897] RSP: 002b:00007ffe733de1f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 8858.227297][T99897] RAX: ffffffffffffffda RBX: 00007ffe733df370 RCX: 00007fd187da1d5d
[ 8858.235824][T99897] RDX: 0000000000000400 RSI: 000000000064d7c0 RDI: 0000000000000004
[ 8858.243739][T99897] RBP: 0000000000000400 R08: 00000000018fbe73 R09: 00007fd187e13d40
[ 8858.251617][T99897] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000023f9c00
[ 8858.259496][T99897] R13: 0000000000000004 R14: 000000000044663c R15: 0000000000000000
[ 8858.267856][T99897] Modules linked in: vfat fat fuse vfio_pci vfio_virqfd vfio_iommu_type1 vfio loop iavf kvm_amd ses kvm enclosure irqbypass acpi_cpufreq ip_tables x_tables sd_mod smartpqi bnxt_en scsi_transport_sas tg3 i40e firmware_class libphy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: init_module]
[ 8858.296328][T99897] CR2: fffffffffffffffe
[ 8858.300365][T99897] ---[ end trace a307ff8b6e284ee0 ]---
[ 8858.305712][T99897] RIP: 0010:kpagecount_read+0x1be/0x5e0
[ 8858.311613][T99897] Code: 3c 30 00 0f 85 29 03 00 00 48 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 89 c2 48 c1 ea 03 42 80 3c 32 00 0f 85 e7 02 00 00 <48> 83 38 ff 0f 84 f3 01 00 00 48 89 c8 48 c1 e8 03 42 80 3c 30 00
[ 8858.331200][T99897] RSP: 0018:ffffc9002159fdd0 EFLAGS: 00010246
[ 8858.337573][T99897] RAX: fffffffffffffffe RBX: ffffea0011fce000 RCX: ffffea0011fce008
[ 8858.345454][T99897] RDX: 1fffffffffffffff RSI: 000000000064d7c0 RDI: ffffffff951f91c8
[ 8858.353333][T99897] RBP: 000000000064d7c0 R08: ffffed129063f402 R09: ffffed129063f402
[ 8858.361618][T99897] R10: ffff8894831fa00b R11: ffffed129063f401 R12: 000000000047f380
[ 8858.369497][T99897] R13: 0000000000000400 R14: dffffc0000000000 R15: 000000000064d7c0
[ 8858.377377][T99897] FS: 00007fd18849d040(0000) GS:ffff88a02fc00000(0000) knlGS:0000000000000000
[ 8858.386696][T99897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8858.393177][T99897] CR2: fffffffffffffffe CR3: 0000001c8b5d0000 CR4: 00000000003506e0
[ 8858.401056][T99897] Kernel panic - not syncing: Fatal exception
[ 8858.407348][T99897] Kernel Offset: 0x12600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 8858.419260][T99897] ---[ end Kernel panic - not syncing: Fatal exception ]---
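
My reading of the Code: bytes above (not verified with a debugger) is that the
load from 0x8(%rbx), the "lea -0x1(%rdx),%rax", the "and $0x1,%edx" and the
cmove are the compound head lookup inlined into PageSlab(), and that the
faulting instruction is the following read through RAX. Below is a minimal
user-space sketch of that arithmetic. It assumes the struct page at RBX was
never initialized and that its second word still holds an all-ones fill
pattern; struct fake_page and both helpers are hypothetical names for
illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the first two words of a struct page. */
struct fake_page {
	uint64_t flags;         /* offset 0 */
	uint64_t compound_head; /* offset 8; bit 0 set means "tail page" */
};

/*
 * Mirror of the logic seen in the disassembly: if bit 0 of the second
 * word is set, the head page address is that word minus one, otherwise
 * the page is its own head. The address is returned as an integer so
 * the sketch never dereferences a bogus pointer.
 */
static uint64_t head_address(const struct fake_page *page)
{
	uint64_t head;

	assert(page != NULL);
	head = page->compound_head;

	if (head & 1)
		return head - 1;
	return (uint64_t)(uintptr_t)page;
}

/* Address that would be read next for a page filled with 0xff bytes. */
static uint64_t poisoned_head_address(void)
{
	struct fake_page page = { UINT64_MAX, UINT64_MAX };

	return head_address(&page);
}
```

Under that assumption poisoned_head_address() yields 0xfffffffffffffffe, which
matches both RAX and CR2 in the oops, so the crash looks consistent with
kpagecount_read() walking a struct page that was left uninitialized.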

> ---
> mm/page_alloc.c | 152 +++++++++++++++++++++---------------------------
> 1 file changed, 65 insertions(+), 87 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dbc57dbbacd8..ea5aefef0004 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6185,24 +6185,85 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> }
> }
>
> -void __meminit __weak memmap_init(unsigned long size, int nid,
> - unsigned long zone,
> - unsigned long range_start_pfn)
> +#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> +/*
> + * Only struct pages that are backed by physical memory available to the
> + * kernel are zeroed and initialized by memmap_init_zone().
> + * But, there are some struct pages that are either reserved by firmware or
> + * do not correspond to physical page frames because the actual memory bank
> + * is not a multiple of SECTION_SIZE.
> + * Fields of those struct pages may be accessed (for example page_to_pfn()
> + * on some configuration accesses page flags) so we must explicitly
> + * initialize those struct pages.
> + */
> +static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn,
> + int zone, int node)
> {
> - unsigned long start_pfn, end_pfn;
> + unsigned long pfn;
> + u64 pgcnt = 0;
> +
> + for (pfn = spfn; pfn < epfn; pfn++) {
> + if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> + pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> + + pageblock_nr_pages - 1;
> + continue;
> + }
> + __init_single_page(pfn_to_page(pfn), pfn, zone, node);
> + __SetPageReserved(pfn_to_page(pfn));
> + pgcnt++;
> + }
> +
> + return pgcnt;
> +}
> +#else
> +static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
> + int zone, int node)
> +{
> + return 0;
> +}
> +#endif
> +
> +void __init __weak memmap_init(unsigned long size, int nid,
> + unsigned long zone,
> + unsigned long range_start_pfn)
> +{
> + unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
> unsigned long range_end_pfn = range_start_pfn + size;
> + u64 pgcnt = 0;
> int i;
>
> for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> + hole_start_pfn = clamp(hole_start_pfn, range_start_pfn,
> + range_end_pfn);
>
> if (end_pfn > start_pfn) {
> size = end_pfn - start_pfn;
> memmap_init_zone(size, nid, zone, start_pfn,
> MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
> }
> +
> + if (hole_start_pfn < start_pfn)
> + pgcnt += init_unavailable_range(hole_start_pfn,
> + start_pfn, zone, nid);
> + hole_start_pfn = end_pfn;
> }
> +
> + /*
> + * Early sections always have a fully populated memmap for the whole
> + * section - see pfn_valid(). If the last section has holes at the
> + * end and that section is marked "online", the memmap will be
> + * considered initialized. Make sure that memmap has a well defined
> + * state.
> + */
> + if (hole_start_pfn < range_end_pfn)
> + pgcnt += init_unavailable_range(hole_start_pfn, range_end_pfn,
> + zone, nid);
> +
> + if (pgcnt)
> + pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
> + zone_names[zone], pgcnt);
> }
>
> static int zone_batchsize(struct zone *zone)
> @@ -6995,88 +7056,6 @@ void __init free_area_init_memoryless_node(int nid)
> free_area_init_node(nid);
> }
>
> -#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> -/*
> - * Initialize all valid struct pages in the range [spfn, epfn) and mark them
> - * PageReserved(). Return the number of struct pages that were initialized.
> - */
> -static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
> -{
> - unsigned long pfn;
> - u64 pgcnt = 0;
> -
> - for (pfn = spfn; pfn < epfn; pfn++) {
> - if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> - pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> - + pageblock_nr_pages - 1;
> - continue;
> - }
> - /*
> - * Use a fake node/zone (0) for now. Some of these pages
> - * (in memblock.reserved but not in memblock.memory) will
> - * get re-initialized via reserve_bootmem_region() later.
> - */
> - __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> - __SetPageReserved(pfn_to_page(pfn));
> - pgcnt++;
> - }
> -
> - return pgcnt;
> -}
> -
> -/*
> - * Only struct pages that are backed by physical memory are zeroed and
> - * initialized by going through __init_single_page(). But, there are some
> - * struct pages which are reserved in memblock allocator and their fields
> - * may be accessed (for example page_to_pfn() on some configuration accesses
> - * flags). We must explicitly initialize those struct pages.
> - *
> - * This function also addresses a similar issue where struct pages are left
> - * uninitialized because the physical address range is not covered by
> - * memblock.memory or memblock.reserved. That could happen when memblock
> - * layout is manually configured via memmap=, or when the highest physical
> - * address (max_pfn) does not end on a section boundary.
> - */
> -static void __init init_unavailable_mem(void)
> -{
> - phys_addr_t start, end;
> - u64 i, pgcnt;
> - phys_addr_t next = 0;
> -
> - /*
> - * Loop through unavailable ranges not covered by memblock.memory.
> - */
> - pgcnt = 0;
> - for_each_mem_range(i, &start, &end) {
> - if (next < start)
> - pgcnt += init_unavailable_range(PFN_DOWN(next),
> - PFN_UP(start));
> - next = end;
> - }
> -
> - /*
> - * Early sections always have a fully populated memmap for the whole
> - * section - see pfn_valid(). If the last section has holes at the
> - * end and that section is marked "online", the memmap will be
> - * considered initialized. Make sure that memmap has a well defined
> - * state.
> - */
> - pgcnt += init_unavailable_range(PFN_DOWN(next),
> - round_up(max_pfn, PAGES_PER_SECTION));
> -
> - /*
> - * Struct pages that do not have backing memory. This could be because
> - * firmware is using some of this memory, or for some other reasons.
> - */
> - if (pgcnt)
> - pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
> -}
> -#else
> -static inline void __init init_unavailable_mem(void)
> -{
> -}
> -#endif /* !CONFIG_FLAT_NODE_MEM_MAP */
> -
> #if MAX_NUMNODES > 1
> /*
> * Figure out the number of possible node ids.
> @@ -7507,7 +7486,6 @@ void __init free_area_init(unsigned long *max_zone_pfn)
> /* Initialise every node */
> mminit_verify_pageflags_layout();
> setup_nr_node_ids();
> - init_unavailable_mem();
> for_each_online_node(nid) {
> pg_data_t *pgdat = NODE_DATA(nid);
> free_area_init_node(nid);
> --
> 2.28.0
>
>
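
As a reading aid for the memmap_init() hunk quoted above, here is a rough
user-space sketch of the interleaved hole walk. It only counts the PFNs that
would be handed to init_unavailable_range() and touches no struct pages.
struct pfn_range, clamp_pfn() and count_hole_pfns() are hypothetical names,
and the input ranges are assumed to be sorted and non-overlapping, as
for_each_mem_pfn_range() would return them. One deliberate difference from the
patch: the sketch clamps the hole start once more after the loop, so an empty
range list does not count PFNs below the zone.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open PFN range [start, end), standing in for a memblock.memory entry. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

static unsigned long clamp_pfn(unsigned long v, unsigned long lo,
			       unsigned long hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Count the PFNs in [zone_start, zone_end) that no entry of mem[] covers.
 * Each gap is found while walking the memory ranges, mirroring how the
 * patch interleaves hole initialization with memmap_init_zone() calls.
 */
static unsigned long count_hole_pfns(const struct pfn_range *mem, size_t n,
				     unsigned long zone_start,
				     unsigned long zone_end)
{
	unsigned long hole_start = 0;
	unsigned long holes = 0;
	size_t i;

	assert(zone_start <= zone_end);

	for (i = 0; i < n; i++) {
		unsigned long start = clamp_pfn(mem[i].start, zone_start, zone_end);
		unsigned long end = clamp_pfn(mem[i].end, zone_start, zone_end);

		hole_start = clamp_pfn(hole_start, zone_start, zone_end);

		/* Gap between the previous range and this one. */
		if (hole_start < start)
			holes += start - hole_start;
		hole_start = end;
	}

	/* Trailing gap up to the end of the zone (not in the patch as-is). */
	hole_start = clamp_pfn(hole_start, zone_start, zone_end);
	if (hole_start < zone_end)
		holes += zone_end - hole_start;

	return holes;
}
```

For two ranges [0x10, 0x20) and [0x30, 0x40) inside a zone spanning
[0x0, 0x50), the walk finds three gaps of 16 PFNs each. Whether every PFN
covered this way actually reaches __init_single_page() in the real code also
depends on the pfn_valid() pageblock check, which the sketch leaves out.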