Re: [PATCH] mm: hugetlb: Only prep and add allocated folios for non-gigantic pages

From: Mike Kravetz
Date: Mon Oct 09 2023 - 21:24:53 EST


On 10/09/23 15:56, Usama Arif wrote:
> Calling prep_and_add_allocated_folios when allocating gigantic pages
> at boot time causes the kernel to crash as folio_list is empty
> and iterating it causes a NULL pointer dereference. Call this only
> for non-gigantic pages when folio_list has entires.

Thanks!

However, are you sure the issue is the result of iterating through a
NULL list? For reference, the routine prep_and_add_allocated_folios is:

static void prep_and_add_allocated_folios(struct hstate *h,
struct list_head *folio_list)
{
struct folio *folio, *tmp_f;

/* Add all new pool pages to free lists in one lock cycle */
spin_lock_irq(&hugetlb_lock);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
}
spin_unlock_irq(&hugetlb_lock);
}

If folio_list is empty, then the only code that should be executed is
acquiring the lock, notice the list is empty, release the lock.

In the case of gigantic pages addressed below, I do see the warning:

[ 0.055140] DEBUG_LOCKS_WARN_ON(early_boot_irqs_disabled)
[ 0.055149] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4345 lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055153] Modules linked in:
[ 0.055155] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.0-rc4+ #40
[ 0.055157] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[ 0.055158] RIP: 0010:lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055160] Code: 00 85 c0 0f 84 5e ff ff ff 8b 0d a7 20 74 01 85 c9 0f 85 50 ff ff ff 48 c7 c6 48 25 42 82 48 c7 c7 70 7f 40 82 e8 18 10 f7 ff <0f> 0b 5b e9 e0 d8 af 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 0.055162] RSP: 0000:ffffffff82603d40 EFLAGS: 00010086 ORIG_RAX: 0000000000000000
[ 0.055164] RAX: 0000000000000000 RBX: ffffffff827911e0 RCX: 0000000000000000
[ 0.055165] RDX: 0000000000000004 RSI: ffffffff8246b3e1 RDI: 00000000ffffffff
[ 0.055166] RBP: 0000000000000002 R08: 0000000000000001 R09: 0000000000000000
[ 0.055166] R10: ffffffffffffffff R11: 284e4f5f4e524157 R12: 0000000000000001
[ 0.055167] R13: ffffffff82eb6316 R14: ffffffff82603d70 R15: ffffffff82ee5f70
[ 0.055169] FS: 0000000000000000(0000) GS:ffff888277c00000(0000) knlGS:0000000000000000
[ 0.055170] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.055171] CR2: ffff88847ffff000 CR3: 000000000263a000 CR4: 00000000000200b0
[ 0.055174] Call Trace:
[ 0.055174] <TASK>
[ 0.055175] ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055177] ? __warn+0x81/0x170
[ 0.055181] ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055182] ? report_bug+0x18d/0x1c0
[ 0.055186] ? early_fixup_exception+0x92/0xb0
[ 0.055189] ? early_idt_handler_common+0x2f/0x40
[ 0.055194] ? lockdep_hardirqs_on_prepare+0x1a8/0x1b0
[ 0.055196] trace_hardirqs_on+0x10/0xa0
[ 0.055198] _raw_spin_unlock_irq+0x24/0x50
[ 0.055201] hugetlb_hstate_alloc_pages+0x311/0x3e0
[ 0.055206] hugepages_setup+0x220/0x2c0
[ 0.055210] unknown_bootoption+0x98/0x1d0
[ 0.055213] parse_args+0x152/0x440
[ 0.055216] ? __pfx_unknown_bootoption+0x10/0x10
[ 0.055220] start_kernel+0x1af/0x6c0
[ 0.055222] ? __pfx_unknown_bootoption+0x10/0x10
[ 0.055225] x86_64_start_reservations+0x14/0x30
[ 0.055227] x86_64_start_kernel+0x74/0x80
[ 0.055229] secondary_startup_64_no_verify+0x166/0x16b
[ 0.055234] </TASK>
[ 0.055235] irq event stamp: 0
[ 0.055236] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[ 0.055238] hardirqs last disabled at (0): [<0000000000000000>] 0x0
[ 0.055239] softirqs last enabled at (0): [<0000000000000000>] 0x0
[ 0.055240] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 0.055240] ---[ end trace 0000000000000000 ]---

This is because interrupts are not enabled this early in boot, and the
spin_unlock_irq() would incorrectly enable interrupts too early. I wonder
if this 'warning' could translate to a panic or NULL deref under certain
configurations?

Konrad, I am interested to see if this addresses your booting problem. But,
your stack trace is a bit different. My 'guess' is that this will not address
your issue. If it does not, can you try the following patch? This
applies to next-20231009.
--
Mike Kravetz

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f3749fc125d4..8346c98e5616 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2178,18 +2178,19 @@ static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
static void prep_and_add_allocated_folios(struct hstate *h,
struct list_head *folio_list)
{
+ unsigned long flags;
struct folio *folio, *tmp_f;

/* Send list for bulk vmemmap optimization processing */
hugetlb_vmemmap_optimize_folios(h, folio_list);

/* Add all new pool pages to free lists in one lock cycle */
- spin_lock_irq(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
}
- spin_unlock_irq(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}

/*
@@ -3224,13 +3225,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
static void __init prep_and_add_bootmem_folios(struct hstate *h,
struct list_head *folio_list)
{
+ unsigned long flags;
struct folio *folio, *tmp_f;

/* Send list for bulk vmemmap optimization processing */
hugetlb_vmemmap_optimize_folios(h, folio_list);

/* Add all new pool pages to free lists in one lock cycle */
- spin_lock_irq(&hugetlb_lock);
+ spin_lock_irqsave(&hugetlb_lock, flags);
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
/*
@@ -3246,7 +3248,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
__prep_account_new_huge_page(h, folio_nid(folio));
enqueue_hugetlb_folio(h, folio);
}
- spin_unlock_irq(&hugetlb_lock);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
}

/*