Re: [PATCHv14 5/9] efi: Add unaccepted memory support

From: Michael Roth
Date: Tue Oct 10 2023 - 17:05:52 EST


On Tue, Jun 06, 2023 at 05:26:33PM +0300, Kirill A. Shutemov wrote:
> efi_config_parse_tables() reserves memory that holds unaccepted memory
> configuration table so it won't be reused by page allocator.
>
> Core-mm requires few helpers to support unaccepted memory:
>
> - accept_memory() checks the range of addresses against the bitmap and
> accept memory if needed.
>
> - range_contains_unaccepted_memory() checks if anything within the
> range requires acceptance.
>
> Architectural code has to provide efi_get_unaccepted_table() that
> returns pointer to the unaccepted memory configuration table.
>
> arch_accept_memory() handles arch-specific part of memory acceptance.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> Reviewed-by: Ard Biesheuvel <ardb@xxxxxxxxxx>
> Reviewed-by: Tom Lendacky <thomas.lendacky@xxxxxxx>
> ---
> arch/x86/platform/efi/efi.c | 3 +
> drivers/firmware/efi/Makefile | 1 +
> drivers/firmware/efi/efi.c | 25 +++++
> drivers/firmware/efi/unaccepted_memory.c | 112 +++++++++++++++++++++++
> include/linux/efi.h | 1 +
> 5 files changed, 142 insertions(+)
> create mode 100644 drivers/firmware/efi/unaccepted_memory.c
>
> diff --git a/drivers/firmware/efi/unaccepted_memory.c b/drivers/firmware/efi/unaccepted_memory.c
> new file mode 100644
> index 000000000000..08a9a843550a
> --- /dev/null
> +++ b/drivers/firmware/efi/unaccepted_memory.c
> @@ -0,0 +1,112 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/efi.h>
> +#include <linux/memblock.h>
> +#include <linux/spinlock.h>
> +#include <asm/unaccepted_memory.h>
> +
> +/* Protects unaccepted memory bitmap */
> +static DEFINE_SPINLOCK(unaccepted_memory_lock);
> +
> +/*
> + * accept_memory() -- Consult bitmap and accept the memory if needed.
> + *
> + * Only memory that is explicitly marked as unaccepted in the bitmap requires
> + * an action. All the remaining memory is implicitly accepted and doesn't need
> + * acceptance.
> + *
> + * No need to accept:
> + * - anything if the system has no unaccepted table;
> + * - memory that is below phys_base;
> + * - memory that is above the memory that addressable by the bitmap;
> + */
> +void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> + struct efi_unaccepted_memory *unaccepted;
> + unsigned long range_start, range_end;
> + unsigned long flags;
> + u64 unit_size;
> +
> + unaccepted = efi_get_unaccepted_table();
> + if (!unaccepted)
> + return;
> +
> + unit_size = unaccepted->unit_size;
> +
> + /*
> + * Only care for the part of the range that is represented
> + * in the bitmap.
> + */
> + if (start < unaccepted->phys_base)
> + start = unaccepted->phys_base;
> + if (end < unaccepted->phys_base)
> + return;
> +
> + /* Translate to offsets from the beginning of the bitmap */
> + start -= unaccepted->phys_base;
> + end -= unaccepted->phys_base;
> +
> + /* Make sure not to overrun the bitmap */
> + if (end > unaccepted->size * unit_size * BITS_PER_BYTE)
> + end = unaccepted->size * unit_size * BITS_PER_BYTE;
> +
> + range_start = start / unit_size;
> +
> + spin_lock_irqsave(&unaccepted_memory_lock, flags);
> + for_each_set_bitrange_from(range_start, range_end, unaccepted->bitmap,
> + DIV_ROUND_UP(end, unit_size)) {
> + unsigned long phys_start, phys_end;
> + unsigned long len = range_end - range_start;
> +
> + phys_start = range_start * unit_size + unaccepted->phys_base;
> + phys_end = range_end * unit_size + unaccepted->phys_base;
> +
> + arch_accept_memory(phys_start, phys_end);
> + bitmap_clear(unaccepted->bitmap, range_start, len);
> + }
> + spin_unlock_irqrestore(&unaccepted_memory_lock, flags);
> +}

While testing SNP guests running today's tip/master (ef19bc9dddc3) I ran
into what seems to be fairly significant lock contention due to the
unaccepted_memory_lock spinlock above, which results in a constant stream
of soft-lockups until the workload gets all its memory accepted/faulted
in if the guest has around 16+ vCPUs.

I've included the guest dmesg traces I was seeing below.

In this case I was running a 32 vCPU guest with 200GB of memory running on
a 256 thread EPYC (Milan) system, and can trigger the above situation fairly
reliably by running the following workload in a freshly-booted guests:

stress --vm 32 --vm-bytes 5G --vm-keep

Scaling up the number of stress threads and vCPUs should make it easier
to reproduce.

Other than unresponsiveness/lockup messages until the memory is accepted,
the guest seems to continue running fine, but for large guests where
unaccepted memory is more likely to be useful, it seems like it could be
an issue, especially when consider 100+ vCPU guests.

-Mike

[ 105.753073] watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [stress:1590]
[ 105.756109] Modules linked in: btrfs(E) blake2b_generic(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid1(E) raid0(E) multipath(E) linear(E) crc32_pclmul(E) virtio_net(E) i2c_i801(E) net_failover(E) i2c_smbus(E) psmouse(E) failover(E) virtio_scsi(E) lpc_ich(E)
[ 105.771644] irq event stamp: 392274
[ 105.773852] hardirqs last enabled at (392273): [<ffffffff9fbddf9a>] _raw_spin_unlock_irqrestore+0x5a/0x70
[ 105.780058] hardirqs last disabled at (392274): [<ffffffff9fbddca9>] _raw_spin_lock_irqsave+0x69/0x70
[ 105.785486] softirqs last enabled at (392158): [<ffffffff9fbdf663>] __do_softirq+0x2a3/0x357
[ 105.790627] softirqs last disabled at (392149): [<ffffffff9ecd0f5c>] irq_exit_rcu+0x8c/0xb0
[ 105.795972] CPU: 3 PID: 1590 Comm: stress Tainted: G EL 6.6.0-rc5-snp-guest-v10n+ #1
[ 105.802416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 105.807040] RIP: 0010:_raw_spin_unlock_irqrestore+0x45/0x70
[ 105.810327] Code: b1 b1 18 ff 4c 89 e7 e8 79 e7 18 ff 81 e3 00 02 00 00 75 26 9c 58 0f 1f 40 00 f6 c4 02 75 22 48 85 db 74 06 fb 0f 1f 44 00 00 <65> ff 0d fc 5d 46 60 5b 41 5c 5d c3 cc cc cc cc e8 c6 9f 28 ff eb
[ 105.818749] RSP: 0000:ffffc9000585faa8 EFLAGS: 00000206
[ 105.821155] RAX: 0000000000000046 RBX: 0000000000000200 RCX: 00000000000028ce
[ 105.825258] RDX: ffffffff9f8c7830 RSI: ffffffff9fbddf9a RDI: ffffffff9fbddf9a
[ 105.828287] RBP: ffffc9000585fab8 R08: 0000000000000000 R09: 0000000000000000
[ 105.831322] R10: 000000ffffffffff R11: 0000000000000000 R12: ffffffffa091a160
[ 105.834713] R13: 00000000000028cd R14: ffff88807c459030 R15: ffff88807c459018
[ 105.838230] FS: 00007fa4a8ee8740(0000) GS:ffff88b1aff80000(0000) knlGS:0000000000000000
[ 105.841677] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 105.844115] CR2: 00007fa3719ec010 CR3: 000800010f7c2003 CR4: 0000000000770ef0
[ 105.847160] PKRU: 55555554
[ 105.848490] Call Trace:
[ 105.849634] <IRQ>
[ 105.850517] ? show_regs+0x68/0x70
[ 105.851986] ? watchdog_timer_fn+0x1dc/0x240
[ 105.853868] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 105.855827] ? __hrtimer_run_queues+0x1c3/0x340
[ 105.857821] ? hrtimer_interrupt+0x109/0x240
[ 105.859673] ? __sysvec_apic_timer_interrupt+0x67/0x170
[ 105.861941] ? sysvec_apic_timer_interrupt+0x7b/0x90
[ 105.864048] </IRQ>
[ 105.864981] <TASK>
[ 105.865901] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ 105.868110] ? accept_memory+0x150/0x170
[ 105.869795] ? _raw_spin_unlock_irqrestore+0x5a/0x70
[ 105.871885] ? _raw_spin_unlock_irqrestore+0x5a/0x70
[ 105.873966] ? _raw_spin_unlock_irqrestore+0x45/0x70
[ 105.876047] accept_memory+0x150/0x170
[ 105.877665] try_to_accept_memory+0x134/0x1b0
[ 105.879532] get_page_from_freelist+0xa3e/0x1370
[ 105.881479] ? lock_acquire+0xd8/0x2b0
[ 105.883045] __alloc_pages+0x1b7/0x390
[ 105.884619] __folio_alloc+0x1b/0x50
[ 105.886151] ? policy_node+0x57/0x70
[ 105.887673] vma_alloc_folio+0xa6/0x360
[ 105.889325] do_pte_missing+0x1a5/0x8b0
[ 105.890948] __handle_mm_fault+0x75e/0xda0
[ 105.892693] handle_mm_fault+0xe1/0x2d0
[ 105.894304] do_user_addr_fault+0x1ce/0xa40
[ 105.896050] ? exit_to_user_mode_prepare+0xa4/0x230
[ 105.898113] exc_page_fault+0x84/0x200
[ 105.899693] asm_exc_page_fault+0x27/0x30
[ 105.901405] RIP: 0033:0x55ebd4602cf0
[ 105.902926] Code: 8b 54 24 0c 31 c0 85 d2 0f 94 c0 89 04 24 41 83 fd 02 0f 8f 3e 02 00 00 48 85 ed 48 89 d8 7e 1b 66 2e 0f 1f 84 00 00 00 00 00 <c6> 00 5a 4c 01 f8 48 89 c2 48 29 da 48 39 d5 7f ef 49 83 fc 00 0f
[ 105.910671] RSP: 002b:00007ffebfede470 EFLAGS: 00010206
[ 105.912897] RAX: 00007fa3719ec010 RBX: 00007fa3683ff010 RCX: 00007fa3683ff010
[ 105.915859] RDX: 00000000095ed000 RSI: 0000000140001000 RDI: 0000000000000000
[ 105.918853] RBP: 0000000140000000 R08: 00000000ffffffff R09: 0000000000000000
[ 105.921836] R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
[ 105.924810] R13: 0000000000000002 R14: fffffffffffff000 R15: 0000000000001000
[ 105.927771] </TASK>
[ 110.367774] watchdog: BUG: soft lockup - CPU#15 stuck for 32s! [stress:1594]
[ 110.369415] watchdog: BUG: soft lockup - CPU#13 stuck for 36s! [stress:1602]