Re: [PATCH v5 04/10] x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax

From: Ard Biesheuvel
Date: Fri Sep 13 2019 - 09:00:09 EST


On Fri, 30 Aug 2019 at 03:06, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
> interpretation of the EFI Memory Types as "reserved for a specific
> purpose".
>
> The proposed Linux behavior for specific purpose memory is that it is
> reserved for direct-access (device-dax) by default and not available for
> any kernel usage, not even as an OOM fallback. Later, through udev
> scripts or another init mechanism, these device-dax claimed ranges can
> be reconfigured and hot-added to the available System-RAM with a unique
> node identifier. This device-dax management scheme implements "soft" in
> the "soft reserved" designation by allowing some or all of the
> reservation to be recovered as typical memory. This policy can be
> disabled at compile-time with CONFIG_EFI_SOFT_RESERVE=n, or runtime with
> efi=nosoftreserve.
>
> This patch introduces 2 new concepts at once given the entanglement
> between early boot enumeration relative to memory that can optionally be
> reserved from the kernel page allocator by default. The new concepts
> are:
>
> - E820_TYPE_SOFT_RESERVED: Upon detecting the EFI_MEMORY_SP
> attribute on EFI_CONVENTIONAL memory, update the E820 map with this
> new type. Only perform this classification if the
> CONFIG_EFI_SOFT_RESERVE=y policy is enabled, otherwise treat it as
> typical ram.
>
> - IORES_DESC_SOFT_RESERVED: Add a new I/O resource descriptor for
> a device driver to search iomem resources for application specific
> memory. Teach the iomem code to identify such ranges as "Soft Reserved".
>
> A follow-on change integrates parsing of the ACPI HMAT to identify the
> node and sub-range boundaries of EFI_MEMORY_SP designated memory. For
> now, just identify and reserve memory of this type.
>
> The translation of EFI_CONVENTIONAL_MEMORY + EFI_MEMORY_SP to "soft
> reserved" is x86/E820-only, but other archs could choose to publish
> IORES_DESC_SOFT_RESERVED resources from their platform-firmware memory
> map handlers. Other EFI-capable platforms would need to go audit their
> local usages of EFI_CONVENTIONAL_MEMORY to consider the soft reserved
> case.
>
> Cc: <x86@xxxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
> Cc: Darren Hart <dvhart@xxxxxxxxxxxxx>
> Cc: Andy Shevchenko <andy@xxxxxxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
> Reported-by: kbuild test robot <lkp@xxxxxxxxx>
> Reviewed-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>

Hi Dan,

I understand that non-x86 may be out of scope for you, but this patch
makes changes to x86 and generic code at the same time without regard
for other architectures.
I'd prefer it if we could cover ARM cleanly as well right at the start.

The first step would be to split out the EFI stub changes (i.e., to
avoid allocating memory from EFI_MEMORY_SP regions) and the EFI core
changes from the other changes. Then, I would like to ask for your
help to get the arm64 part implemented where EFI_MEMORY_SP memory gets
registered/reserved in a way that allows the HMAT code (which should
be arch agnostic) to operate in the same way as it does on x86. Would
it be enough to simply memblock_reserve() it and insert the iomem
resource with the soft_reserved attribute?

Some more comments below.

> ---
> Documentation/admin-guide/kernel-parameters.txt | 19 +++++++--
> arch/x86/Kconfig | 21 +++++++++
> arch/x86/boot/compressed/eboot.c | 7 +++
> arch/x86/boot/compressed/kaslr.c | 4 ++
> arch/x86/include/asm/e820/types.h | 8 ++++
> arch/x86/include/asm/efi-stub.h | 11 +++++
> arch/x86/kernel/e820.c | 12 +++++
> arch/x86/platform/efi/efi.c | 51 +++++++++++++++++++++--
> drivers/firmware/efi/efi.c | 3 +
> drivers/firmware/efi/libstub/efi-stub-helper.c | 12 +++++
> include/linux/efi.h | 1
> include/linux/ioport.h | 1
> 12 files changed, 139 insertions(+), 11 deletions(-)
> create mode 100644 arch/x86/include/asm/efi-stub.h
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 1c67acd1df65..dd28f0726309 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1152,7 +1152,8 @@
> Format: {"off" | "on" | "skip[mbr]"}
>
> efi= [EFI]
> - Format: { "old_map", "nochunk", "noruntime", "debug" }
> + Format: { "old_map", "nochunk", "noruntime", "debug",
> + "nosoftreserve" }
> old_map [X86-64]: switch to the old ioremap-based EFI
> runtime services mapping. 32-bit still uses this one by
> default.
> @@ -1161,6 +1162,12 @@
> firmware implementations.
> noruntime : disable EFI runtime services support
> debug: enable misc debug output
> + nosoftreserve: The EFI_MEMORY_SP (Specific Purpose)
> + attribute may cause the kernel to reserve the
> + memory range for a memory mapping driver to
> + claim. Specify efi=nosoftreserve to disable this
> + reservation and treat the memory by its base type
> + (i.e. EFI_CONVENTIONAL_MEMORY / "System RAM").
>
> efi_no_storage_paranoia [EFI; X86]
> Using this parameter you can use more than 50% of
> @@ -1173,15 +1180,21 @@
> updating original EFI memory map.
> Region of memory which aa attribute is added to is
> from ss to ss+nn.
> +
> If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000
> is specified, EFI_MEMORY_MORE_RELIABLE(0x10000)
> attribute is added to range 0x100000000-0x180000000 and
> 0x10a0000000-0x1120000000.
>
> + If efi_fake_mem=8G@9G:0x40000 is specified, the
> + EFI_MEMORY_SP(0x40000) attribute is added to
> + range 0x240000000-0x43fffffff.
> +
> Using this parameter you can do debugging of EFI memmap
> - related feature. For example, you can do debugging of
> + related features. For example, you can do debugging of
> Address Range Mirroring feature even if your box
> - doesn't support it.
> + doesn't support it, or mark specific memory as
> + "soft reserved".
>
> efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT
> that is to be dynamically loaded by Linux. If there are
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 4195f44c6a09..bced13503bb1 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1981,6 +1981,27 @@ config EFI_MIXED
>
> If unsure, say N.
>
> +config EFI_SOFT_RESERVE
> + bool "Reserve EFI Specific Purpose Memory"
> + depends on EFI && ACPI_HMAT
> + default ACPI_HMAT
> + ---help---
> + On systems that have mixed performance classes of memory EFI
> + may indicate specific purpose memory with an attribute (See
> + EFI_MEMORY_SP in UEFI 2.8). A memory range tagged with this
> + attribute may have unique performance characteristics compared
> + to the system's general purpose "System RAM" pool. On the
> + expectation that such memory has application specific usage,
> + and its base EFI memory type is "conventional" answer Y to
> + arrange for the kernel to reserve it as a "Soft Reserved"
> + resource, and set aside for direct-access (device-dax) by
> + default. The memory range can later be optionally assigned to
> + the page allocator by system administrator policy via the
> + device-dax kmem facility. Say N to have the kernel treat this
> + memory as "System RAM" by default.
> +
> + If unsure, say Y.
> +

This should be in generic code.

> config SECCOMP
> def_bool y
> prompt "Enable seccomp to safely compute untrusted bytecode"
> diff --git a/arch/x86/boot/compressed/eboot.c b/arch/x86/boot/compressed/eboot.c
> index d6662fdef300..f2dc5896d770 100644
> --- a/arch/x86/boot/compressed/eboot.c
> +++ b/arch/x86/boot/compressed/eboot.c
> @@ -10,6 +10,7 @@
> #include <linux/pci.h>
>
> #include <asm/efi.h>
> +#include <asm/efi-stub.h>
> #include <asm/e820/types.h>
> #include <asm/setup.h>
> #include <asm/desc.h>
> @@ -553,7 +554,11 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s
> case EFI_BOOT_SERVICES_CODE:
> case EFI_BOOT_SERVICES_DATA:
> case EFI_CONVENTIONAL_MEMORY:
> - e820_type = E820_TYPE_RAM;
> + if (!efi_nosoftreserve
> + && (d->attribute & EFI_MEMORY_SP))
> + e820_type = E820_TYPE_SOFT_RESERVED;
> + else
> + e820_type = E820_TYPE_RAM;
> break;
>
> case EFI_ACPI_MEMORY_NVS:
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index 2e53c056ba20..093e84e28b7a 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -38,6 +38,7 @@
> #include <linux/efi.h>
> #include <generated/utsrelease.h>
> #include <asm/efi.h>
> +#include <asm/efi-stub.h>
>
> /* Macros used by the included decompressor code below. */
> #define STATIC
> @@ -760,6 +761,9 @@ process_efi_entries(unsigned long minimum, unsigned long image_size)
> if (md->type != EFI_CONVENTIONAL_MEMORY)
> continue;
>
> + if (!efi_nosoftreserve && (md->attribute & EFI_MEMORY_SP))
> + continue;
> +
> if (efi_mirror_found &&
> !(md->attribute & EFI_MEMORY_MORE_RELIABLE))
> continue;
> diff --git a/arch/x86/include/asm/e820/types.h b/arch/x86/include/asm/e820/types.h
> index c3aa4b5e49e2..314f75d886d0 100644
> --- a/arch/x86/include/asm/e820/types.h
> +++ b/arch/x86/include/asm/e820/types.h
> @@ -28,6 +28,14 @@ enum e820_type {
> */
> E820_TYPE_PRAM = 12,
>
> + /*
> + * Special-purpose memory is indicated to the system via the
> + * EFI_MEMORY_SP attribute. Define an e820 translation of this
> + * memory type for the purpose of reserving this range and
> + * marking it with the IORES_DESC_SOFT_RESERVED designation.
> + */
> + E820_TYPE_SOFT_RESERVED = 0xefffffff,
> +
> /*
> * Reserved RAM used by the kernel itself if
> * CONFIG_INTEL_TXT=y is enabled, memory of this type
> diff --git a/arch/x86/include/asm/efi-stub.h b/arch/x86/include/asm/efi-stub.h
> new file mode 100644
> index 000000000000..16ebd036387b
> --- /dev/null
> +++ b/arch/x86/include/asm/efi-stub.h
> @@ -0,0 +1,11 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#ifndef _X86_EFI_STUB_H_
> +#define _X86_EFI_STUB_H_
> +
> +#ifdef CONFIG_EFI_STUB
> +extern bool efi_nosoftreserve;
> +#else
> +#define efi_nosoftreserve (1)
> +#endif
> +
> +#endif /* _X86_EFI_STUB_H_ */

Please put this in generic code as well (but you need a function, not a
variable; see below).

> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index 7da2bcd2b8eb..9976106b57ec 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -190,6 +190,7 @@ static void __init e820_print_type(enum e820_type type)
> case E820_TYPE_RAM: /* Fall through: */
> case E820_TYPE_RESERVED_KERN: pr_cont("usable"); break;
> case E820_TYPE_RESERVED: pr_cont("reserved"); break;
> + case E820_TYPE_SOFT_RESERVED: pr_cont("soft reserved"); break;
> case E820_TYPE_ACPI: pr_cont("ACPI data"); break;
> case E820_TYPE_NVS: pr_cont("ACPI NVS"); break;
> case E820_TYPE_UNUSABLE: pr_cont("unusable"); break;
> @@ -1037,6 +1038,7 @@ static const char *__init e820_type_to_string(struct e820_entry *entry)
> case E820_TYPE_PRAM: return "Persistent Memory (legacy)";
> case E820_TYPE_PMEM: return "Persistent Memory";
> case E820_TYPE_RESERVED: return "Reserved";
> + case E820_TYPE_SOFT_RESERVED: return "Soft Reserved";
> default: return "Unknown E820 type";
> }
> }
> @@ -1052,6 +1054,7 @@ static unsigned long __init e820_type_to_iomem_type(struct e820_entry *entry)
> case E820_TYPE_PRAM: /* Fall-through: */
> case E820_TYPE_PMEM: /* Fall-through: */
> case E820_TYPE_RESERVED: /* Fall-through: */
> + case E820_TYPE_SOFT_RESERVED: /* Fall-through: */
> default: return IORESOURCE_MEM;
> }
> }
> @@ -1064,6 +1067,7 @@ static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
> case E820_TYPE_PMEM: return IORES_DESC_PERSISTENT_MEMORY;
> case E820_TYPE_PRAM: return IORES_DESC_PERSISTENT_MEMORY_LEGACY;
> case E820_TYPE_RESERVED: return IORES_DESC_RESERVED;
> + case E820_TYPE_SOFT_RESERVED: return IORES_DESC_SOFT_RESERVED;
> case E820_TYPE_RESERVED_KERN: /* Fall-through: */
> case E820_TYPE_RAM: /* Fall-through: */
> case E820_TYPE_UNUSABLE: /* Fall-through: */
> @@ -1078,11 +1082,12 @@ static bool __init do_mark_busy(enum e820_type type, struct resource *res)
> return true;
>
> /*
> - * Treat persistent memory like device memory, i.e. reserve it
> - * for exclusive use of a driver
> + * Treat persistent memory and other special memory ranges like
> + * device memory, i.e. reserve it for exclusive use of a driver
> */
> switch (type) {
> case E820_TYPE_RESERVED:
> + case E820_TYPE_SOFT_RESERVED:
> case E820_TYPE_PRAM:
> case E820_TYPE_PMEM:
> return false;
> @@ -1285,6 +1290,9 @@ void __init e820__memblock_setup(void)
> if (end != (resource_size_t)end)
> continue;
>
> + if (entry->type == E820_TYPE_SOFT_RESERVED)
> + memblock_reserve(entry->addr, entry->size);
> +
> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> continue;
>
> diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
> index 0bb58eb33ca0..9cfb7f1cf25d 100644
> --- a/arch/x86/platform/efi/efi.c
> +++ b/arch/x86/platform/efi/efi.c
> @@ -151,10 +151,18 @@ void __init efi_find_mirror(void)
> * more than the max 128 entries that can fit in the e820 legacy
> * (zeropage) memory map.
> */
> +enum add_efi_mode {
> + ADD_EFI_ALL,
> + ADD_EFI_SOFT_RESERVED,
> +};
>
> -static void __init do_add_efi_memmap(void)
> +static void __init do_add_efi_memmap(enum add_efi_mode mode)
> {
> efi_memory_desc_t *md;
> + int add = 0;
> +
> + if (!efi_enabled(EFI_MEMMAP))
> + return;
>
> for_each_efi_memory_desc(md) {
> unsigned long long start = md->phys_addr;
> @@ -167,7 +175,10 @@ static void __init do_add_efi_memmap(void)
> case EFI_BOOT_SERVICES_CODE:
> case EFI_BOOT_SERVICES_DATA:
> case EFI_CONVENTIONAL_MEMORY:
> - if (md->attribute & EFI_MEMORY_WB)
> + if (efi_enabled(EFI_MEM_SOFT_RESERVE)
> + && (md->attribute & EFI_MEMORY_SP))
> + e820_type = E820_TYPE_SOFT_RESERVED;
> + else if (md->attribute & EFI_MEMORY_WB)
> e820_type = E820_TYPE_RAM;
> else
> e820_type = E820_TYPE_RESERVED;
> @@ -193,9 +204,17 @@ static void __init do_add_efi_memmap(void)
> e820_type = E820_TYPE_RESERVED;
> break;
> }
> +
> + if (e820_type == E820_TYPE_SOFT_RESERVED)
> + /* always add E820_TYPE_SOFT_RESERVED */;
> + else if (mode == ADD_EFI_SOFT_RESERVED)
> + continue;
> +
> + add++;
> e820__range_add(start, size, e820_type);
> }
> - e820__update_table(e820_table);
> + if (add)
> + e820__update_table(e820_table);
> }
>
> int __init efi_memblock_x86_reserve_range(void)
> @@ -227,8 +246,18 @@ int __init efi_memblock_x86_reserve_range(void)
> if (rv)
> return rv;
>
> - if (add_efi_memmap)
> - do_add_efi_memmap();
> + if (add_efi_memmap) {
> + do_add_efi_memmap(ADD_EFI_ALL);
> + } else {
> + /*
> + * Given add_efi_memmap defaults to 0 and there is no e820
> + * mechanism for soft-reserved memory, explicitly scan for
> + * soft-reserved memory. Otherwise, the mechanism to disable the
> + * kernel's consideration of EFI_MEMORY_SP is the
> + * efi=nosoftreserve option.
> + */
> + do_add_efi_memmap(ADD_EFI_SOFT_RESERVED);
> + }
>
> WARN(efi.memmap.desc_version != 1,
> "Unexpected EFI_MEMORY_DESCRIPTOR version %ld",
> @@ -781,6 +810,15 @@ static bool should_map_region(efi_memory_desc_t *md)
> if (IS_ENABLED(CONFIG_X86_32))
> return false;
>
> + /*
> + * EFI specific purpose memory may be reserved by default
> + * depending on kernel config and boot options.
> + */
> + if (md->type == EFI_CONVENTIONAL_MEMORY
> + && efi_enabled(EFI_MEM_SOFT_RESERVE)
> + && (md->attribute & EFI_MEMORY_SP))
> + return false;
> +
> /*
> * Map all of RAM so that we can access arguments in the 1:1
> * mapping when making EFI runtime calls.
> @@ -1072,6 +1110,9 @@ static int __init arch_parse_efi_cmdline(char *str)
> if (parse_option_str(str, "old_map"))
> set_bit(EFI_OLD_MEMMAP, &efi.flags);
>
> + if (parse_option_str(str, "nosoftreserve"))
> + clear_bit(EFI_MEM_SOFT_RESERVE, &efi.flags);
> +

Can we move this to the generic efi= handling code?
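
To illustrate what I mean, here is a rough user-space sketch of handling
the option in one shared place. has_option() and parse_efi_options() are
names made up for this example and are not existing kernel helpers; only
the bit number is taken from your patch. The point is just that whole
comma-delimited words are matched once, for every architecture.

```c
#include <stdbool.h>
#include <string.h>

/* Bit number taken from the patch; everything else here is illustrative. */
#define EFI_MEM_SOFT_RESERVE 11

/*
 * Hypothetical helper: return true if 'option' appears as a whole
 * comma-delimited word in 'str' (e.g. "debug,nosoftreserve").
 */
static bool has_option(const char *str, const char *option)
{
	size_t optlen = strlen(option);

	while (*str) {
		size_t len = strcspn(str, ",");

		if (len == optlen && !strncmp(str, option, len))
			return true;
		str += len;
		if (*str == ',')
			str++;
	}
	return false;
}

/* Hypothetical arch-independent handler for the value of "efi=". */
static void parse_efi_options(const char *str, unsigned long *flags)
{
	if (has_option(str, "nosoftreserve"))
		*flags &= ~(1UL << EFI_MEM_SOFT_RESERVE);
}
```

Because the match is on the full word, a truncated option such as
"nosoft" leaves the flag untouched.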

> return 0;
> }
> early_param("efi", arch_parse_efi_cmdline);
> diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
> index 363bb9d00fa5..6d54d5c74347 100644
> --- a/drivers/firmware/efi/efi.c
> +++ b/drivers/firmware/efi/efi.c
> @@ -52,6 +52,9 @@ struct efi __read_mostly efi = {
> .tpm_log = EFI_INVALID_TABLE_ADDR,
> .tpm_final_log = EFI_INVALID_TABLE_ADDR,
> .mem_reserve = EFI_INVALID_TABLE_ADDR,
> +#ifdef CONFIG_EFI_SOFT_RESERVE
> + .flags = 1UL << EFI_MEM_SOFT_RESERVE,
> +#endif
> };
> EXPORT_SYMBOL(efi);
>

I'd prefer it if we could call this EFI_MEM_NO_SOFT_RESERVE instead,
and invert the meaning of the bit.
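
A minimal user-space sketch of the inverted polarity, under the
assumption that the compile-time switch is folded into a small helper.
efi_flags, efi_disable_soft_reserve() and efi_soft_reserve_enabled() are
hypothetical names for illustration, and the bit number is arbitrary.
With the bit meaning "opted out", a zero-initialized flags word already
gives the default behaviour, so the #ifdef'ed initializer goes away.

```c
#include <stdbool.h>

/* Hypothetical bit number and config switch, for illustration only. */
#define EFI_MEM_NO_SOFT_RESERVE 11
#ifndef CONFIG_EFI_SOFT_RESERVE
#define CONFIG_EFI_SOFT_RESERVE 1
#endif

/* Stand-in for efi.flags: zero-initialized, no #ifdef'ed initializer. */
static unsigned long efi_flags;

/* Called when "efi=nosoftreserve" is seen on the command line. */
static void efi_disable_soft_reserve(void)
{
	efi_flags |= 1UL << EFI_MEM_NO_SOFT_RESERVE;
}

/* Soft reservation is honored unless compiled out or opted out at boot. */
static bool efi_soft_reserve_enabled(void)
{
	return CONFIG_EFI_SOFT_RESERVE &&
	       !(efi_flags & (1UL << EFI_MEM_NO_SOFT_RESERVE));
}
```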

> diff --git a/drivers/firmware/efi/libstub/efi-stub-helper.c b/drivers/firmware/efi/libstub/efi-stub-helper.c
> index 3caae7f2cf56..35ee98a2c00c 100644
> --- a/drivers/firmware/efi/libstub/efi-stub-helper.c
> +++ b/drivers/firmware/efi/libstub/efi-stub-helper.c
> @@ -28,6 +28,7 @@
> #define EFI_READ_CHUNK_SIZE (1024 * 1024)
>
> static unsigned long __chunk_size = EFI_READ_CHUNK_SIZE;
> +bool efi_nosoftreserve;
>

This needs a getter function if you want to access it from other
compilation units. This has to do with how the early relocation code
handles data symbol references. Please refer to nokaslr() for an
example.
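
Roughly the shape I have in mind, as a stand-alone sketch rather than
the exact code: the flag stays private to the file and other compilation
units only ever call a function. The names __nosoftreserve and
set_nosoftreserve() are placeholders for this example, and the section
placement used by the existing stub flags is left out so the snippet
builds anywhere.

```c
#include <stdbool.h>

/*
 * Keep the flag private to this compilation unit. The stub's existing
 * flags (e.g. __nokaslr) are additionally placed in .data with a
 * section attribute, which is omitted here for portability.
 */
static bool __nosoftreserve;

/* Placeholder setter, used by the option parser in the same file. */
static void set_nosoftreserve(void)
{
	__nosoftreserve = true;
}

/* Getter: other files call this instead of referencing the variable. */
bool efi_nosoftreserve(void)
{
	return __nosoftreserve;
}
```

Callers would then test !efi_nosoftreserve() instead of reading the
variable directly.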

> static int __section(.data) __nokaslr;
> static int __section(.data) __quiet;
> @@ -211,6 +212,9 @@ efi_status_t efi_high_alloc(efi_system_table_t *sys_table_arg,
> if (desc->type != EFI_CONVENTIONAL_MEMORY)
> continue;
>
> + if (!efi_nosoftreserve && (desc->attribute & EFI_MEMORY_SP))
> + continue;
> +
> if (desc->num_pages < nr_pages)
> continue;
>
> @@ -305,6 +309,9 @@ efi_status_t efi_low_alloc(efi_system_table_t *sys_table_arg,
> if (desc->type != EFI_CONVENTIONAL_MEMORY)
> continue;
>
> + if (!efi_nosoftreserve && (desc->attribute & EFI_MEMORY_SP))
> + continue;
> +
> if (desc->num_pages < nr_pages)
> continue;
>
> @@ -489,6 +496,11 @@ efi_status_t efi_parse_options(char const *cmdline)
> __novamap = 1;
> }
>
> + if (!strncmp(str, "nosoftreserve", 13)) {
> + str += strlen("nosoftreserve");
> + efi_nosoftreserve = 1;
> + }
> +
> /* Group words together, delimited by "," */
> while (*str && *str != ' ' && *str != ',')
> str++;
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index acc2b8982ed2..f50e0f01a5ed 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -1201,6 +1201,7 @@ extern int __init efi_setup_pcdp_console(char *);
> #define EFI_DBG 8 /* Print additional debug info at runtime */
> #define EFI_NX_PE_DATA 9 /* Can runtime data regions be mapped non-executable? */
> #define EFI_MEM_ATTR 10 /* Did firmware publish an EFI_MEMORY_ATTRIBUTES table? */
> +#define EFI_MEM_SOFT_RESERVE 11 /* Is the kernel configured to honor soft reservations? */
>
> #ifdef CONFIG_EFI
> /*
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 5b6a7121c9f0..17d9b1abc2f0 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -134,6 +134,7 @@ enum {
> IORES_DESC_PERSISTENT_MEMORY_LEGACY = 5,
> IORES_DESC_DEVICE_PRIVATE_MEMORY = 6,
> IORES_DESC_RESERVED = 7,
> + IORES_DESC_SOFT_RESERVED = 8,
> };
>
> /*
>