Re: [PATCH v2 4/8] x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax

From: Dan Williams
Date: Fri Jun 07 2019 - 11:28:21 EST


On Fri, Jun 7, 2019 at 5:29 AM Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
>
> On Sat, 1 Jun 2019 at 06:26, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >
> > On Fri, May 31, 2019 at 8:30 AM Ard Biesheuvel
> > <ard.biesheuvel@xxxxxxxxxx> wrote:
> > >
> > > On Fri, 31 May 2019 at 17:28, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, May 31, 2019 at 1:30 AM Ard Biesheuvel
> > > > <ard.biesheuvel@xxxxxxxxxx> wrote:
> > > > >
> > > > > (cc Mike for memblock)
> > > > >
> > > > > On Fri, 31 May 2019 at 01:13, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > > > > >
> > > > > > UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the
> > > > > > interpretation of the EFI Memory Types as "reserved for a special
> > > > > > purpose".
> > > > > >
> > > > > > The proposed Linux behavior for specific purpose memory is that it is
> > > > > > reserved for direct-access (device-dax) by default and not available for
> > > > > > any kernel usage, not even as an OOM fallback. Later, through udev
> > > > > > scripts or another init mechanism, these device-dax claimed ranges can
> > > > > > be reconfigured and hot-added to the available System-RAM with a unique
> > > > > > node identifier.
> > > > > >
> > > > > > This patch introduces 3 new concepts at once given the entanglement
> > > > > > between early boot enumeration relative to memory that can optionally be
> > > > > > reserved from the kernel page allocator by default. The new concepts
> > > > > > are:
> > > > > >
> > > > > > - E820_TYPE_SPECIFIC: Upon detecting the EFI_MEMORY_SP attribute on
> > > > > > EFI_CONVENTIONAL memory, update the E820 map with this new type. Only
> > > > > > perform this classification if the CONFIG_EFI_SPECIFIC_DAX=y policy is
> > > > > > enabled, otherwise treat it as typical ram.
> > > > > >
> > > > >
> > > > > OK, so now we have 'special purpose', 'specific' and 'app specific'
> > > > > [below]. Do they all mean the same thing?
> > > >
> > > > I struggled with separating the raw-EFI-type name from the name of the
> > > > Linux specific policy. Since the reservation behavior is optional I
> > > > was thinking there should be a distinct Linux kernel name for that
> > > > policy. I did try to go back and change all occurrences of "special"
> > > > to "specific" from the RFC to this v2, but seems I missed one.
> > > >
> > >
> > > OK
> >
> > I'll go ahead and use "application reserved" terminology consistently
> > throughout the code to distinguish that Linux translation from the raw
> > "EFI specific purpose" attribute.
> >
>
> OK
>
> > >
> > > > >
> > > > > > - IORES_DESC_APPLICATION_RESERVED: Add a new I/O resource descriptor for
> > > > > > a device driver to search iomem resources for application specific
> > > > > > memory. Teach the iomem code to identify such ranges as "Application
> > > > > > Reserved".
> > > > > >
> > > > > > - MEMBLOCK_APP_SPECIFIC: Given the memory ranges can fallback to the
> > > > > > traditional System RAM pool the expectation is that they will have
> > > > > > typical SRAT entries. In order to support a policy of device-dax by
> > > > > > default with the option to hotplug later, the numa initialization code
> > > > > > is taught to avoid marking online MEMBLOCK_APP_SPECIFIC regions.
> > > > > >
> > > > >
> > > > > Can we move the generic memblock changes into a separate patch please?
> > > >
> > > > Yeah, that can move to a lead-in patch.
> > > >
> > > > [..]
> > > > > > diff --git a/include/linux/efi.h b/include/linux/efi.h
> > > > > > index 91368f5ce114..b57b123cbdf9 100644
> > > > > > --- a/include/linux/efi.h
> > > > > > +++ b/include/linux/efi.h
> > > > > > @@ -129,6 +129,19 @@ typedef struct {
> > > > > > u64 attribute;
> > > > > > } efi_memory_desc_t;
> > > > > >
> > > > > > +#ifdef CONFIG_EFI_SPECIFIC_DAX
> > > > > > +static inline bool is_efi_dax(efi_memory_desc_t *md)
> > > > > > +{
> > > > > > + return md->type == EFI_CONVENTIONAL_MEMORY
> > > > > > + && (md->attribute & EFI_MEMORY_SP);
> > > > > > +}
> > > > > > +#else
> > > > > > +static inline bool is_efi_dax(efi_memory_desc_t *md)
> > > > > > +{
> > > > > > + return false;
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > > typedef struct {
> > > > > > efi_guid_t guid;
> > > > > > u32 headersize;
> > > > >
> > > > > I'd prefer it if we could avoid this DAX policy distinction leaking
> > > > > into the EFI layer.
> > > > >
> > > > > IOW, I am fine with having a 'is_efi_sp_memory()' helper here, but
> > > > > whether that is DAX memory or not should be decided in the DAX layer.
> > > >
> > > > Ok, how about is_efi_sp_ram()? Since EFI_MEMORY_SP might be applied to
> > > > things that aren't EFI_CONVENTIONAL_MEMORY.
> > >
> > > Yes, that is fine. As long as the #ifdef lives in the DAX code and not here.
> >
> > We still need some ifdef in the efi core because that is the central
> > location to make the policy distinction to identify identify
> > EFI_CONVENTIONAL_MEMORY differently depending on whether EFI_MEMORY_SP
> > is present. I agree with you that "dax" should be dropped from the
> > naming. So how about:
> >
> > #ifdef CONFIG_EFI_APPLICATION_RESERVED
> > static inline bool is_efi_application_reserved(efi_memory_desc_t *md)
> > {
> > return md->type == EFI_CONVENTIONAL_MEMORY
> > && (md->attribute & EFI_MEMORY_SP);
> > }
> > #else
> > static inline bool is_efi_application_reserved(efi_memory_desc_t *md)
> > {
> > return false;
> > }
> > #endif
>
> I think this policy decision should not live inside the EFI subsystem.
> EFI just gives you the memory map, and mangling that information
> depending on whether you think a certain memory attribute should be
> ignored is the job of the MM subsystem.

The problem is that we don't have an mm subsystem at the time a
decision needs to be made. The reservation policy needs to be deployed
before even memblock has been initialized in order to keep kernel
allocations out of the reservation. I agree with the sentiment I just
don't see how to practically achieve an optional "System RAM" vs
"Application Reserved" routing decision without an early (before
e820__memblock_setup()) conditional branch.