Re: [PATCH v3 5/5] x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()

From: Jarkko Sakkinen
Date: Wed Mar 10 2021 - 06:31:39 EST


Weird. I last checked my kernel.org mail on Thursday night but did not
get this. I was actually wondering about the lack of feedback.

Then I suddenly had a huge pile of email waiting for me on Monday, with
a bunch of emails from around the time you sent this one.

On Wed, Mar 03, 2021 at 04:20:03PM -0800, Dave Hansen wrote:
> What changed from the last patch?
>
> On 3/3/21 7:03 AM, Jarkko Sakkinen wrote:
> > Background
> > ==========
> >
> > EPC section is covered by one or more SRAT entries that are associated with
> > one and only one PXM (NUMA node). The motivation behind this patch is to
> > provide basic elements of building allocation scheme based on this premise.
>
> Just like normal RAM, enclave memory (EPC) should be covered by entries
> in the ACPI SRAT table. These entries allow each EPC section to be
> associated with a NUMA node.
>
> Use this information to implement a simple NUMA-aware allocator for
> enclave memory.
>
> > Use phys_to_target_node() to associate each NUMA node with the EPC
> > sections contained within its range. In sgx_alloc_epc_page(), first try
> > to allocate from the NUMA node, where the CPU is executing. If that
> > fails, fallback to the legacy allocation.
>
> By "legacy", you mean the one from the last patch? :)
>
> > Link: https://lore.kernel.org/lkml/158188326978.894464.217282995221175417.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> > Signed-off-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx>
> > ---
> > arch/x86/Kconfig | 1 +
> > arch/x86/kernel/cpu/sgx/main.c | 84 ++++++++++++++++++++++++++++++++++
> > arch/x86/kernel/cpu/sgx/sgx.h | 9 ++++
> > 3 files changed, 94 insertions(+)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index a5f6a3013138..7eb1e96cfe8a 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1940,6 +1940,7 @@ config X86_SGX
> > depends on CRYPTO_SHA256=y
> > select SRCU
> > select MMU_NOTIFIER
> > + select NUMA_KEEP_MEMINFO if NUMA
>
> This dependency is worth mentioning somewhere. Why do we suddenly need
> NUMA_KEEP_MEMINFO?
>
> > +/* Nodes with one or more EPC sections. */
> > +static nodemask_t sgx_numa_mask;
> > +
> > +/*
> > + * Array with one list_head for each possible NUMA node. Each
> > + * list contains all the sgx_epc_section's which are on that
>
> ^ no "'", please
>
> > + * node.
> > + */
> > +static struct sgx_numa_node *sgx_numa_nodes;
> > +
> > +/*
> > + * sgx_free_epc_page() uses this to find out the correct struct sgx_numa_node,
> > + * to put the page in.
> > + */
> > +static int sgx_section_to_numa_node_id[SGX_MAX_EPC_SECTIONS];
>
> If this is per-section, why not put it in struct sgx_epc_section?

Because struct sgx_epc_page does not contain a pointer to
struct sgx_epc_section.
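
To illustrate the point, here is a minimal userspace sketch of the lookup,
assuming the page only carries a section index. All tbl_* names and
TBL_MAX_SECTIONS are hypothetical stand-ins, not code from the patch:

```c
#include <assert.h>
#include <stddef.h>

#define TBL_MAX_SECTIONS 8	/* hypothetical; stands in for SGX_MAX_EPC_SECTIONS */

/* The page knows only the index of its section, not a pointer to it. */
struct tbl_page {
	unsigned int section;
};

/* One node id per section, filled in once during initialization. */
static int tbl_section_to_nid[TBL_MAX_SECTIONS];

static void tbl_set_section_nid(unsigned int section, int nid)
{
	assert(section < TBL_MAX_SECTIONS);
	tbl_section_to_nid[section] = nid;
}

/* Resolve a page to its node id via the table; -1 on a bad index. */
static int tbl_page_to_nid(const struct tbl_page *page)
{
	if (page->section >= TBL_MAX_SECTIONS)
		return -1;

	return tbl_section_to_nid[page->section];
}
```

With only an index in the page, the free path can reach the node through
the table without touching the section structure at all.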

>
> > /*
> > @@ -434,6 +451,36 @@ static bool __init sgx_page_reclaimer_init(struct list_head *laundry)
> > return true;
> > }
> >
> > +static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
> > +{
> > + struct sgx_epc_page *page = NULL;
> > + struct sgx_numa_node *sgx_node;
> > +
> > + if (WARN_ON_ONCE(nid < 0 || nid >= num_possible_nodes()))
> > + return NULL;
>
> This has exactly one call-site which plumbs numa_node_id() in here
> pretty directly. Is this check worthwhile?

Probably not.


> > + if (!node_isset(nid, sgx_numa_mask))
> > + return NULL;
> > +
> > + sgx_node = &sgx_numa_nodes[nid];
> > +
> > + spin_lock(&sgx_free_page_list_lock);
>
> The global lock protecting a per-node structure is a bit unsightly.

The patch set could introduce an additional patch for changing the
locking scheme. It's logically a separate change.

> > + if (list_empty(&sgx_node->free_page_list)) {
> > + spin_unlock(&sgx_free_page_list_lock);
> > + return NULL;
> > + }
> > +
> > + page = list_first_entry(&sgx_node->free_page_list, struct sgx_epc_page, numa_list);
> > + list_del_init(&page->numa_list);
> > + list_del_init(&page->list);
> > + sgx_nr_free_pages--;
> > +
> > + spin_unlock(&sgx_free_page_list_lock);
> > +
> > + return page;
> > +}
> > +
> > /**
> > * __sgx_alloc_epc_page() - Allocate an EPC page
> > *
> > @@ -446,8 +493,14 @@ static bool __init sgx_page_reclaimer_init(struct list_head *laundry)
> > */
> > struct sgx_epc_page *__sgx_alloc_epc_page(void)
> > {
> > + int current_nid = numa_node_id();
> > struct sgx_epc_page *page;
> >
> > + /* Try to allocate EPC from the current node, first: */
> > + page = __sgx_alloc_epc_page_from_node(current_nid);
> > + if (page)
> > + return page;
> > +
> > spin_lock(&sgx_free_page_list_lock);
> >
> > if (list_empty(&sgx_free_page_list)) {
> > @@ -456,6 +509,7 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
> > }
> >
> > page = list_first_entry(&sgx_free_page_list, struct sgx_epc_page, list);
> > + list_del_init(&page->numa_list);
> > list_del_init(&page->list);
> > sgx_nr_free_pages--;
>
> I would much prefer that this does what the real page allocator
> does: keep the page on a single list. That list is maintained
> per-NUMA-node. Allocations try local NUMA node structures, then fall
> back to other structures (hopefully in a locality-aware fashion).
>
> I wrote you the loop that I want to see this implement in an earlier
> review. This, basically:
>
> page = NULL;
> nid = numa_node_id();
> while (true) {
> page = __sgx_alloc_epc_page_from_node(nid);
> if (page)
> break;
>
> nid = // ... some search here, next_node_in()...
> // check if we wrapped around:
> if (nid == numa_node_id())
> break;
> }
>
> There's no global list. You just walk around nodes trying to find one
> with space. If you wrap around, you stop.
>
> Please implement this. If you think it's a bad idea, or can't, let's
> talk about it in advance. Right now, it appears that my review comments
> aren't being incorporated into newer versions.

I interpreted your earlier comments as saying that the fallback is unfair,
and this version of the patch set does fix that.

I can buy the above allocation scheme, but I don't think this patch set
version is a step backwards. The things done to struct sgx_epc_section
are exactly what should be done to it.

Implementation-wise you are asking me to squash 4/5 and 5/5 into a single
patch, and remove the global list. It's a tiny iteration from this patch
version and I can do it.
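
For reference, a rough userspace sketch of that scheme: per-node free lists
only, try the local node first, then walk the remaining nodes and stop on
wrap-around. The model_* names and MODEL_MAX_NODES are hypothetical, and the
plain "(nid + 1) % N" step is an assumption standing in for whatever node
search (e.g. next_node_in()) the real code would use:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NODES 4	/* hypothetical; stands in for the possible-node count */

struct model_page {
	struct model_page *next;
	int nid;		/* node the page belongs to */
};

struct model_node {
	struct model_page *free_pages;	/* singly linked per-node free list */
};

static struct model_node model_nodes[MODEL_MAX_NODES];

/* Put a page back on the free list of its own node. No global list. */
static void model_free_page(struct model_page *page)
{
	assert(page->nid >= 0 && page->nid < MODEL_MAX_NODES);
	page->next = model_nodes[page->nid].free_pages;
	model_nodes[page->nid].free_pages = page;
}

/* Pop one page from the given node, or return NULL if it has none. */
static struct model_page *model_alloc_from_node(int nid)
{
	struct model_page *page = model_nodes[nid].free_pages;

	if (page)
		model_nodes[nid].free_pages = page->next;

	return page;
}

/* Try start_nid first, then the other nodes; give up after one full lap. */
static struct model_page *model_alloc_page(int start_nid)
{
	int nid = start_nid;
	struct model_page *page;

	do {
		page = model_alloc_from_node(nid);
		if (page)
			return page;

		nid = (nid + 1) % MODEL_MAX_NODES;
	} while (nid != start_nid);

	return NULL;
}
```

The sketch ignores locking and locality ordering on purpose; it only shows
the shape of the loop.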

> > void sgx_free_epc_page(struct sgx_epc_page *page)
> > {
> > + int nid = sgx_section_to_numa_node_id[page->section];
> > + struct sgx_numa_node *sgx_node = &sgx_numa_nodes[nid];
> > int ret;
> >
> > WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> > @@ -575,7 +631,15 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> > return;
> >
> > spin_lock(&sgx_free_page_list_lock);
> > +
> > + /* Enable NUMA local allocation in sgx_alloc_epc_page(). */
> > + if (!node_isset(nid, sgx_numa_mask)) {
> > + INIT_LIST_HEAD(&sgx_node->free_page_list);
> > + node_set(nid, sgx_numa_mask);
> > + }
> > +
> > list_add_tail(&page->list, &sgx_free_page_list);
> > + list_add_tail(&page->numa_list, &sgx_node->free_page_list);
> > sgx_nr_free_pages++;
> > spin_unlock(&sgx_free_page_list_lock);
> > }
> > @@ -626,8 +690,28 @@ static bool __init sgx_page_cache_init(struct list_head *laundry)
> > {
> > u32 eax, ebx, ecx, edx, type;
> > u64 pa, size;
> > + int nid;
> > int i;
> >
> > + nodes_clear(sgx_numa_mask);
>
> Is this really required for a variable allocated in .bss?

Probably not, I'll check what nodes_clear() does.

> > + sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
>
> This is what I was looking for here, thanks!
>
> > + /*
> > + * Create NUMA node lookup table for sgx_free_epc_page() as the very
> > + * first step, as it is used to populate the free list's during the
> > + * initialization.
> > + */
> > + for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
> > + nid = numa_map_to_online_node(phys_to_target_node(pa));
> > + if (nid == NUMA_NO_NODE) {
> > + /* The physical address is already printed above. */
> > + pr_warn(FW_BUG "Unable to map EPC section to online node. Fallback to the NUMA node 0.\n");
> > + nid = 0;
> > + }
> > +
> > + sgx_section_to_numa_node_id[i] = nid;
> > + }
> > +
> > for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
> > cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC, &eax, &ebx, &ecx, &edx);
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> > index 41ca045a574a..3a3c07fc0c8e 100644
> > --- a/arch/x86/kernel/cpu/sgx/sgx.h
> > +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> > @@ -27,6 +27,7 @@ struct sgx_epc_page {
> > unsigned int flags;
> > struct sgx_encl_page *owner;
> > struct list_head list;
> > + struct list_head numa_list;
> > };
>
> I'll say it again, explicitly: Each sgx_epc_page should be on one and
> only one free list: a per-NUMA-node list.
>
> > /*
> > @@ -43,6 +44,14 @@ struct sgx_epc_section {
> >
> > extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> >
> > +/*
> > + * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
> > + * the free page list local to the node is stored here.
> > + */
> > +struct sgx_numa_node {
> > + struct list_head free_page_list;
> > +};
>
> I think it's unconscionable to leave this protected by a global lock.
> Please at least give us a per-node spinlock protecting this list.

I can do it but I'll add a separate commit for it. It's better to make
locking scheme changes that way (IMHO). Helps with bisection later on...
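
As a rough idea of what that follow-up could look like, here is a userspace
sketch with the lock moved into the per-node structure. The lk_* names are
hypothetical, and the C11 atomic_flag is only a toy stand-in for a kernel
spinlock_t:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LK_NR_NODES 2	/* hypothetical node count */

struct lk_page {
	struct lk_page *next;
};

struct lk_node {
	atomic_flag lock;		/* toy spinlock guarding this node only */
	struct lk_page *free_pages;
	unsigned long nr_free;
};

static struct lk_node lk_nodes[LK_NR_NODES];

static void lk_nodes_init(void)
{
	for (int i = 0; i < LK_NR_NODES; i++) {
		atomic_flag_clear(&lk_nodes[i].lock);
		lk_nodes[i].free_pages = NULL;
		lk_nodes[i].nr_free = 0;
	}
}

static void lk_lock(struct lk_node *node)
{
	while (atomic_flag_test_and_set_explicit(&node->lock, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void lk_unlock(struct lk_node *node)
{
	atomic_flag_clear_explicit(&node->lock, memory_order_release);
}

/* Freeing takes only the lock of the node that owns the page. */
static void lk_free_page(int nid, struct lk_page *page)
{
	struct lk_node *node;

	assert(nid >= 0 && nid < LK_NR_NODES);
	node = &lk_nodes[nid];

	lk_lock(node);
	page->next = node->free_pages;
	node->free_pages = page;
	node->nr_free++;
	lk_unlock(node);
}

/* Allocation on one node does not contend with activity on other nodes. */
static struct lk_page *lk_alloc_page(int nid)
{
	struct lk_node *node = &lk_nodes[nid];
	struct lk_page *page;

	lk_lock(node);
	page = node->free_pages;
	if (page) {
		node->free_pages = page->next;
		node->nr_free--;
	}
	lk_unlock(node);

	return page;
}
```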

/Jarkko