Re: [PATCH V3] x86/sgx: Fix free page accounting

From: Jarkko Sakkinen
Date: Tue Nov 16 2021 - 11:01:35 EST


On Mon, 2021-11-15 at 11:29 -0800, Reinette Chatre wrote:
> The SGX driver maintains a single global free page counter,
> sgx_nr_free_pages, that reflects the number of free pages available
> across all NUMA nodes. Correspondingly, a list of free pages is
> associated with each NUMA node and sgx_nr_free_pages is updated
> every time a page is added or removed from any of the free page
> lists. The main usage of sgx_nr_free_pages is by the reclaimer
> that runs when it (sgx_nr_free_pages) goes below a watermark
> to ensure that there are always some free pages available to, for
> example, support efficient page faults.
>
> With sgx_nr_free_pages accessed and modified from a few places
> it is essential to ensure that these accesses are done safely but
> this is not the case. sgx_nr_free_pages is read without any
> protection and updated with inconsistent protection by any one
> of the spin locks associated with the individual NUMA nodes.
> For example:
>
>       CPU_A                                 CPU_B
>       -----                                 -----
>  spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
>  ...                                   ...
>  sgx_nr_free_pages--;  /* NOT SAFE */  sgx_nr_free_pages--;
>
>  spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
>
> Since sgx_nr_free_pages may be protected by different spin locks
> while being modified from different CPUs, the following scenario
> is possible:
>
>       CPU_A                                CPU_B
>       -----                                -----
> {sgx_nr_free_pages = 100}
>  spin_lock(&nodeA->lock);              spin_lock(&nodeB->lock);
>  sgx_nr_free_pages--;                  sgx_nr_free_pages--;
>  /* LOAD sgx_nr_free_pages = 100 */    /* LOAD sgx_nr_free_pages = 100 */
>  /* sgx_nr_free_pages--          */    /* sgx_nr_free_pages--          */
>  /* STORE sgx_nr_free_pages = 99 */    /* STORE sgx_nr_free_pages = 99 */
>  spin_unlock(&nodeA->lock);            spin_unlock(&nodeB->lock);
>
> In the above scenario, sgx_nr_free_pages is decremented from two CPUs
> but instead of sgx_nr_free_pages ending with a value that is two less
> than it started with, it was only decremented by one while the number
> of free pages were actually reduced by two. The consequence of
> sgx_nr_free_pages not being protected is that its value may not
> accurately reflect the actual number of free pages on the system,
> impacting the availability of free pages in support of many flows.
>
> The problematic scenario is when the reclaimer does not run because it
> believes there to be sufficient free pages while any attempt to allocate
> a page fails because there are no free pages available. In the SGX driver
> the reclaimer's watermark is only 32 pages so after encountering the
> above example scenario 32 times a user space hang is possible when there
> are no more free pages because of repeated page faults caused by no
> free pages made available.
>
> The following flow was encountered:
> asm_exc_page_fault
>  ...
>    sgx_vma_fault()
>      sgx_encl_load_page()
>        sgx_encl_eldu() // Encrypted page needs to be loaded from backing
>                        // storage into newly allocated SGX memory page
>          sgx_alloc_epc_page() // Allocate a page of SGX memory
>            __sgx_alloc_epc_page() // Fails, no free SGX memory
>            ...
>            if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
>              wake_up(&ksgxd_waitq);
>            return -EBUSY; // Return -EBUSY giving reclaimer time to run
>        return -EBUSY;
>      return -EBUSY;
>    return VM_FAULT_NOPAGE;
>
> The reclaimer is triggered in above flow with the following code:
>
> static bool sgx_should_reclaim(unsigned long watermark)
> {
>         return sgx_nr_free_pages < watermark &&
>                !list_empty(&sgx_active_page_list);
> }
>
> In the problematic scenario there were no free pages available yet the
> value of sgx_nr_free_pages was above the watermark. The allocation of
> SGX memory thus always failed because of a lack of free pages while no
> free pages were made available because the reclaimer is never started
> because of sgx_nr_free_pages' incorrect value. The consequence was that
> user space kept encountering VM_FAULT_NOPAGE that caused the same
> address to be accessed repeatedly with the same result.
>
> Change the global free page counter to an atomic type that
> ensures simultaneous updates are done safely. While doing so, move
> the updating of the variable outside of the spin lock critical
> section to which it does not belong.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> Suggested-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Reviewed-by: Tony Luck <tony.luck@xxxxxxxxx>
> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
> ---
> Changes since V2:
> - V2:
> https://lore.kernel.org/lkml/b2e69e9febcae5d98d331de094d9cc7ce3217e66.1636487172.git.reinette.chatre@xxxxxxxxx/
> - Update changelog to provide example of unsafe variable modification (Jarkko).
>
> Changes since V1:
> - V1:
>   https://lore.kernel.org/lkml/373992d869cd356ce9e9afe43ef4934b70d604fd.1636049678.git.reinette.chatre@xxxxxxxxx/
> - Add static to definition of sgx_nr_free_pages (Tony).
> - Add Tony's signature.
> - Provide detail about error scenario in changelog (Jarkko).
>
>  arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 63d3de02bbcc..8471a8b9b48e 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
>  static LIST_HEAD(sgx_active_page_list);
>  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>  
> -/* The free page list lock protected variables prepend the lock. */
> -static unsigned long sgx_nr_free_pages;
> +static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>  
>  /* Nodes with one or more EPC sections. */
>  static nodemask_t sgx_numa_mask;
> @@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void)
>  
>                 spin_lock(&node->lock);
>                 list_add_tail(&epc_page->list, &node->free_page_list);
> -               sgx_nr_free_pages++;
>                 spin_unlock(&node->lock);
> +               atomic_long_inc(&sgx_nr_free_pages);
>         }
>  }
>  
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
> -       return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
> +       return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> +              !list_empty(&sgx_active_page_list);
>  }
>  
>  static int ksgxd(void *p)
> @@ -471,9 +471,9 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
>  
>         page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
>         list_del_init(&page->list);
> -       sgx_nr_free_pages--;
>  
>         spin_unlock(&node->lock);
> +       atomic_long_dec(&sgx_nr_free_pages);
>  
>         return page;
>  }
> @@ -625,9 +625,9 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>         spin_lock(&node->lock);
>  
>         list_add_tail(&page->list, &node->free_page_list);
> -       sgx_nr_free_pages++;
>  
>         spin_unlock(&node->lock);
> +       atomic_long_inc(&sgx_nr_free_pages);
>  }
>  
>  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,

Acked-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx>

/Jarkko