Re: [PATCH V3] x86/sgx: Fix free page accounting
From: Jarkko Sakkinen
Date: Tue Nov 16 2021 - 11:01:35 EST
On Mon, 2021-11-15 at 11:29 -0800, Reinette Chatre wrote:
> The SGX driver maintains a single global free page counter,
> sgx_nr_free_pages, that reflects the number of free pages available
> across all NUMA nodes. Correspondingly, a list of free pages is
> associated with each NUMA node and sgx_nr_free_pages is updated
> every time a page is added or removed from any of the free page
> lists. The main usage of sgx_nr_free_pages is by the reclaimer
> that runs when it (sgx_nr_free_pages) goes below a watermark
> to ensure that there are always some free pages available to, for
> example, support efficient page faults.
>
> With sgx_nr_free_pages accessed and modified from a few places
> it is essential to ensure that these accesses are done safely but
> this is not the case. sgx_nr_free_pages is read without any
> protection and updated with inconsistent protection by any one
> of the spin locks associated with the individual NUMA nodes.
> For example:
>
> CPU_A CPU_B
> ----- -----
> spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
> ... ...
> sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--;
>
> spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);
>
> Since sgx_nr_free_pages may be protected by different spin locks
> while being modified from different CPUs, the following scenario
> is possible:
>
> CPU_A CPU_B
> ----- -----
> {sgx_nr_free_pages = 100}
> spin_lock(&nodeA->lock); spin_lock(&nodeB->lock);
> sgx_nr_free_pages--; sgx_nr_free_pages--;
> /* LOAD sgx_nr_free_pages = 100 */ /* LOAD sgx_nr_free_pages = 100 */
> /* sgx_nr_free_pages-- */ /* sgx_nr_free_pages-- */
> /* STORE sgx_nr_free_pages = 99 */ /* STORE sgx_nr_free_pages = 99 */
> spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock);
>
> In the above scenario, sgx_nr_free_pages is decremented from two CPUs
> but instead of sgx_nr_free_pages ending with a value that is two less
> than it started with, it was only decremented by one while the number
> of free pages were actually reduced by two. The consequence of
> sgx_nr_free_pages not being protected is that its value may not
> accurately reflect the actual number of free pages on the system,
> impacting the availability of free pages in support of many flows.
>
> The problematic scenario is when the reclaimer does not run because it
> believes there to be sufficient free pages while any attempt to allocate
> a page fails because there are no free pages available. In the SGX driver
> the reclaimer's watermark is only 32 pages so after encountering the
> above example scenario 32 times a user space hang is possible when there
> are no more free pages because of repeated page faults caused by no
> free pages made available.
>
> The following flow was encountered:
> asm_exc_page_fault
> ...
> sgx_vma_fault()
> sgx_encl_load_page()
> sgx_encl_eldu() // Encrypted page needs to be loaded from backing
> // storage into newly allocated SGX memory page
> sgx_alloc_epc_page() // Allocate a page of SGX memory
> __sgx_alloc_epc_page() // Fails, no free SGX memory
> ...
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer
> wake_up(&ksgxd_waitq);
> return -EBUSY; // Return -EBUSY giving reclaimer time to run
> return -EBUSY;
> return -EBUSY;
> return VM_FAULT_NOPAGE;
>
> The reclaimer is triggered in above flow with the following code:
>
> static bool sgx_should_reclaim(unsigned long watermark)
> {
> return sgx_nr_free_pages < watermark &&
> !list_empty(&sgx_active_page_list);
> }
>
> In the problematic scenario there were no free pages available yet the
> value of sgx_nr_free_pages was above the watermark. The allocation of
> SGX memory thus always failed because of a lack of free pages while no
> free pages were made available because the reclaimer is never started
> because of sgx_nr_free_pages' incorrect value. The consequence was that
> user space kept encountering VM_FAULT_NOPAGE that caused the same
> address to be accessed repeatedly with the same result.
>
> Change the global free page counter to an atomic type that
> ensures simultaneous updates are done safely. While doing so, move
> the updating of the variable outside of the spin lock critical
> section to which it does not belong.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
> Suggested-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Reviewed-by: Tony Luck <tony.luck@xxxxxxxxx>
> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
> ---
> Changes since V2:
> - V2:
> https://lore.kernel.org/lkml/b2e69e9febcae5d98d331de094d9cc7ce3217e66.1636487172.git.reinette.chatre@xxxxxxxxx/
> - Update changelog to provide example of unsafe variable modification (Jarkko).
>
> Changes since V1:
> - V1:
> https://lore.kernel.org/lkml/373992d869cd356ce9e9afe43ef4934b70d604fd.1636049678.git.reinette.chatre@xxxxxxxxx/
> - Add static to definition of sgx_nr_free_pages (Tony).
> - Add Tony's signature.
> - Provide detail about error scenario in changelog (Jarkko).
>
> arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 63d3de02bbcc..8471a8b9b48e 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
> static LIST_HEAD(sgx_active_page_list);
> static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>
> -/* The free page list lock protected variables prepend the lock. */
> -static unsigned long sgx_nr_free_pages;
> +static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>
> /* Nodes with one or more EPC sections. */
> static nodemask_t sgx_numa_mask;
> @@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void)
>
> spin_lock(&node->lock);
> list_add_tail(&epc_page->list, &node->free_page_list);
> - sgx_nr_free_pages++;
> spin_unlock(&node->lock);
> + atomic_long_inc(&sgx_nr_free_pages);
> }
> }
>
> static bool sgx_should_reclaim(unsigned long watermark)
> {
> - return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
> + return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> + !list_empty(&sgx_active_page_list);
> }
>
> static int ksgxd(void *p)
> @@ -471,9 +471,9 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
>
> page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
> list_del_init(&page->list);
> - sgx_nr_free_pages--;
>
> spin_unlock(&node->lock);
> + atomic_long_dec(&sgx_nr_free_pages);
>
> return page;
> }
> @@ -625,9 +625,9 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> spin_lock(&node->lock);
>
> list_add_tail(&page->list, &node->free_page_list);
> - sgx_nr_free_pages++;
>
> spin_unlock(&node->lock);
> + atomic_long_inc(&sgx_nr_free_pages);
> }
>
> static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
Acked-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx>
/Jarkko