Re: [PATCH v11 09/13] x86, sgx: basic routines for enclave page cache

From: Jarkko Sakkinen
Date: Tue Jun 19 2018 - 10:58:08 EST


On Fri, Jun 08, 2018 at 11:24:12AM -0700, Dave Hansen wrote:
> On 06/08/2018 10:09 AM, Jarkko Sakkinen wrote:
> > SGX has a set of data structures to maintain information about the enclaves
> > and their security properties. BIOS reserves a fixed size region of
> > physical memory for these structures by setting Processor Reserved Memory
> > Range Registers (PRMRR). This memory area is called Enclave Page Cache
> > (EPC).
> >
> > This commit implements the basic routines to allocate and free pages from
> > different EPC banks. There is also a swapper thread ksgxswapd for EPC pages
> > that gets woken up by sgx_alloc_page() when we run below the low watermark.
> > The swapper thread continues swapping pages up until it reaches the high
> > watermark.
>
> Yay! A new memory manager in arch-specific code.
>
> > Each subsystem that uses SGX must provide a set of callbacks for EPC
> > pages that are used to reclaim, block and write an EPC page. Kernel
> > takes the responsibility of maintaining LRU cache for them.
>
> What does a "subsystem that uses SGX" mean? Do we have one of those
> already?

Driver and KVM.

> > +struct sgx_secs {
> > + uint64_t size;
> > + uint64_t base;
> > + uint32_t ssaframesize;
> > + uint32_t miscselect;
> > + uint8_t reserved1[SGX_SECS_RESERVED1_SIZE];
> > + uint64_t attributes;
> > + uint64_t xfrm;
> > + uint32_t mrenclave[8];
> > + uint8_t reserved2[SGX_SECS_RESERVED2_SIZE];
> > + uint32_t mrsigner[8];
> > + uint8_t reserved3[SGX_SECS_RESERVED3_SIZE];
> > + uint16_t isvvprodid;
> > + uint16_t isvsvn;
> > + uint8_t reserved4[SGX_SECS_RESERVED4_SIZE];
> > +};
>
> This is a hardware structure, right? Doesn't it need to be packed?

Everything is aligned properly in this struct.

> > +enum sgx_tcs_flags {
> > + SGX_TCS_DBGOPTIN = 0x01, /* cleared on EADD */
> > +};
> > +
> > +#define SGX_TCS_RESERVED_MASK 0xFFFFFFFFFFFFFFFEL
>
> Would it be possible to separate out the SGX software structures from
> SGX hardware? It's hard to tell them apart.

How do you draw the line in the architectural structures?

> > +#define SGX_NR_TO_SCAN 16
> > +#define SGX_NR_LOW_PAGES 32
> > +#define SGX_NR_HIGH_PAGES 64
> > +
> > bool sgx_enabled __ro_after_init = false;
> > EXPORT_SYMBOL(sgx_enabled);
> > +bool sgx_lc_enabled __ro_after_init;
> > +EXPORT_SYMBOL(sgx_lc_enabled);
> > +atomic_t sgx_nr_free_pages = ATOMIC_INIT(0);
>
> Hmmm, global atomic. Doesn't sound very scalable.

We could potentially remove this completely as banks have 'free_cnt'
field and use the sum when needed as the value.

> > +struct sgx_epc_bank sgx_epc_banks[SGX_MAX_EPC_BANKS];
> > +EXPORT_SYMBOL(sgx_epc_banks);
> > +int sgx_nr_epc_banks;
> > +EXPORT_SYMBOL(sgx_nr_epc_banks);
> > +LIST_HEAD(sgx_active_page_list);
> > +EXPORT_SYMBOL(sgx_active_page_list);
> > +DEFINE_SPINLOCK(sgx_active_page_list_lock);
> > +EXPORT_SYMBOL(sgx_active_page_list_lock);
>
> Hmmm, global spinlock protecting a page allocator linked list. Sounds
> even worse than at atomic.
>
> Why is this OK?

Any suggestions what would be a better place in order to make a
fine grained granularity?

> > +static struct task_struct *ksgxswapd_tsk;
> > +static DECLARE_WAIT_QUEUE_HEAD(ksgxswapd_waitq);
> > +
> > +/*
> > + * Writing the LE hash MSRs is extraordinarily expensive, e.g.
> > + * 3-4x slower than normal MSRs, so we use a per-cpu cache to
> > + * track the last known value of the MSRs to avoid unnecessarily
> > + * writing the MSRs with the current value. Because most Linux
> > + * kernels will use an LE that is signed with a non-Intel key,
> > + * i.e. the first EINIT will need to write the MSRs regardless
> > + * of the cache, the cache is intentionally left uninitialized
> > + * during boot as initializing the cache would be pure overhead
> > + * for the majority of systems. Furthermore, the MSRs are per-cpu
> > + * and the boot-time values aren't guaranteed to be identical
> > + * across cpus, so we'd have to run code all all cpus to properly
> > + * init the cache. All in all, the complexity and overhead of
> > + * initializing the cache is not justified.
> > + */
> > +static DEFINE_PER_CPU(u64 [4], sgx_le_pubkey_hash_cache);
>
> Justifying the design decisions is great for changelogs, not so great
> for comments. Also, looking at this, I have no idea what this has to do
> with the "enclave page cache".

We'll have to revisit comment, I see your point.

> > +static void sgx_swap_cluster(void)
> > +{
> > + struct sgx_epc_page *cluster[SGX_NR_TO_SCAN + 1];
> > + struct sgx_epc_page *epc_page;
> > + int i;
> > + int j;
>
> This is rather free of comments or explanation of what this is doing,
> how it is related to swapping as everyone else knows it

I can document this function properly.

> > + memset(cluster, 0, sizeof(cluster));
> > +
> > + for (i = 0, j = 0; i < SGX_NR_TO_SCAN; i++) {
> > + spin_lock(&sgx_active_page_list_lock);
> > + if (list_empty(&sgx_active_page_list)) {
> > + spin_unlock(&sgx_active_page_list_lock);
> > + break;
> > + }
> > + epc_page = list_first_entry(&sgx_active_page_list,
> > + struct sgx_epc_page, list);
> > + if (!epc_page->impl->ops->get(epc_page)) {
> > + list_move_tail(&epc_page->list, &sgx_active_page_list);
> > + spin_unlock(&sgx_active_page_list_lock);
> > + continue;
> > + }
> > + list_del(&epc_page->list);
> > + spin_unlock(&sgx_active_page_list_lock);
> >
> > -static __init bool sgx_is_enabled(void)
> > + if (epc_page->impl->ops->reclaim(epc_page)) {
> > + cluster[j++] = epc_page;
> > + } else {
> > + spin_lock(&sgx_active_page_list_lock);
> > + list_add_tail(&epc_page->list, &sgx_active_page_list);
> > + spin_unlock(&sgx_active_page_list_lock);
> > + epc_page->impl->ops->put(epc_page);
> > + }
> > + }
> > +
> > + for (i = 0; cluster[i]; i++) {
> > + epc_page = cluster[i];
> > + epc_page->impl->ops->block(epc_page);
> > + }
> > +
> > + for (i = 0; cluster[i]; i++) {
> > + epc_page = cluster[i];
> > + epc_page->impl->ops->write(epc_page);
> > + epc_page->impl->ops->put(epc_page);
> > + sgx_free_page(epc_page);
> > + }
> > +}
>
> This is also gloriously free of any superfluous comments. Could you fix
> that?

Yes.

> > +/**
> > + * sgx_try_alloc_page - try to allocate an EPC page
> > + * @impl: implementation for the struct sgx_epc_page
> > + *
> > + * Try to grab a page from the free EPC page list. If there is a free page
> > + * available, it is returned to the caller.
> > + *
> > + * Return:
> > + * a &struct sgx_epc_page instace,
> > + * NULL otherwise
> > + */
> > +struct sgx_epc_page *sgx_try_alloc_page(struct sgx_epc_page_impl *impl)
> > +{
> > + struct sgx_epc_bank *bank;
> > + struct sgx_epc_page *page = NULL;
> > + int i;
> > +
> > + for (i = 0; i < sgx_nr_epc_banks; i++) {
> > + bank = &sgx_epc_banks[i];
>
> What's a bank? How many banks does a system have?

AFAIK, UMA systems have one bank. NUMA have multiple. It is a physical
memory region reserved for enclave pages.

> > + down_write(&bank->lock);
> > +
> > + if (atomic_read(&bank->free_cnt))
> > + page = bank->pages[atomic_dec_return(&bank->free_cnt)];
>
> Why is a semaphore getting used here? I don't see any sleeping or
> anything happening under this lock.

Should be changed to reader-writer spinlock, thanks.

> > + up_write(&bank->lock);
> > +
> > + if (page)
> > + break;
> > + }
> > +
> > + if (page) {
> > + atomic_dec(&sgx_nr_free_pages);
> > + page->impl = impl;
> > + }
> > +
> > + return page;
> > +}
> > +EXPORT_SYMBOL(sgx_try_alloc_page);
> > +
> > +/**
> > + * sgx_alloc_page - allocate an EPC page
> > + * @flags: allocation flags
> > + * @impl: implementation for the struct sgx_epc_page
> > + *
> > + * Try to grab a page from the free EPC page list. If there is a free page
> > + * available, it is returned to the caller. If called with SGX_ALLOC_ATOMIC,
> > + * the function will return immediately if the list is empty. Otherwise, it
> > + * will swap pages up until there is a free page available. Upon returning the
> > + * low watermark is checked and ksgxswapd is waken up if we are below it.
> > + *
> > + * Return:
> > + * a &struct sgx_epc_page instace,
> > + * -ENOMEM if all pages are unreclaimable,
> > + * -EBUSY when called with SGX_ALLOC_ATOMIC and out of free pages
> > + */
> > +struct sgx_epc_page *sgx_alloc_page(struct sgx_epc_page_impl *impl,
> > + unsigned int flags)
> > +{
> > + struct sgx_epc_page *entry;
> > +
> > + for ( ; ; ) {
> > + entry = sgx_try_alloc_page(impl);
> > + if (entry)
> > + break;
> > +
> > + if (list_empty(&sgx_active_page_list))
> > + return ERR_PTR(-ENOMEM);
>
> "active" pages in the VM are allocated/in-use pages. This doesn't look
> to be using the same terminology.
>
> > + if (flags & SGX_ALLOC_ATOMIC) {
> > + entry = ERR_PTR(-EBUSY);
> > + break;
> > + }
> > +
> > + if (signal_pending(current)) {
> > + entry = ERR_PTR(-ERESTARTSYS);
> > + break;
> > + }
> > +
> > + sgx_swap_cluster();
> > + schedule();
>
> What's the schedule trying to do? Is this the equivalent of "direct
> reclaim"? Why do we need this in addition to the ksgxswapd?

It tries to direct reclaim. Ugh, that schedule() call does not make
much sense though...

> > + }
> > +
> > + if (atomic_read(&sgx_nr_free_pages) < SGX_NR_LOW_PAGES)
> > + wake_up(&ksgxswapd_waitq);
> > +
> > + return entry;
> > +}
> > +EXPORT_SYMBOL(sgx_alloc_page);
>
> Why aren't these _GPL exports?

Source files a dual licensed.

> > +/**
> > + * sgx_free_page - free an EPC page
> > + *
> > + * @page: any EPC page
> > + *
> > + * Remove an EPC page and insert it back to the list of free pages.
> > + *
> > + * Return: SGX error code
> > + */
> > +int sgx_free_page(struct sgx_epc_page *page)
> > +{
> > + struct sgx_epc_bank *bank = SGX_EPC_BANK(page);
> > + int ret;
> > +
> > + ret = sgx_eremove(page);
> > + if (ret) {
> > + pr_debug("EREMOVE returned %d\n", ret);
> > + return ret;
> > + }
> > +
> > + down_read(&bank->lock);
> > + bank->pages[atomic_inc_return(&bank->free_cnt) - 1] = page;
> > + atomic_inc(&sgx_nr_free_pages);
> > + up_read(&bank->lock);
> > +
> > + return 0;
> > +}
>
> bank->lock confuses me. This seems to be writing to a bank, but only
> needs a read lock. Why?

It could be either way around:

1. Allow multiple threads that free a page to access the array.
2. Allow multiple threads that alloc a page to access the array.

> > +/**
> > + * sgx_get_page - pin an EPC page
> > + * @page: an EPC page
> > + *
> > + * Return: a pointer to the pinned EPC page
> > + */
> > +void *sgx_get_page(struct sgx_epc_page *page)
> > +{
> > + struct sgx_epc_bank *bank = SGX_EPC_BANK(page);
> > +
> > + if (IS_ENABLED(CONFIG_X86_64))
> > + return (void *)(bank->va + SGX_EPC_ADDR(page) - bank->pa);
> > +
> > + return kmap_atomic_pfn(SGX_EPC_PFN(page));
> > +}
> > +EXPORT_SYMBOL(sgx_get_page);
>
> This is odd. Do you really want to detect 64-bit, or CONFIG_HIGHMEM?

For 32-bit (albeit not supported at this point) it makes sense to always
use kmap_atomic_pfn() as the virtua address area is very limited.

> > +struct page *sgx_get_backing(struct file *file, pgoff_t index)
> > +{
> > + struct inode *inode = file->f_path.dentry->d_inode;
> > + struct address_space *mapping = inode->i_mapping;
> > + gfp_t gfpmask = mapping_gfp_mask(mapping);
> > +
> > + return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
> > +}
> > +EXPORT_SYMBOL(sgx_get_backing);
>
> What does shmem have to do with all this?

Backing storage is an shmem file similarly is in drm.

> > +void sgx_put_backing(struct page *backing_page, bool write)
> > +{
> > + if (write)
> > + set_page_dirty(backing_page);
> > +
> > + put_page(backing_page);
> > +}
> > +EXPORT_SYMBOL(sgx_put_backing);
>
> I'm not a big fan of stuff getting added with no apparent user and no
> explaination of what it is doing. There's no way for me to assess
> whether this is sane or not.

I'll add the documetation.

> > +static __init int sgx_page_cache_init(void)
> > +{
> > + struct task_struct *tsk;
> > + unsigned long size;
> > + unsigned int eax;
> > + unsigned int ebx;
> > + unsigned int ecx;
> > + unsigned int edx;
> > + unsigned long pa;
> > + int i;
> > + int ret;
> > +
> > + for (i = 0; i < SGX_MAX_EPC_BANKS; i++) {
> > + cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC_BANKS, &eax, &ebx,
> > + &ecx, &edx);
> > + if (!(eax & 0xf))
> > + break;
> > +
> > + pa = ((u64)(ebx & 0xfffff) << 32) + (u64)(eax & 0xfffff000);
> > + size = ((u64)(edx & 0xfffff) << 32) + (u64)(ecx & 0xfffff000);
>
> Please align these like I did ^
>
> > + pr_info("EPC bank 0x%lx-0x%lx\n", pa, pa + size);
> > +
> > + ret = sgx_init_epc_bank(pa, size, i, &sgx_epc_banks[i]);
> > + if (ret) {
> > + sgx_page_cache_teardown();
> > + return ret;
> > + }
> > +
> > + sgx_nr_epc_banks++;
> > + }
>
> This is also rather sparsely commented.
>
> > +static __init bool sgx_is_enabled(bool *lc_enabled)
> > {
> > unsigned long fc;
> >
> > @@ -41,12 +466,26 @@ static __init bool sgx_is_enabled(void)
> > if (!(fc & FEATURE_CONTROL_SGX_ENABLE))
> > return false;
> >
> > + *lc_enabled = !!(fc & FEATURE_CONTROL_SGX_LE_WR);
> > +
> > return true;
> > }
>
> I'm baffled why lc_enabled is connected to the enclave page cache.

KVM works only with writable MSRs. Driver works both with writable
and read-only MSRs.

Thanks, I'll try my best to deal with all this :-)

/Jarkko