Re: [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions

From: Reinette Chatre
Date: Tue Mar 08 2022 - 12:50:58 EST


Hi Jarkko,

On 3/8/2022 9:00 AM, Jarkko Sakkinen wrote:
> On Tue, Mar 08, 2022 at 08:04:33AM -0800, Reinette Chatre wrote:
>> Hi Jarkko,
>>
>> On 3/8/2022 1:12 AM, Jarkko Sakkinen wrote:
>>> On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote:
>>>> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
>>>>> On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
>>>>>> Hi Jarkko,
>>>>>>
>>>>>> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
>>>>>>> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
>>>>>>>> === Summary ===
>>>>>>>>
>>>>>>>> An SGX VMA can only be created if its permissions are the same or
>>>>>>>> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
>>>>>>>> creation this same rule is again enforced by the page fault handler:
>>>>>>>> faulted enclave pages are required to have equal or more relaxed
>>>>>>>> EPCM permissions than the VMA permissions.
>>>>>>>>
>>>>>>>> On SGX1 systems the additional enforcement in the page fault handler
>>>>>>>> is redundant and on SGX2 systems it incorrectly prevents access.
>>>>>>>> On SGX1 systems it is unnecessary to repeat the enforcement of the
>>>>>>>> permission rule. The rule used during original VMA creation will
>>>>>>>> ensure that any access attempt will use correct permissions.
>>>>>>>> With SGX2 the EPCM permissions of a page can change after VMA
>>>>>>>> creation resulting in the VMA permissions potentially being more
>>>>>>>> relaxed than the EPCM permissions and the page fault handler
>>>>>>>> incorrectly blocking valid access attempts.
>>>>>>>>
>>>>>>>> Enable the VMA's pages to remain accessible while ensuring that
>>>>>>>> the PTEs are installed to match the EPCM permissions but are not
>>>>>>>> more relaxed than the VMA permissions.
>>>>>>>>
>>>>>>>> === Full Changelog ===
>>>>>>>>
>>>>>>>> An SGX enclave is an area of memory where parts of an application
>>>>>>>> can reside. First an enclave is created and loaded (from
>>>>>>>> non-enclave memory) with the code and data of an application,
>>>>>>>> then user space can map (mmap()) the enclave memory to
>>>>>>>> be able to enter the enclave at its defined entry points for
>>>>>>>> execution within it.
>>>>>>>>
>>>>>>>> The hardware maintains a secure structure, the Enclave Page Cache Map
>>>>>>>> (EPCM), that tracks the contents of the enclave. Of interest here is
>>>>>>>> its tracking of the enclave page permissions. When a page is loaded
>>>>>>>> into the enclave its permissions are specified and recorded in the
>>>>>>>> EPCM. In parallel the kernel maintains permissions within the
>>>>>>>> page table entries (PTEs) and the rule is that PTE permissions
>>>>>>>> are not allowed to be more relaxed than the EPCM permissions.
>>>>>>>>
>>>>>>>> A new mapping (mmap()) of enclave memory can only succeed if the
>>>>>>>> mapping has the same or weaker permissions than the permissions that
>>>>>>>> were vetted during enclave creation. This is enforced by
>>>>>>>> sgx_encl_may_map() that is called on the mmap() as well as mprotect()
>>>>>>>> paths. This rule remains.
>>>>>>>>
>>>>>>>> One feature of SGX2 is to support the modification of EPCM permissions
>>>>>>>> after enclave initialization. Enclave pages may thus already be part
>>>>>>>> of a VMA at the time their EPCM permissions are changed resulting
>>>>>>>> in the VMA's permissions potentially being more relaxed than the EPCM
>>>>>>>> permissions.
>>>>>>>>
>>>>>>>> Allow permissions of existing VMAs to be more relaxed than EPCM
>>>>>>>> permissions in preparation for dynamic EPCM permission changes
>>>>>>>> made possible in SGX2. New VMAs that attempt to have more relaxed
>>>>>>>> permissions than EPCM permissions continue to be unsupported.
>>>>>>>>
>>>>>>>> Reasons why permissions of existing VMAs are allowed to be more relaxed
>>>>>>>> than EPCM permissions instead of dynamically changing VMA permissions
>>>>>>>> when EPCM permissions change are:
>>>>>>>> 1) Changing VMA permissions involves splitting VMAs, which is an
>>>>>>>> operation that can fail. Additionally, changing EPCM permissions of
>>>>>>>> a range of pages could also fail on any of the pages involved.
>>>>>>>> Handling these error cases causes problems. For example, if an
>>>>>>>> EPCM permission change fails and the VMA has already been split
>>>>>>>> then it is not possible to undo the VMA split nor possible to
>>>>>>>> undo the EPCM permission changes that did succeed before the
>>>>>>>> failure.
>>>>>>>> 2) The kernel has little insight into the user space from which EPCM
>>>>>>>> permissions are controlled. For example, a RW page may
>>>>>>>> be made RO just before it is made RX, and splitting the VMAs
>>>>>>>> while they may change again soon is unnecessary.
>>>>>>>>
>>>>>>>> Remove the extra permission check called on a page fault
>>>>>>>> (vm_operations_struct->fault) or during debugging
>>>>>>>> (vm_operations_struct->access) when loading the enclave page from swap
>>>>>>>> that ensures that the VMA permissions are not more relaxed than the
>>>>>>>> EPCM permissions. Since a VMA could only exist if it passed the
>>>>>>>> original permission checks during mmap() and a VMA may indeed
>>>>>>>> have more relaxed permissions than the EPCM permissions this extra
>>>>>>>> permission check is no longer appropriate.
>>>>>>>>
>>>>>>>> With the permission check removed, ensure that PTEs do
>>>>>>>> not blindly inherit the VMA permissions but instead reflect the
>>>>>>>> permissions that the VMA and EPCM agree on. PTEs for writable pages (from VMA
>>>>>>>> and enclave perspective) are installed with the writable bit set,
>>>>>>>> reducing the need for this additional flow to the permission mismatch
>>>>>>>> cases handled next.
>>>>>>>>
>>>>>>>> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
>>>>>>>> ---
>>>>>>>> Changes since V1:
>>>>>>>> - Reword commit message (Jarkko).
>>>>>>>> - Use "relax" instead of "exceed" when referring to permissions (Dave).
>>>>>>>> - Add snippet to Documentation/x86/sgx.rst that highlights the
>>>>>>>> relationship between VMA, EPCM, and PTE permissions on SGX
>>>>>>>> systems (Andy).
>>>>>>>>
>>>>>>>> Documentation/x86/sgx.rst | 10 +++++++++
>>>>>>>> arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
>>>>>>>> 2 files changed, 30 insertions(+), 18 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
>>>>>>>> index 89ff924b1480..5659932728a5 100644
>>>>>>>> --- a/Documentation/x86/sgx.rst
>>>>>>>> +++ b/Documentation/x86/sgx.rst
>>>>>>>> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
>>>>>>>> * PTEs are installed to match the EPCM permissions, but not be more
>>>>>>>> relaxed than the VMA permissions.
>>>>>>>>
>>>>>>>> +On systems supporting SGX2, EPCM permissions may change while the
>>>>>>>> +enclave page belongs to a VMA without impacting the VMA permissions.
>>>>>>>> +This means that a running VMA may appear to allow access to an enclave
>>>>>>>> +page that is not allowed by its EPCM permissions. For example, an
>>>>>>>> +enclave page with RW EPCM permissions may be mapped by a RW VMA but
>>>>>>>> +subsequently changed to have read-only EPCM permissions. The kernel
>>>>>>>> +continues to maintain correct access to the enclave page through the
>>>>>>>> +PTE, which ensures that only access allowed by both the VMA
>>>>>>>> +and EPCM permissions is permitted.
>>>>>>>> +
>>>>>>>> Application interface
>>>>>>>> =====================
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> index 48afe96ae0f0..b6105d9e7c46 100644
>>>>>>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
>>>>>>>> }
>>>>>>>>
>>>>>>>> static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>>>>>>>> - unsigned long addr,
>>>>>>>> - unsigned long vm_flags)
>>>>>>>> + unsigned long addr)
>>>>>>>> {
>>>>>>>> - unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>>>>>>>> struct sgx_epc_page *epc_page;
>>>>>>>> struct sgx_encl_page *entry;
>>>>>>>>
>>>>>>>> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>>>>>>>> if (!entry)
>>>>>>>> return ERR_PTR(-EFAULT);
>>>>>>>>
>>>>>>>> - /*
>>>>>>>> - * Verify that the faulted page has equal or higher build time
>>>>>>>> - * permissions than the VMA permissions (i.e. the subset of {VM_READ,
>>>>>>>> - * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
>>>>>>>> - */
>>>>>>>> - if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
>>>>>>>> - return ERR_PTR(-EFAULT);
>>>>>>>> -
>>>>>>>> /* Entry successfully located. */
>>>>>>>> if (entry->epc_page) {
>>>>>>>> if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>>>>>>>> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>> {
>>>>>>>> unsigned long addr = (unsigned long)vmf->address;
>>>>>>>> struct vm_area_struct *vma = vmf->vma;
>>>>>>>> + unsigned long page_prot_bits;
>>>>>>>> struct sgx_encl_page *entry;
>>>>>>>> + unsigned long vm_prot_bits;
>>>>>>>> unsigned long phys_addr;
>>>>>>>> struct sgx_encl *encl;
>>>>>>>> vm_fault_t ret;
>>>>>>>> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>>
>>>>>>>> mutex_lock(&encl->lock);
>>>>>>>>
>>>>>>>> - entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
>>>>>>>> + entry = sgx_encl_load_page(encl, addr);
>>>>>>>> if (IS_ERR(entry)) {
>>>>>>>> mutex_unlock(&encl->lock);
>>>>>>>>
>>>>>>>> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>>
>>>>>>>> phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
>>>>>>>>
>>>>>>>> - ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
>>>>>>>> + /*
>>>>>>>> + * Insert a PTE that matches the EPCM page permissions, ensured
>>>>>>>> + * not to exceed the VMA permissions.
>>>>>>>> + */
>>>>>>>> + vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>>>>>>>> + page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
>>>>>>>> + /*
>>>>>>>> + * Add VM_SHARED so that PTE is made writable right away if VMA
>>>>>>>> + * and EPCM are writable (no COW in SGX).
>>>>>>>> + */
>>>>>>>> + page_prot_bits |= (vma->vm_flags & VM_SHARED);
>>>>>>>> + ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
>>>>>>>> + vm_get_page_prot(page_prot_bits));
>>>>>>>> if (ret != VM_FAULT_NOPAGE) {
>>>>>>>> mutex_unlock(&encl->lock);
>>>>>>>>
>>>>>>>> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
>>>>>>>> * Load an enclave page to EPC if required, and take encl->lock.
>>>>>>>> */
>>>>>>>> static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
>>>>>>>> - unsigned long addr,
>>>>>>>> - unsigned long vm_flags)
>>>>>>>> + unsigned long addr)
>>>>>>>> {
>>>>>>>> struct sgx_encl_page *entry;
>>>>>>>>
>>>>>>>> for ( ; ; ) {
>>>>>>>> mutex_lock(&encl->lock);
>>>>>>>>
>>>>>>>> - entry = sgx_encl_load_page(encl, addr, vm_flags);
>>>>>>>> + entry = sgx_encl_load_page(encl, addr);
>>>>>>>> if (PTR_ERR(entry) != -EBUSY)
>>>>>>>> break;
>>>>>>>>
>>>>>>>> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
>>>>>>>> return -EFAULT;
>>>>>>>>
>>>>>>>> for (i = 0; i < len; i += cnt) {
>>>>>>>> - entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
>>>>>>>> - vma->vm_flags);
>>>>>>>> + entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
>>>>>>>> if (IS_ERR(entry)) {
>>>>>>>> ret = PTR_ERR(entry);
>>>>>>>> break;
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> If you unconditionally set vm_max_prot_bits to RWX for dynamically created
>>>>>>> pages, you would not need to do this.
>>>>>>>
>>>>>>> These patches could then be safely dropped:
>>>>>>>
>>>>>>> - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions
>>>>>>> - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
>>>>>>> - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
>>>>>>>
>>>>>>> And that would also keep full ABI compatibility without exceptions to the
>>>>>>> existing mainline code.
>>>>>>>
>>>>>>
>>>>>> Dropping these changes does not just impact dynamically created pages. Dropping
>>>>>> these patches would result in EPCM page permission restriction being supported
>>>>>> for all pages, those added before enclave initialization as well as dynamically
>>>>>> added pages, but their PTEs would not be impacted.
>>>>>>
>>>>>> For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
>>>>>> then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux
>>>>>> would keep allowing and installing RW PTEs to this page.
>>>>>
>>>>> I think that would be perfectly fine, if someone wants to do that. There is
>>>>> no collateral damage in doing that. The kernel does not get into a bad state
>>>>> because of it. It's a use case that does not make sense in the first place,
>>>>> so it would be unwise to build anything extensive around it in the kernel.
>>>>>
>>>>> Shooting yourself in the foot is something that the kernel does not and
>>>>> should not protect user space from, unless there is a risk of messing up
>>>>> the state of the kernel itself.
>>>>>
>>>>> Much worse is that we have a completely artificial ioctl,
>>>>> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, to support this scheme, which could
>>>>> e.g. cause extra round trips for a simple EMODPE.
>>>>>
>>>>> Also, this means not having to include 06/32, which keeps 100% backwards
>>>>> compatibility in run-time behaviour with the mainline while not restricting
>>>>> dynamically created pages at all. And we get rid of the complex bookkeeping
>>>>> of vm_run_prot_bits.
>>>>>
>>>>> And generally the whole model is then very easy to understand and explain.
>>>>> If I had to present the current mess in the patch set at a
>>>>> conference, I can honestly say that I would be in serious trouble. It's
>>>>> not a clean and clear security model, which is a risk in itself.
>>>>
>>>> I.e.
>>>>
>>>> 1. For EADD'd pages: stick to what has been the invariant for 1.5 years now.
>>>> Do not change it by any means (e.g. 06/32).
>>>> 2. For EAUG'd pages: set vm_max_prot_bits to RWX, which essentially means do
>>>> whatever you want with the PTEs and EPCM.
>>>>
>>>> It's a clear and understandable model that does nothing bad to the kernel,
>>>> and a run-time developer can surely find a way to get things going. For
>>>> user space, the most important thing is clarity in kernel behaviour,
>>>> and this does deliver that clarity. It's not perfect but it does do the
>>>> job and anyone can get it.
>>>
>>> Also, a quantitative argument for this is that by simplifying the security
>>> model this way it is one ioctl less, which must be considered a +1. We do not
>>> want to add new ioctls unless it is something we absolutely cannot live
>>> without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS.
>>>
>>
>> ok, with the implications understood and accepted I will proceed with a new
>> series that separates EPCM permissions from PTEs and makes RWX PTEs possible
>> by default for EAUG pages. This has broader impact than just removing
>> the three patches you list. "[PATCH 07/32] x86/sgx: Add pfn_mkwrite() handler
>> for present PTEs" is also no longer needed and there is no longer a need
>> to flush PTEs after restricting permissions. New changes also need to
>> be considered - at least to the current documentation. I'll rework the series.
>
> Yes, I really think it is a solid plan. Any possible LSM hooks would most
> likely attach to the build product, not the dynamic behaviour.
>
> As far as the page fault handler goes, Haitao is correct after all the
> discussions that it makes sense. The purpose of the MAP_POPULATE series is
> not to replace it but instead to complement it. Just wanted to clear this
> up as I said otherwise earlier this week.
>

Understood. I will keep the implementation where EAUG is done in page fault
handler. I do plan to pick up your patch "x86/sgx: Export sgx_encl_page_alloc()"
since a consequence of the other changes is that this can now be shared.

Reinette