Re: [PATCH v38 13/24] x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES

From: Jarkko Sakkinen
Date: Mon Sep 21 2020 - 14:50:02 EST


On Mon, Sep 21, 2020 at 09:46:48AM -0700, Sean Christopherson wrote:
> > This is also true. By corrupt state I meant e.g. a kernel bug, which
> > causes uninitialized pages to go to the free queue.
> >
> > I'd rephrase this in kdoc as: "The function deinitializes the enclave and
> > returns -EIO when EPC is lost, while entering a new power cycle".
>
> The kdocs shouldn't speculate on why EEXTEND might fail. E.g. in some (and
> possibly most) environments, the most common scenario of EEXTEND failure
> will be EPC invalidation due to virtual machine migration.
>
> This is why I'd prefer that the kernel kill the enclave if and only if the
> error is guaranteed to be fatal, e.g. the docs can have a blanket statement
> along the lines of:
>
> An enclave will be killed and its EPC resources will be freed if an error that
> is guaranteed to be fatal is encountered at any time, e.g. if EEXTEND fails as
> EEXTEND can only fail due to loss of EPC, a kernel bug, or silicon bug, all of
> which are unrecoverable.

A kernel bug is not a legit condition. Neither is a silicon failure. We
do not document speculated kernel bugs. If we used that kind of pattern
for documentation, we would have to put similar statements about every
single line of code.

Describing legit failure conditions with the best knowledge available is
the whole reason people read documentation in the first place.
Otherwise, the documentation has absolutely no value.

Documentation is also always, without exception, inaccurate. Leaving
something out is not an issue, as long as it is not done on purpose.

I'd refine what I did as

"The function deinitializes the enclave and returns -EIO when EPC has
been lost, whether by entering a new power cycle or by any other
condition that invalidates EPC."

It is not perfect, nothing ever is, but it is a heck of a lot more
useful than being too generic to fail.

> > > EADD is a little different, e.g. it could fault due to a bad source address,
> > > in which case the failure is not technically fatal. But, Jarkko wanted to
> > > have consistent behavior for EADD and EEXTEND failures, and practically
> > > speaking the enclave is probably hosed anyways if EADD fails, i.e. killing the
> > > enclave on EADD failure isn't a sticking point (for me).
> >
> > We need to figure out its own return value for EADD, but I agree with this.
> >
> > I would go with -EFAULT as we do when the source VMA is not available. Does
> > this make sense to you?
>
> If only EEXTEND will be treated as fatal, then I see no need to worry about
> the return code for EADD. In that case, simply kill the enclave on EEXTEND
> failure instead of on a specific return code.

To have understandable semantics you have to map error codes to
conditions rather than opcodes. -EIO means the enclave was lost because
EPC became invalid. The enclave is already lost, which is why we
deinitialize the kernel data structures.

EADD must have a different error code because nothing is actually lost;
the failure conditions are triggered from outside. -EFAULT would
probably be the most reasonable choice for that.
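
To make the condition-to-error-code mapping concrete, here is a rough
user-space sketch, not code from the patch. The enum, its members and
both function names are hypothetical and exist only for illustration;
only the -EIO and -EFAULT meanings come from the discussion above.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical failure conditions for the page-adding path. These names
 * are illustrative assumptions, not identifiers from the patch set.
 */
enum sgx_fail_cond {
	SGX_FAIL_NONE,		/* no failure */
	SGX_FAIL_BAD_SOURCE,	/* EADD faulted, e.g. on a bad source address */
	SGX_FAIL_EPC_LOST,	/* EPC was invalidated, so the enclave is gone */
};

/* Map a condition (not an opcode) to the error code the caller would see. */
int sgx_add_pages_errno(enum sgx_fail_cond cond)
{
	switch (cond) {
	case SGX_FAIL_BAD_SOURCE:
		return -EFAULT;	/* triggered from outside, nothing is lost */
	case SGX_FAIL_EPC_LOST:
		return -EIO;	/* enclave is already lost */
	case SGX_FAIL_NONE:
	default:
		return 0;
	}
}

/* In this sketch only -EIO tells the caller that the enclave is lost. */
bool sgx_enclave_lost(int ret)
{
	return ret == -EIO;
}
```

With this shape the caller can decide what to do from the return value
alone, without knowing whether EADD or EEXTEND was the opcode that
failed.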

/Jarkko