Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()

From: Kai Huang
Date: Mon Mar 22 2021 - 19:17:53 EST


On Mon, 22 Mar 2021 23:37:26 +0100 Borislav Petkov wrote:
> On Tue, Mar 23, 2021 at 11:06:43AM +1300, Kai Huang wrote:
> > This path is called by host SGX driver only, so yes this leaking is done by
> > host enclaves only.
>
> Yes, so I was told.
>
> > This patch is purpose is to break EREMOVE out of sgx_free_epc_page() so virtual
> > EPC code can use sgx_free_epc_page(), and handle EREMOVE logic differently.
> > This patch doesn't have intention to bring functional change. I changed the
> > error msg because Dave said it worth to give a message saying EPC page is
> > leaked, and I thought this minor change won't break anything.
>
> I read that already - you don't have to repeat it.
>
> > btw, currently virtual EPC code (patch 5) handles in similar way: There's one
> > EREMOVE error is expected and virtual EPC code can handle, but for other
> > errors, it means kernel bug, and virtual EPC code gives a WARN(), and that EPC
> > page is leaked too:
> >
> > + WARN_ONCE(ret != SGX_CHILD_PRESENT,
> > + "EREMOVE (EPC page 0x%lx): unexpected error: %d\n",
> > + sgx_get_epc_phys_addr(epc_page), ret);
> > + return ret;
> >
> > So to me they are just WARN() to catch kernel bug.
>
> You don't care about users, do you? Because when that error happens,
> they won't come crying to you to fix it.
>
> Lemme save you some trouble: we don't take half-baked code into the
> kernel until stuff has been discussed and analyzed properly. So instead
> of trying to downplay this, try answering my questions.
>
> Here's another one: when does EREMOVE fail?
>
> /me goes and looks it up
>
> "The instruction fails if the operand is not properly aligned or does
> not refer to an EPC page or the page is in use by another thread, or
> other threads are running in the enclave to which the page belongs. In
> addition the instruction fails if the operand refers to an SECS with
> associations."
>
> And I guess those conditions will become more in the future.
>
> Now, let's play. I'm the cloud admin and you're cloud OS customer
> support. I say:
>
> "I got this scary error message while running enclaves on my server
>
> "EREMOVE returned ... . EPC page leaked. Reboot required to retrieve leaked pages."
>
> but I cannot reboot that machine because there are guests running on it
> and I'm getting paid for those guests and I might get sued if I do?"
>
> Your turn, go wild.

I suppose admin can migrate those VMs, and then engineers can analyse the root
cause of such failure, and then fix it.