[PATCH v10 00/18] Add support for Nitro Enclaves
From: Andra Paraschiv
Date: Mon Sep 21 2020 - 08:18:11 EST
Nitro Enclaves (NE) is a new Amazon Elastic Compute Cloud (EC2) capability
that allows customers to carve out isolated compute environments within EC2
instances [1].
For example, an application that processes sensitive data and runs in a VM,
can be separated from other applications running in the same VM. This
application then runs in a separate VM than the primary VM, namely an enclave.
An enclave runs alongside the VM that spawned it. This setup matches low latency
applications needs. The resources that are allocated for the enclave, such as
memory and CPUs, are carved out of the primary VM. Each enclave is mapped to a
process running in the primary VM, that communicates with the NE driver via an
ioctl interface.
In this sense, there are two components:
1. An enclave abstraction process - a user space process running in the primary
VM guest that uses the provided ioctl interface of the NE driver to spawn an
enclave VM (that's 2 below).
There is a NE emulated PCI device exposed to the primary VM. The driver for this
new PCI device is included in the NE driver.
The ioctl logic is mapped to PCI device commands e.g. the NE_START_ENCLAVE ioctl
maps to an enclave start PCI command. The PCI device commands are then
translated into actions taken on the hypervisor side; that's the Nitro
hypervisor running on the host where the primary VM is running. The Nitro
hypervisor is based on core KVM technology.
2. The enclave itself - a VM running on the same host as the primary VM that
spawned it. Memory and CPUs are carved out of the primary VM and are dedicated
for the enclave VM. An enclave does not have persistent storage attached.
The memory regions carved out of the primary VM and given to an enclave need to
be aligned 2 MiB / 1 GiB physically contiguous memory regions (or multiple of
this size e.g. 8 MiB). The memory can be allocated e.g. by using hugetlbfs from
user space [2][3]. The memory size for an enclave needs to be at least 64 MiB.
The enclave memory and CPUs need to be from the same NUMA node.
An enclave runs on dedicated cores. CPU 0 and its CPU siblings need to remain
available for the primary VM. A CPU pool has to be set for NE purposes by an
user with admin capability. See the cpu list section from the kernel
documentation [4] for how a CPU pool format looks.
An enclave communicates with the primary VM via a local communication channel,
using virtio-vsock [5]. The primary VM has virtio-pci vsock emulated device,
while the enclave VM has a virtio-mmio vsock emulated device. The vsock device
uses eventfd for signaling. The enclave VM sees the usual interfaces - local
APIC and IOAPIC - to get interrupts from virtio-vsock device. The virtio-mmio
device is placed in memory below the typical 4 GiB.
The application that runs in the enclave needs to be packaged in an enclave
image together with the OS ( e.g. kernel, ramdisk, init ) that will run in the
enclave VM. The enclave VM has its own kernel and follows the standard Linux
boot protocol [6].
The kernel bzImage, the kernel command line, the ramdisk(s) are part of the
Enclave Image Format (EIF); plus an EIF header including metadata such as magic
number, eif version, image size and CRC.
Hash values are computed for the entire enclave image (EIF), the kernel and
ramdisk(s). That's used, for example, to check that the enclave image that is
loaded in the enclave VM is the one that was intended to be run.
These crypto measurements are included in a signed attestation document
generated by the Nitro Hypervisor and further used to prove the identity of the
enclave; KMS is an example of service that NE is integrated with and that checks
the attestation doc.
The enclave image (EIF) is loaded in the enclave memory at offset 8 MiB. The
init process in the enclave connects to the vsock CID of the primary VM and a
predefined port - 9000 - to send a heartbeat value - 0xb7. This mechanism is
used to check in the primary VM that the enclave has booted. The CID of the
primary VM is 3.
If the enclave VM crashes or gracefully exits, an interrupt event is received by
the NE driver. This event is sent further to the user space enclave process
running in the primary VM via a poll notification mechanism. Then the user space
enclave process can exit.
Thank you.
Andra
[1] https://aws.amazon.com/ec2/nitro/nitro-enclaves/
[2] https://www.kernel.org/doc/html/latest/admin-guide/mm/hugetlbpage.html
[3] https://lwn.net/Articles/807108/
[4] https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
[5] https://man7.org/linux/man-pages/man7/vsock.7.html
[6] https://www.kernel.org/doc/html/latest/x86/boot.html
---
Patch Series Changelog
The patch series is built on top of v5.9-rc6.
GitHub repo branch for the latest version of the patch series:
* https://github.com/andraprs/linux/tree/ne-driver-upstream-v10
* v9 -> v10
* Rebase on top of v5.9-rc6.
* Update commit messages to include the changelog before the SoB tag(s).
* v9: https://github.com/andraprs/linux/tree/ne-driver-upstream-v9
(it seems there have been several issues with the mail delivery to
the mailing list itself, that's why GitHub refs, instead of lore)
v8 -> v9
* Rebase on top of v5.9-rc4.
* Add data structure to keep references to both Nitro Enclaves misc and PCI
devices. These refs are used in the NE PCI device driver probe / remove /
shutdow flow, in the ioctl logic for the enclave VM creation and during the
NE CPU pool setup.
* Update the location of the NE documentation to the "virt" directory and
include it in the corresponding index file.
* Remove -Wall flags from the NE driver Makefile, could use W=1 as an option.
* v8: https://lore.kernel.org/lkml/20200904173718.64857-1-andraprs@xxxxxxxxxx/
v7 -> v8
* Rebase on top of v5.9-rc3.
* Add NE customer error codes for invalid flags and enclave CID provided to the
ioctl calls, and memory regions not backed by pages multiple of 2 MiB.
* Add NE PCI driver shutdown logic.
* Add check for invalid provided enclave CID to the start enclave ioctl.
* Update documentation to include info about the primary / parent VM CID for its
vsock device. Update reference link for huge pages and include refs for the
x86 boot protocol.
* Update sample to track the newly added NE custom error codes and match the
latest logic for the heartbeat enclave boot check.
* v7: https://lore.kernel.org/lkml/20200817131003.56650-1-andraprs@xxxxxxxxxx/
v6 -> v7
* Rebase on top of v5.9-rc1.
* Use the NE misc device parent field to get the NE PCI device.
* Update the naming and add more comments to make more clear the logic of
handling full CPU cores and dedicating them to the enclave.
* Remove, for now, the dependency on ARM64 arch in Kconfig. x86 is currently
supported, with Arm to come afterwards. The NE kernel driver can be currently
built for aarch64 arch.
* Clarify in the ioctls documentation that the return value is -1 and errno is
set on failure.
* Update the error code value for NE_ERR_INVALID_MEM_REGION_SIZE as it gets in
user space as value 25 (ENOTTY) instead of 515. Update the NE custom error
codes values range to not be the same as the ones defined in
include/linux/errno.h, although these are not propagated to user space.
* Update the documentation to include references to the NE PCI device id and
MMIO bar.
* Update check for duplicate user space memory regions to cover additional
possible scenarios.
* Calculate the number of threads per core and not use smp_num_siblings that is
x86 specific.
* v6: https://lore.kernel.org/lkml/20200805091017.86203-1-andraprs@xxxxxxxxxx/
v5 -> v6
* Rebase on top of v5.8.
* Update documentation to kernel-doc format.
* Update sample to include the enclave image loading logic.
* Remove the ioctl to query API version.
* Check for invalid provided flags field via ioctl calls args.
* Check for duplicate provided user space memory regions.
* Check for aligned memory regions.
* Include, in the sample, usage info for NUMA-aware hugetlb config.
* v5: https://lore.kernel.org/lkml/20200715194540.45532-1-andraprs@xxxxxxxxxx/
v4 -> v5
* Rebase on top of v5.8-rc5.
* Add more details about the ioctl calls usage e.g. error codes.
* Update the ioctl to set an enclave vCPU to not return a fd.
* Add specific NE error codes.
* Split the NE CPU pool in CPU cores cpumasks.
* Remove log on copy_from_user() / copy_to_user() failure.
* Release the reference to the NE PCI device on failure paths.
* Close enclave fd on copy_to_user() failure.
* Set empty string in case of invalid NE CPU pool sysfs value.
* Early exit on NE CPU pool setup if enclave(s) already running.
* Add more sanity checks for provided vCPUs e.g. maximum possible value.
* Split logic for checking if a vCPU is in pool / getting a vCPU from pool.
* Exit without unpinning the pages on NE PCI dev request failure.
* Add check for the memory region user space address alignment.
* Update the logic to set memory region to not have a hardcoded check for 2 MiB.
* Add arch dependency for Arm / x86.
* v4: https://lore.kernel.org/lkml/20200622200329.52996-1-andraprs@xxxxxxxxxx/
v3 -> v4
* Rebase on top of v5.8-rc2.
* Add NE API version and the corresponding ioctl call.
* Add enclave / image load flags options.
* Decouple NE ioctl interface from KVM API.
* Remove the "packed" attribute and include padding in the NE data structures.
* Update documentation based on the changes from v4.
* Update sample to match the updates in v4.
* Remove the NE CPU pool init during NE kernel module loading.
* Setup the NE CPU pool at runtime via a sysfs file for the kernel parameter.
* Check if the enclave memory and CPUs are from the same NUMA node.
* Add minimum enclave memory size definition.
* v3: https://lore.kernel.org/lkml/20200525221334.62966-1-andraprs@xxxxxxxxxx/
v2 -> v3
* Rebase on top of v5.7-rc7.
* Add changelog to each patch in the series.
* Remove "ratelimited" from the logs that are not in the ioctl call paths.
* Update static calls sanity checks.
* Remove file ops that do nothing for now.
* Remove GPL additional wording as SPDX-License-Identifier is already in place.
* v2: https://lore.kernel.org/lkml/20200522062946.28973-1-andraprs@xxxxxxxxxx/
v1 -> v2
* Rebase on top of v5.7-rc6.
* Adapt codebase based on feedback from v1.
* Update ioctl number definition - major and minor.
* Add sample / documentation for the ioctl interface basic flow usage.
* Update cover letter to include more context on the NE overall.
* Add fix for the enclave / vcpu fd creation error cleanup path.
* Add fix reported by kbuild test robot <lkp@xxxxxxxxx>.
* v1: https://lore.kernel.org/lkml/20200421184150.68011-1-andraprs@xxxxxxxxxx/
---
Andra Paraschiv (18):
nitro_enclaves: Add ioctl interface definition
nitro_enclaves: Define the PCI device interface
nitro_enclaves: Define enclave info for internal bookkeeping
nitro_enclaves: Init PCI device driver
nitro_enclaves: Handle PCI device command requests
nitro_enclaves: Handle out-of-band PCI device events
nitro_enclaves: Init misc device providing the ioctl interface
nitro_enclaves: Add logic for creating an enclave VM
nitro_enclaves: Add logic for setting an enclave vCPU
nitro_enclaves: Add logic for getting the enclave image load info
nitro_enclaves: Add logic for setting an enclave memory region
nitro_enclaves: Add logic for starting an enclave
nitro_enclaves: Add logic for terminating an enclave
nitro_enclaves: Add Kconfig for the Nitro Enclaves driver
nitro_enclaves: Add Makefile for the Nitro Enclaves driver
nitro_enclaves: Add sample for ioctl interface usage
nitro_enclaves: Add overview documentation
MAINTAINERS: Add entry for the Nitro Enclaves driver
.../userspace-api/ioctl/ioctl-number.rst | 5 +-
Documentation/virt/index.rst | 1 +
Documentation/virt/ne_overview.rst | 95 +
MAINTAINERS | 13 +
drivers/virt/Kconfig | 2 +
drivers/virt/Makefile | 2 +
drivers/virt/nitro_enclaves/Kconfig | 20 +
drivers/virt/nitro_enclaves/Makefile | 9 +
drivers/virt/nitro_enclaves/ne_misc_dev.c | 1733 +++++++++++++++++
drivers/virt/nitro_enclaves/ne_misc_dev.h | 109 ++
drivers/virt/nitro_enclaves/ne_pci_dev.c | 625 ++++++
drivers/virt/nitro_enclaves/ne_pci_dev.h | 327 ++++
include/linux/nitro_enclaves.h | 11 +
include/uapi/linux/nitro_enclaves.h | 359 ++++
samples/nitro_enclaves/.gitignore | 2 +
samples/nitro_enclaves/Makefile | 16 +
samples/nitro_enclaves/ne_ioctl_sample.c | 883 +++++++++
17 files changed, 4211 insertions(+), 1 deletion(-)
create mode 100644 Documentation/virt/ne_overview.rst
create mode 100644 drivers/virt/nitro_enclaves/Kconfig
create mode 100644 drivers/virt/nitro_enclaves/Makefile
create mode 100644 drivers/virt/nitro_enclaves/ne_misc_dev.c
create mode 100644 drivers/virt/nitro_enclaves/ne_misc_dev.h
create mode 100644 drivers/virt/nitro_enclaves/ne_pci_dev.c
create mode 100644 drivers/virt/nitro_enclaves/ne_pci_dev.h
create mode 100644 include/linux/nitro_enclaves.h
create mode 100644 include/uapi/linux/nitro_enclaves.h
create mode 100644 samples/nitro_enclaves/.gitignore
create mode 100644 samples/nitro_enclaves/Makefile
create mode 100644 samples/nitro_enclaves/ne_ioctl_sample.c
--
2.20.1 (Apple Git-117)
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.