[PATCH V4 09/10] x86/pks: Add PKS kernel API

From: ira . weiny
Date: Mon Mar 22 2021 - 01:31:52 EST


From: Fenghua Yu <fenghua.yu@xxxxxxxxx>

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain. Export 5 new symbols pks_key_alloc(), pks_mk_noaccess(),
pks_mk_readonly(), pks_mk_readwrite(), and pks_key_free(). Add 2 new
macros; PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor
pages.

Reviewed-by: Dan Williams <dan.j.williams@xxxxxxxxx>
Co-developed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx>
Signed-off-by: Fenghua Yu <fenghua.yu@xxxxxxxxx>

---
Changes from V3:
From Dan Williams
Remove flags from pks_key_alloc()
Convert to ARCH_ENABLE_SUPERVISOR_PKEYS
remove export of update_pkey_val()
Update documentation
change __clear_bit to clear_bit_unlock
remove cpu_feature_enabled from pks_key_free
remove pr_err stubs when CONFIG_HAS_SUPERVISOR_PKEYS=n
clarify pks_key_alloc flags parameter with enum
Update documentation for ARCH_ENABLE_SUPERVISOR_PKEYS
No need to export write_pkrs
Correct Kernel Doc for API functions
From Randy Dunlap:
Fix grammatical errors in doc

Changes from V2
From Greg KH
Replace all WARN_ON_ONCE() uses with pr_err()
From Dan Williams
Add __must_check to pks_key_alloc() to help ensure users
are using the API correctly

Changes from V1
Per Dave Hansen
Add flags to pks_key_alloc() to help future proof the
interface if/when the key space is exhausted.

Changes from RFC V3
Per Dave Hansen
Put WARN_ON_ONCE in pks_key_free()
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change return pks_key_alloc() to EOPNOTSUPP when not
supported or configured
Per Peter Zijlstra
Remove unneeded preempt disable/enable
---
Documentation/core-api/protection-keys.rst | 108 +++++++++++++---
arch/x86/include/asm/pgtable_types.h | 12 ++
arch/x86/include/asm/pks.h | 4 +
arch/x86/mm/pkeys.c | 137 ++++++++++++++++++++-
include/linux/pgtable.h | 4 +
include/linux/pkeys.h | 17 +++
6 files changed, 263 insertions(+), 19 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..6d6c4f25080c 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,30 @@
Memory Protection Keys
======================

-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables
+when an application changes protection domains.

-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later. And it will be available in future
+non-server Intel parts and future AMD processors.

-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains. It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+Protection Keys for Supervisor pages (PKS) is available in the SDM since May
+2020.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys. User and Supervisor pages are
+treated separately.

-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key. Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+Protections for each page are controlled with per-CPU registers for each type
+of page User and Supervisor. Each of these 32-bit register stores two separate
+bits (Access Disable and Write Disable) for each key.
+
+For Userspace the register is user-accessible (rdpkru/wrpkru). For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.

There are two new instructions (RDPKRU/WRPKRU) for reading and writing
to the new register. The feature is only available in 64-bit mode,
@@ -30,8 +35,11 @@ even though there is theoretically space in the PAE PTEs. These
permissions are enforced on data access only and have no effect on
instruction fetches.

-Syscalls
-========
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+============================

There are 3 system calls which directly interact with pkeys::

@@ -98,3 +106,67 @@ with a read()::
The kernel will send a SIGSEGV in both cases, but si_code will be set
to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+Similar to user space pkeys, supervisor pkeys allow additional protections to
+be defined for a supervisor mappings.
+
+The following interface is used to allocate, use, and free a pkey which defines
+a 'protection domain' within the kernel. Setting a pkey value in a supervisor
+PTE adds this additional protection to the page.
+
+Kernel users intending to use PKS support should check (depend on)
+ARCH_HAS_SUPERVISOR_PKEYS and add their config to ARCH_ENABLE_SUPERVISOR_PKEYS
+to turn on this support within the core.
+
+ int pks_key_alloc(const char * const pkey_user);
+ #define PAGE_KERNEL_PKEY(pkey)
+ #define _PAGE_KEY(pkey)
+ void pks_mk_noaccess(int pkey);
+ void pks_mk_readonly(int pkey);
+ void pks_mk_readwrite(int pkey);
+ void pks_key_free(int pkey);
+
+pks_key_alloc() allocates keys dynamically to allow better use of the limited
+key space.
+
+Callers of pks_key_alloc() _must_ be prepared for it to fail and take
+appropriate action. This is due mainly to the fact that PKS may not be
+available on all arch's. Failure to check the return of pks_key_alloc() and
+using any of the rest of the API is undefined.
+
+Keys are allocated with 'No Access' permissions. If other permissions are
+required before the pkey is used, the pks_mk*() family of calls, documented
+below, can be used prior to setting the pkey within the page table entries.
+
+Kernel users must set the pkey in the page table entries for the mappings they
+want to protect. This can be done with PAGE_KERNEL_PKEY() or _PAGE_KEY().
+
+The pks_mk*() family of calls allows kernel users to change the protections for
+the domain identified by the pkey parameter. 3 states are available:
+pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which set the
+access to none, read, and read/write respectively.
+
+Finally, pks_key_free() allows a user to return the key to the allocator for
+use by others.
+
+The interface maintains pks_mk_noaccess() (Access Disabled (AD=1)) for all keys
+not currently allocated. Therefore, the user can depend on access being
+disabled when pks_key_alloc() returns a key and the user should remove mappings
+from the domain (remove the pkey from the PTE) prior to calling pks_key_free().
+
+It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
+but still maintains ordering properties similar to WRPKRU. Thus it is safe to
+immediately use a mapping when the pks_mk*() functions return.
+
+Older versions of the SDM on PKRS may be wrong with regard to this
+serialization. The text should be the same as that of WRPKRU. From the WRPKRU
+text:
+
+ WRPKRU will never execute transiently. Memory accesses
+ affected by PKRU register will not execute (even transiently)
+ until all prior executions of WRPKRU have completed execution
+ and updated the PKRU register.
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f24d7ef8fffa..a3cb274351d9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -73,6 +73,12 @@
_PAGE_PKEY_BIT2 | \
_PAGE_PKEY_BIT3)

+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define _PAGE_PKEY(pkey) (_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0)
+#else
+#define _PAGE_PKEY(pkey) (_AT(pteval_t, 0))
+#endif
+
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
#else
@@ -228,6 +234,12 @@ enum page_cache_mode {
#define PAGE_KERNEL_IO __pgprot_mask(__PAGE_KERNEL_IO)
#define PAGE_KERNEL_IO_NOCACHE __pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)

+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define PAGE_KERNEL_PKEY(pkey) __pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey))
+#else
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
#endif /* __ASSEMBLY__ */

/* xwr */
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index bfa638e17620..4891c9aa8fc7 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -4,6 +4,10 @@

#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS

+/* PKS supports 16 keys. Key 0 is reserved for the kernel. */
+#define PKS_KERN_DEFAULT_KEY 0
+#define PKS_NUM_KEYS 16
+
struct extended_pt_regs {
u32 thread_pkrs;
/* Keep stack 8 byte aligned */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f6a3a54b8d7d..47d29707ac39 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -3,6 +3,9 @@
* Intel Memory Protection Keys management
* Copyright (c) 2015, Intel Corporation.
*/
+#undef pr_fmt
+#define pr_fmt(fmt) "x86/pkeys: " fmt
+
#include <linux/debugfs.h> /* debugfs_create_u32() */
#include <linux/mm_types.h> /* mm_struct, vma, etc... */
#include <linux/pkeys.h> /* PKEY_* */
@@ -11,6 +14,7 @@
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/mmu_context.h> /* vma_pkey() */
#include <asm/fpu/internal.h> /* init_fpstate */
+#include <asm/pks.h>

int __execute_only_pkey(struct mm_struct *mm)
{
@@ -276,4 +280,135 @@ void setup_pks(void)
cr4_set_bits(X86_CR4_PKS);
}

-#endif
+/*
+ * Do not call this directly, see pks_mk*() below.
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ * PKEY_DISABLE_ACCESS
+ * PKEY_DISABLE_WRITE
+ *
+ */
+static inline void pks_update_protection(int pkey, unsigned long protection)
+{
+ current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
+ pkey, protection);
+ write_pkrs(current->thread.saved_pkrs);
+}
+
+/**
+ * pks_mk_noaccess() - Disable all access to the domain
+ * @pkey the pkey for which the access should change.
+ *
+ * Disable all access to the domain specified by pkey. This is a global
+ * update and only affects the current running thread.
+ *
+ * It is a bug for users to call this without a valid pkey returned from
+ * pks_key_alloc()
+ */
+void pks_mk_noaccess(int pkey)
+{
+ pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+}
+EXPORT_SYMBOL_GPL(pks_mk_noaccess);
+
+/**
+ * pks_mk_readonly() - Make the domain Read only
+ * @pkey the pkey for which the access should change.
+ *
+ * Allow read access to the domain specified by pkey. This is a global update
+ * and only affects the current running thread.
+ *
+ * It is a bug for users to call this without a valid pkey returned from
+ * pks_key_alloc()
+ */
+void pks_mk_readonly(int pkey)
+{
+ pks_update_protection(pkey, PKEY_DISABLE_WRITE);
+}
+EXPORT_SYMBOL_GPL(pks_mk_readonly);
+
+/**
+ * pks_mk_readwrite() - Make the domain Read/Write
+ * @pkey the pkey for which the access should change.
+ *
+ * Allow all access, read and write, to the domain specified by pkey. This is
+ * a global update and only affects the current running thread.
+ *
+ * It is a bug for users to call this without a valid pkey returned from
+ * pks_key_alloc()
+ */
+void pks_mk_readwrite(int pkey)
+{
+ pks_update_protection(pkey, 0);
+}
+EXPORT_SYMBOL_GPL(pks_mk_readwrite);
+
+static const char pks_key_user0[] = "kernel";
+
+/* Store names of allocated keys for debug. Key 0 is reserved for the kernel. */
+static const char *pks_key_users[PKS_NUM_KEYS] = {
+ pks_key_user0
+};
+
+/*
+ * Each key is represented by a bit. Bit 0 is set for key 0 and reserved for
+ * its use. We use ulong for the bit operations but only 16 bits are used.
+ */
+static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY;
+
+/**
+ * pks_key_alloc() - Allocate a PKS key
+ * @pkey_user: String stored for debugging of key exhaustion. The caller is
+ * responsible to maintain this memory until pks_key_free().
+ *
+ * Return: pkey if success
+ * -EOPNOTSUPP if pks is not supported or not enabled
+ * -ENOSPC if no keys are available
+ */
+__must_check int pks_key_alloc(const char * const pkey_user)
+{
+ int nr;
+
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return -EOPNOTSUPP;
+
+ while (1) {
+ nr = find_first_zero_bit(&pks_key_allocation_map, PKS_NUM_KEYS);
+ if (nr >= PKS_NUM_KEYS) {
+ pr_info("Cannot allocate supervisor key for %s.\n",
+ pkey_user);
+ return -ENOSPC;
+ }
+ if (!test_and_set_bit_lock(nr, &pks_key_allocation_map))
+ break;
+ }
+
+ /* for debugging key exhaustion */
+ pks_key_users[nr] = pkey_user;
+
+ return nr;
+}
+EXPORT_SYMBOL_GPL(pks_key_alloc);
+
+/**
+ * pks_key_free() - Free a previously allocate PKS key
+ * @pkey: Key to be free'ed
+ */
+void pks_key_free(int pkey)
+{
+ if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY) {
+ pr_err("Invalid PKey value: %d\n", pkey);
+ return;
+ }
+
+ /* Restore to default of no access */
+ pks_mk_noaccess(pkey);
+ pks_key_users[pkey] = NULL;
+ clear_bit_unlock(pkey, &pks_key_allocation_map);
+}
+EXPORT_SYMBOL_GPL(pks_key_free);
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdfc4e9f253e..1e5f4a253e82 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1460,6 +1460,10 @@ static inline bool arch_has_pfn_modify_check(void)
# define PAGE_KERNEL_EXEC PAGE_KERNEL
#endif

+#ifndef PAGE_KERNEL_PKEY
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
/*
* Page Table Modification bits for pgtbl_mod_mask.
*
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index a3d17a8e4e81..6659404af876 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -56,6 +56,13 @@ static inline void copy_init_pkru_to_fpregs(void)
void pkrs_save_set_irq(struct pt_regs *regs, u32 val);
void pkrs_restore_irq(struct pt_regs *regs);

+__must_check int pks_key_alloc(const char *const pkey_user);
+void pks_key_free(int pkey);
+
+void pks_mk_noaccess(int pkey);
+void pks_mk_readonly(int pkey);
+void pks_mk_readwrite(int pkey);
+
#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

#ifndef INIT_PKRS_VALUE
@@ -65,6 +72,16 @@ void pkrs_restore_irq(struct pt_regs *regs);
static inline void pkrs_save_set_irq(struct pt_regs *regs, u32 val) { }
static inline void pkrs_restore_irq(struct pt_regs *regs) { }

+static inline __must_check int pks_key_alloc(const char * const pkey_user)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void pks_key_free(int pkey) {}
+static inline void pks_mk_noaccess(int pkey) {}
+static inline void pks_mk_readonly(int pkey) {}
+static inline void pks_mk_readwrite(int pkey) {}
+
#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

#endif /* _LINUX_PKEYS_H */
--
2.28.0.rc0.12.gb6a658bd00c9