Re: [PATCH 5/9] x86, pkeys: allocation/free syscalls
From: Mel Gorman
Date: Thu Jul 07 2016 - 10:40:36 EST
On Thu, Jul 07, 2016 at 05:47:27AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>
> This patch adds two new system calls:
>
> int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
> int pkey_free(int pkey);
>
> These implement an "allocator" for the protection keys
> themselves, which can be thought of as analogous to the allocator
> that the kernel has for file descriptors. The kernel tracks
> which numbers are in use, and only allows operations on keys that
> are valid. A key which was not obtained by pkey_alloc() may not,
> for instance, be passed to pkey_mprotect() (or the forthcoming
> get/set syscalls).
>
Ok, so the last patch wired up the system call before the kernel was
tracking which numbers were in use. It doesn't really matter as such, but
the patches should be swapped around so that the system call is only
exposed once it's actually safe.
> These system calls are also very important given the kernel's use
> of pkeys to implement execute-only support. These help ensure
> that userspace can never assume that it has control of a key
> unless it first asks the kernel.
>
> The 'init_access_rights' argument to pkey_alloc() specifies the
> rights that will be established for the returned pkey. For
> instance:
>
> pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
>
> will allocate 'pkey', but also sets the bits in PKRU[1] such that
> writing to 'pkey' is already denied. This keeps userspace from
> needing to have knowledge about manipulating PKRU with the
> RDPKRU/WRPKRU instructions. Userspace is still free to use these
> instructions as it wishes, but this facility ensures it is no
> longer required.
>
> The kernel does _not_ enforce that this interface must be used for
> changes to PKRU, even for keys it does not control.
>
> The kernel does not prevent pkey_free() from successfully freeing
> in-use pkeys (those still assigned to a memory range by
> pkey_mprotect()). It would be expensive to implement the checks
> for this, so we instead say, "Just don't do it" since sane
> software will never do it anyway.
>
Unfortunately, it could manifest as either corruption, due to an area
that was expected to be protected being accessible, or an unexpected SEGV.
I accept the expense argument but it opens a new class of problems
that userspace debuggers will need to evaluate.
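
To make the failure mode concrete, here is a minimal userspace model of
the allocation map. It is only a sketch: struct model_mm and the model_*
helpers are names invented for illustration, the lowest-free-bit policy
is an assumption, and none of this is the kernel's implementation. It
shows that a region still tagged with a freed key ends up sharing its key
number with the next, unrelated allocation.

```c
#include <assert.h>

/* Hypothetical userspace model of the allocation map; not kernel code. */
#define MODEL_NR_PKEYS 16

struct model_mm {
	unsigned int pkey_allocation_map;	/* bit N set: pkey N is allocated */
};

/* Return the lowest free key and mark it allocated, or -1 if none is left. */
static int model_pkey_alloc(struct model_mm *mm)
{
	for (int k = 0; k < MODEL_NR_PKEYS; k++) {
		if (!(mm->pkey_allocation_map & (1u << k))) {
			mm->pkey_allocation_map |= 1u << k;
			return k;
		}
	}
	return -1;
}

/* Free a key with no check for regions that are still tagged with it. */
static int model_pkey_free(struct model_mm *mm, int pkey)
{
	if (pkey <= 0 || pkey >= MODEL_NR_PKEYS)
		return -1;
	if (!(mm->pkey_allocation_map & (1u << pkey)))
		return -1;
	mm->pkey_allocation_map &= ~(1u << pkey);
	return 0;
}

/*
 * Returns 1 if a region tagged with a freed key ends up sharing its key
 * number with a later, unrelated allocation.
 */
static int model_stale_key_is_reused(void)
{
	struct model_mm mm = { .pkey_allocation_map = 0x1 };	/* pkey 0 in use */
	int region_pkey = model_pkey_alloc(&mm);	/* stands in for a tagged VMA */
	int other_pkey;

	assert(region_pkey > 0);
	model_pkey_free(&mm, region_pkey);	/* the region is still tagged */
	other_pkey = model_pkey_alloc(&mm);	/* unrelated caller */

	return other_pkey == region_pkey;
}
```

In this model the stale region silently follows whatever rights the new
owner sets for the key, which is where both the unexpected access and the
unexpected SEGV come from.
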
> diff -puN arch/x86/include/asm/mmu_context.h~pkeys-116-syscalls-allocation arch/x86/include/asm/mmu_context.h
> --- a/arch/x86/include/asm/mmu_context.h~pkeys-116-syscalls-allocation 2016-07-07 05:47:01.435831049 -0700
> +++ b/arch/x86/include/asm/mmu_context.h 2016-07-07 05:47:01.454831911 -0700
> @@ -108,7 +108,16 @@ static inline void enter_lazy_tlb(struct
> static inline int init_new_context(struct task_struct *tsk,
> struct mm_struct *mm)
> {
> + #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> + if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
> + /* pkey 0 is the default and always allocated */
> + mm->context.pkey_allocation_map = 0x1;
> + /* -1 means unallocated or invalid */
> + mm->context.execute_only_pkey = -1;
> + }
> + #endif
> init_new_context_ldt(tsk, mm);
> +
> return 0;
> }
> static inline void destroy_context(struct mm_struct *mm)
Nothing prevents userspace from modifying the default key with WRPKRU,
or an unallocated key for that matter. However, I also cannot find a case
where it really matters. An application screwing it up may ask mprotect
to do something very unexpected but that's about it.
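
For reference, PKRU holds two bits per key: bit 2*pkey is access-disable
and bit 2*pkey+1 is write-disable. The sketch below only computes the
value userspace could hand to WRPKRU; it does not execute the
instruction. The MODEL_* names and model_pkru_set_rights() are
hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two per-key PKRU bits. */
#define MODEL_DENY_ACCESS	0x1u	/* access-disable bit */
#define MODEL_DENY_WRITE	0x2u	/* write-disable bit */
#define MODEL_NR_PKEYS		16
#define MODEL_PKRU_BITS_PER_PKEY 2

/* Return 'pkru' with the two bits for 'pkey' replaced by 'rights'. */
static uint32_t model_pkru_set_rights(uint32_t pkru, int pkey, uint32_t rights)
{
	int shift;

	/* Nothing here consults an allocation map, and neither does WRPKRU. */
	assert(pkey >= 0 && pkey < MODEL_NR_PKEYS);

	shift = pkey * MODEL_PKRU_BITS_PER_PKEY;
	pkru &= ~(0x3u << shift);		/* clear the old rights */
	pkru |= (rights & 0x3u) << shift;	/* install the new rights */
	return pkru;
}
```

The computation works for key 0 and for keys that were never handed out,
which is the point: the allocation map is bookkeeping for the syscalls,
not an enforcement mechanism for PKRU.
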
> +static inline
> +bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey)
> +{
> + if (!validate_pkey(pkey))
> + return true;
> +
> + return mm_pkey_allocation_map(mm) & (1 << pkey);
> +}
> +
We flip-flop between whether pkey is signed or unsigned: it is an
unsigned long here but an int in pkey_alloc() and pkey_free().
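
As a rough illustration of settling on one type, the sketch below keeps
pkey as an int everywhere and range-checks it before it is used as a
shift count. The model_* names are hypothetical. Treating an invalid key
as "not allocated" is a choice made for this sketch and is not what the
quoted hunk does.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_PKEYS 16

/* The shift below is only well defined while keys fit in the map type. */
static_assert(MODEL_NR_PKEYS <= 32, "allocation map must fit in 32 bits");

/* A negative key fails here instead of turning into a huge unsigned value. */
static bool model_validate_pkey(int pkey)
{
	return pkey >= 0 && pkey < MODEL_NR_PKEYS;
}

/* Test the allocation bit only after the key has been range-checked. */
static bool model_pkey_is_allocated(unsigned int allocation_map, int pkey)
{
	if (!model_validate_pkey(pkey))
		return false;	/* assumption of this sketch */

	return (allocation_map & (1u << pkey)) != 0;
}
```

With a single signed type, the -1 used elsewhere in the patch as the
"unallocated or invalid" value is rejected by the range check rather than
reaching the shift.
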
> +SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
> +{
> + int pkey;
> + int ret;
> +
> + /* No flags supported yet. */
> + if (flags)
> + return -EINVAL;
> + /* check for unsupported init values */
> + if (init_val & ~PKEY_ACCESS_MASK)
> + return -EINVAL;
> +
> + down_write(&current->mm->mmap_sem);
> + pkey = mm_pkey_alloc(current->mm);
> +
> + ret = -ENOSPC;
> + if (pkey == -1)
> + goto out;
> +
> + ret = arch_set_user_pkey_access(current, pkey, init_val);
> + if (ret) {
> + mm_pkey_free(current->mm, pkey);
> + goto out;
> + }
> + ret = pkey;
> +out:
> + up_write(&current->mm->mmap_sem);
> + return ret;
> +}
It's not wrong as such but mmap_sem taken for write seems *extremely*
heavy to protect the allocation mask. If userspace is racing a key
allocation with mprotect, it's already game over in terms of random
behaviour.

I've no idea what the frequency of pkey alloc/free is expected to be. If
it's really low then maybe it doesn't matter but if it's high this is
going to be a bottleneck later.
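
Purely as an illustration of a lighter-weight scheme, the allocation
mask alone could be updated with a compare-and-swap loop. This is a
userspace sketch using C11 atomics with hypothetical model_* names. It
covers only the bitmap and says nothing about the PKRU setup that
pkey_alloc() also performs, so treat it as a thought experiment rather
than a proposed replacement.

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_NR_PKEYS 16

static_assert(MODEL_NR_PKEYS <= 32, "allocation map must fit in 32 bits");

/* Return the lowest clear bit in 'map', or -1 if all keys are taken. */
static int model_lowest_clear_bit(unsigned int map)
{
	for (int k = 0; k < MODEL_NR_PKEYS; k++)
		if (!(map & (1u << k)))
			return k;
	return -1;
}

/* Claim a free key without holding any lock. */
static int model_pkey_alloc_atomic(atomic_uint *map)
{
	unsigned int old = atomic_load(map);

	for (;;) {
		int pkey = model_lowest_clear_bit(old);

		if (pkey < 0)
			return -1;
		/* On failure 'old' is refreshed with the current map; retry. */
		if (atomic_compare_exchange_weak(map, &old, old | (1u << pkey)))
			return pkey;
	}
}

/* Release a key; pkey 0 and keys that are not allocated are rejected. */
static int model_pkey_free_atomic(atomic_uint *map, int pkey)
{
	unsigned int bit, old;

	if (pkey <= 0 || pkey >= MODEL_NR_PKEYS)
		return -1;

	bit = 1u << pkey;
	old = atomic_fetch_and(map, ~bit);
	return (old & bit) ? 0 : -1;
}
```

Two racing allocators can never receive the same key in this model, but
it does nothing for an allocation racing with mprotect, which is already
game over as noted above.
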
> +
> +SYSCALL_DEFINE1(pkey_free, int, pkey)
> +{
> + int ret;
> +
> + down_write(&current->mm->mmap_sem);
> + ret = mm_pkey_free(current->mm, pkey);
> + up_write(&current->mm->mmap_sem);
> +
> + /*
> + * We could provide warnings or errors if any VMA still
> + * has the pkey set here.
> + */
> + return ret;
> +}
> _
--
Mel Gorman
SUSE Labs