Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
From: Kees Cook
Date: Sat May 30 2020 - 13:31:05 EST
On Sat, May 30, 2020 at 01:59:53AM -0400, Gabriel Krisman Bertazi wrote:
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
>
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux. Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.
Interesting use-case! It seems like you're in the position of needing to
invert the assumption about syscalls: before you knew everything going
through WinAPI needed emulation, and now you need to assume everything
not going through a native library needs emulation. Oof.
Is it possible to disassemble and instrument the Windows code to insert
breakpoints (or emulation calls) at all the Windows syscall points?
> [...]
> * Why not SECCOMP_MODE_FILTER?
>
> We experimented with dynamically generating BPF filters for whitelisted
> memory regions and using SECCOMP_MODE_FILTER, but there are a few
> reasons why it isn't enough nor a good idea for our use case:
>
> 1. We cannot set the filters at program initialization time and forget
> about it, since there is no way of knowing which modules will be loaded,
> whether native and windows. Filter would need a way to be updated
> frequently during game execution.
>
> 2. We cannot predict which Linux libraries will issue syscalls directly.
> Most of the time, whitelisting libc and a few other libraries is enough,
> but there are no guarantees other Linux libraries won't issue syscalls
> directly and break the execution. Adding every linux library that is
> loaded also has a large performance cost due to the large resulting
> filter.
Just so I can understand the expected use: given the dynamic nature of
the library loading, how would Wine be marking the VMAs?
> 3. As I mentioned before, performance is critical. In our testing with
> just a single memory segment blacklisted/whitelisted, the minimum size
> of a bpf filter would be 4 instructions. In that scenario,
> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
> time of sysinfo(2) in comparison to seccomp disabled, while the impact
> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.
Was the BPF JIT enabled? I was recently examining filter performance too:
https://lore.kernel.org/linux-security-module/202005291043.A63D910A8@keescook/
> Indeed, points 1 and 2 could be worked around with some userspace work
> and improved SECCOMP_MODE_FILTER support, but at a high performance and
> some stability cost, to obtain the semantics we want. Still, the
> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
> enough that I believe it should be considered as an upstream solution.
It looks like you're using SECCOMP_RET_TRAP for this? Signal handling
can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF?
> Sending as an RFC for now to get the discussion started. In particular:
>
> 1) Would this design be acceptable?
Maybe? I want to spend a little time exploring alternatives first. :)
> 2) I'm not comfortable with using the high part of vm_flags, though I'm
> not sure where else it would fit. Suggestions?
I don't know this area very well. Hopefully some of the mm folks can
comment on it. It seems like there needs to be a more dynamic way mark
VMAs for arbitrary purposes (since this isn't _actually_ a PROT* flag,
it's just a marking for seccomp to check).
Regardless, here's a review of the specifics of patch, without regard to
the design question above. :)
>
> CC: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> CC: Kees Cook <keescook@xxxxxxxxxxxx>
> CC: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> CC: Will Drewry <wad@xxxxxxxxxxxx>
> CC: H. Peter Anvin <hpa@xxxxxxxxx>
> CC: Paul Gofman <gofmanp@xxxxxxxxx>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx>
> ---
> arch/Kconfig | 21 +++++++++
> arch/x86/Kconfig | 1 +
> include/linux/mm.h | 8 ++++
> include/linux/mman.h | 4 +-
> include/uapi/asm-generic/mman-common.h | 2 +
> include/uapi/linux/seccomp.h | 2 +
> kernel/seccomp.c | 59 ++++++++++++++++++++++++++
> mm/mprotect.c | 2 +-
> 8 files changed, 97 insertions(+), 2 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 786a85d4ad40..e5c10231c9c2 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -471,6 +471,27 @@ config SECCOMP_FILTER
>
> See Documentation/userspace-api/seccomp_filter.rst for details.
>
> +config HAVE_ARCH_SECCOMP_MEMMAP
> + bool
> + help
> + An arch should select this symbol if it provides all of these things:
> + - syscall_get_arch()
> + - syscall_get_arguments()
> + - syscall_rollback()
> + - syscall_set_return_value()
> + - SIGSYS siginfo_t support
> + - secure_computing is called from a ptrace_event()-safe context
> + - secure_computing return value is checked and a return value of -1
> + results in the system call being skipped immediately.
> + - seccomp syscall wired up
> +
> +config SECCOMP_MEMMAP
> + def_bool y
> + depends on HAVE_ARCH_SECCOMP_MEMMAP && SECCOMP
> + help
> + Enable tasks to trap syscalls based on the page that triggered
> + the syscall.
> +
Since I don't see anything distinct in the requirements for
SECCOMP_MEMMAP, so I don't think it needs a separate CONFIG: it is almost
entirely built on the SECCOMP_FILTER infrastructure. I'd just drop the
CONFIGs.
> config HAVE_ARCH_STACKLEAK
> bool
> help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2d3f963fd6f1..70998b01c533 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -145,6 +145,7 @@ config X86
> select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT
> select HAVE_ARCH_PREL32_RELOCATIONS
> select HAVE_ARCH_SECCOMP_FILTER
> + select HAVE_ARCH_SECCOMP_MEMMAP
(and it's not arch-specific)
> select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> select HAVE_ARCH_STACKLEAK
> select HAVE_ARCH_TRACEHOOK
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a323422d783..6271fb7fd5bc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -294,11 +294,13 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
> #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
> #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */
> #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
> #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
> #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
> #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
> #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5)
> #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>
> #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
> # define VM_GROWSUP VM_NONE
> #endif
>
> +#if defined(CONFIG_X86)
> +# define VM_NOSYSCALL VM_HIGH_ARCH_5
> +#else
> +# define VM_NOSYSCALL VM_NONE
> +#endif
> +
> /* Bits set in the VMA until the stack is in its final location */
> #define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
>
I have no idea what the correct way to mark VMAs would be. As you
mentioned, this seems ... wrong. :)
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 4b08e9c9c538..a5ca42eb685a 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages)
> */
> static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
> {
> - return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
> + return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM
> + | PROT_NOSYSCALL)) == 0;
> }
> #define arch_validate_prot arch_validate_prot
> #endif
> @@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
> return _calc_vm_trans(prot, PROT_READ, VM_READ ) |
> _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
> _calc_vm_trans(prot, PROT_EXEC, VM_EXEC) |
> + _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) |
> arch_calc_vm_prot_bits(prot, pkey);
> }
>
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..4eafbaa3f4bc 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -13,6 +13,8 @@
> #define PROT_SEM 0x8 /* page may be used for atomic ops */
> /* 0x10 reserved for arch-specific use */
> /* 0x20 reserved for arch-specific use */
> +#define PROT_NOSYSCALL 0x40 /* page cannot trigger syscalls */
> +
> #define PROT_NONE 0x0 /* page can not be accessed */
> #define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
> #define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
AIUI, all of the above is to plumb the VMA marking through an mprotect()
call. I wonder if perhaps madvise() would be better? I'm not sure how
tight we are on new flags there, but I think it would be cleaner to use
that interface. Take a look at MADV_WIPEONFORK / VM_WIPEONFORK.
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index c1735455bc53..c7d042a409e7 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -10,12 +10,14 @@
> #define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
> #define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
> #define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
> +#define SECCOMP_MODE_MEMMAP 3 /* Lock syscalls per memory region. */
Making this incompatible with FILTER might cause problems for the future
(more and more process launchers are starting to set filters).
So this would, perhaps, be a new flag for struct seccomp, rather than a
new operating mode.
>
> /* Valid operations for seccomp syscall. */
> #define SECCOMP_SET_MODE_STRICT 0
> #define SECCOMP_SET_MODE_FILTER 1
> #define SECCOMP_GET_ACTION_AVAIL 2
> #define SECCOMP_GET_NOTIF_SIZES 3
> +#define SECCOMP_SET_MODE_MEMMAP 4
>
> /* Valid flags for SECCOMP_SET_MODE_FILTER */
> #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 55a6184f5990..ebf09b02db8d 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> }
> #endif
>
> +#ifdef CONFIG_SECCOMP_MEMMAP
> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
> +{
> + struct vm_area_struct *vma = find_vma(current->mm,
> + sd->instruction_pointer);
> + if (!vma)
> + BUG();
No new kernel code should use BUG:
https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
I would maybe pr_warn_once(), but then treat it as if it was marked with
VM_NOSYSCALL.
> +
> + if (!(vma->vm_flags & VM_NOSYSCALL))
> + return 0;
> +
> + syscall_rollback(current, task_pt_regs(current));
> + seccomp_send_sigsys(this_syscall, 0);
> +
> + seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
> +
> + return -1;
> +}
This really just looks like an ip_address filter, but I get what you
mean about stacking filters, etc. This may finally be the day we turn to
eBPF in seccomp, since that would give you access to using a map lookup
on ip_address, and the map could be updated externally (removing the
need for the madvise() changes).
> +
> +static long seccomp_set_mode_memmap(unsigned int flags)
> +{
> + const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP;
> + long ret = 0;
> +
> + if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> + return -EINVAL;
This should just test for any bits in flags. TSYNC + memmap should be
avoided, IMO.
> +
> + spin_lock_irq(¤t->sighand->siglock);
> +
> + if (seccomp_may_assign_mode(seccomp_mode))
> + seccomp_assign_mode(current, seccomp_mode, flags);
> + else
> + ret = -EINVAL;
> +
> + spin_unlock_irq(¤t->sighand->siglock);
> +
> + return ret;
> +}
> +#else
> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
> +{
> + BUG();
> +}
> +static long seccomp_set_mode_memmap(unsigned int flags)
> +{
> + return -EINVAL;
> +}
> +#endif /* CONFIG_SECCOMP_MEMMAP */
> +
> int __secure_computing(const struct seccomp_data *sd)
> {
> int mode = current->seccomp.mode;
> @@ -948,6 +997,8 @@ int __secure_computing(const struct seccomp_data *sd)
> return 0;
> case SECCOMP_MODE_FILTER:
> return __seccomp_filter(this_syscall, sd, false);
> + case SECCOMP_MODE_MEMMAP:
> + return __seccomp_memmap(this_syscall, sd);
> default:
> BUG();
> }
> @@ -1425,6 +1476,10 @@ static long do_seccomp(unsigned int op, unsigned int flags,
> return -EINVAL;
>
> return seccomp_get_notif_sizes(uargs);
> + case SECCOMP_SET_MODE_MEMMAP:
> + if (uargs != NULL)
> + return -EINVAL;
> + return seccomp_set_mode_memmap(flags);
> default:
> return -EINVAL;
> }
> @@ -1462,6 +1517,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter)
> op = SECCOMP_SET_MODE_FILTER;
> uargs = filter;
> break;
> + case SECCOMP_MODE_MEMMAP:
> + op = SECCOMP_SET_MODE_MEMMAP;
> + uargs = NULL;
> + break;
The prctl() interface is deprecated, so no new features should be added
there.
> default:
> return -EINVAL;
> }
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 494192ca954b..6b5c00e8bbdc 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
> * cleared from the VMA.
> */
> mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
> - VM_FLAGS_CLEAR;
> + VM_NOSYSCALL | VM_FLAGS_CLEAR;
>
> new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
> newflags = calc_vm_prot_bits(prot, new_vma_pkey);
> --
> 2.27.0.rc2
>
So, yes, interesting. I'll continue to think about how this could work
best. Can you share what your filters looked like when you were
originally trying those?
Thanks!
-Kees
--
Kees Cook