Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

From: Eric W. Biederman
Date: Wed Sep 22 2021 - 10:23:30 EST


Peter Collingbourne <pcc@xxxxxxxxxx> writes:

> This patch introduces a kernel feature known as uaccess logging.
> With uaccess logging, the userspace program passes the address and size
> of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> is a request for the kernel to log any uaccesses made during the next
> syscall to the uaccess buffer. When the next syscall returns, the address
> one past the end of the logged uaccess buffer entries is written to the
> location specified by the fourth argument to the prctl() (the third after
> the option). In this way, the userspace program may enumerate the
> uaccesses logged to the uaccess buffer to determine which accesses
> occurred.
>
> Uaccess logging has several use cases focused around bug detection
> tools:
>
> 1) Userspace memory safety tools such as ASan, MSan, HWASan and tools
> making use of the ARM Memory Tagging Extension (MTE) need to monitor
> all memory accesses in a program so that they can detect memory
> errors. For accesses made purely in userspace, this is achieved
> via compiler instrumentation, or for MTE, via direct hardware
> support. However, accesses made by the kernel on behalf of the
> user program via syscalls (i.e. uaccesses) are invisible to these
> tools. With MTE there is some level of error detection possible in
> the kernel (in synchronous mode, bad accesses generally result in
> returning -EFAULT from the syscall), but by the time we get back to
> userspace we've lost the information about the address and size of the
> failed access, which makes it harder to produce a useful error report.
>
> With the current versions of the sanitizers, we address this by
> interposing the libc syscall stubs with a wrapper that checks the
> memory based on what we believe the uaccesses will be. However, this
> creates a maintenance burden: each syscall must be annotated with
> its uaccesses in order to be recognized by the sanitizer, and these
> annotations must be continuously updated as the kernel changes. This
> is especially burdensome for syscalls such as ioctl(2) which have a
> large surface area of possible uaccesses.
>
> 2) Verifying the validity of kernel accesses. This can be achieved in
> conjunction with the userspace memory safety tools mentioned in (1).
> Even a sanitizer whose syscall wrappers have complete knowledge of
> the kernel's intended API may expect accesses that differ from the
> kernel's actual uaccesses due to kernel bugs. A sanitizer with
> knowledge of the kernel's actual uaccesses may produce more accurate
> error reports that reveal such bugs.
>
> An example of such a bug, which was found by an earlier version of this
> patch together with a prototype client of the API in HWASan, was fixed
> by commit d0efb16294d1 ("net: don't unconditionally copy_from_user
> a struct ifreq for socket ioctls"). Although this bug turned out to be
> relatively harmless, it was a bug nonetheless and it's always possible
> that more serious bugs of this sort may be introduced in the future.
>
> 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> guide a kernel fuzzing tool such as syzkaller (so that it knows which
> parts of user memory to fuzz), as an alternative to providing the tool
> with a list of syscalls and their uaccesses (which again thanks to
> (2) may not be accurate).

How is logging the kernel's activity like this not a significant
information leak? How is this safe for unprivileged users?

Eric



> All signals except SIGKILL and SIGSTOP are masked for the interval
> between the prctl() and the next syscall in order to prevent handlers
> for intervening asynchronous signals from issuing syscalls that may
> cause uaccesses from the wrong syscall to be logged.
>
> The format of a uaccess buffer entry is defined as follows:
>
> struct uaccess_buffer_entry {
>         u64 addr, size, flags;
> };
>
> The addr and size fields hold the user address and the length in bytes
> of the access. On arm64, tag bits are preserved in the addr field. Bit 0
> of the flags field indicates whether the access was a read (clear) or a
> write (set); all other flag bits are reserved. All fields are of type
> u64 in order to avoid compat concerns.
>
> Here is an example of a code snippet that will enumerate the accesses
> performed by a uname(2) syscall:
>
> struct uaccess_buffer_entry entries[64];
> /* Start at the buffer base so that a missing writeback reads as empty. */
> uint64_t entries_end64 = (uint64_t)(uintptr_t)entries;
> struct utsname un;
>
> prctl(PR_LOG_UACCESS, entries, sizeof(entries), &entries_end64, 0);
> uname(&un);
>
> struct uaccess_buffer_entry *entries_end =
>         (struct uaccess_buffer_entry *)(uintptr_t)entries_end64;
> for (struct uaccess_buffer_entry *i = entries; i != entries_end; ++i) {
>         printf("%s at 0x%lx size 0x%lx\n",
>                i->flags & UACCESS_BUFFER_FLAG_WRITE ? "WRITE" : "READ",
>                (unsigned long)i->addr, (unsigned long)i->size);
> }
>
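
For anyone following along, here is a minimal sketch of how a tool might
consume such a log once the end address has been written back. It only
walks entries in memory and tallies them; it does not call prctl(). The
struct is a local mirror of the layout proposed in the patch, and the
names uaccess_summary and summarize_uaccess_log are hypothetical, made up
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Local mirror of the entry layout proposed in the patch. Defined here
 * because the proposed uapi header is not assumed to be installed.
 */
struct uaccess_buffer_entry {
	uint64_t addr, size, flags;
};
#define UACCESS_BUFFER_FLAG_WRITE 1 /* bit 0 set: access was a write */

/* Hypothetical summary type, not part of the patch. */
struct uaccess_summary {
	size_t reads, writes;
	uint64_t bytes_read, bytes_written;
};

/* Walk the entries in [begin, end) and tally reads and writes. */
static struct uaccess_summary
summarize_uaccess_log(const struct uaccess_buffer_entry *begin,
		      const struct uaccess_buffer_entry *end)
{
	struct uaccess_summary s = { 0 };

	assert(begin <= end);
	for (const struct uaccess_buffer_entry *e = begin; e != end; ++e) {
		if (e->flags & UACCESS_BUFFER_FLAG_WRITE) {
			s.writes++;
			s.bytes_written += e->size;
		} else {
			s.reads++;
			s.bytes_read += e->size;
		}
	}
	return s;
}
```

With the snippet quoted above, the call would be
summarize_uaccess_log(entries, entries_end). If the end address is never
written back, entries_end equals entries and the summary is all zeroes.
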
> Uaccess buffers are a "best-effort" mechanism for logging uaccesses. Of
> course, not all of the accesses may fit in the buffer, but aside from
> that, there are syscalls such as async I/O that are currently missed due
> to the uaccesses occurring on a different kernel task (this is analogous
> to how async I/O accesses are exempt from userspace MTE checks). We
> view this as acceptable, as the access buffer can be sized sufficiently
> large to handle syscalls that make a reasonable number of uaccesses,
> and syscalls that use a different task for uaccesses are rare. In
> many cases, the sanitizer does not need to see every memory access,
> so it's fine if we miss the odd uaccess here and there. Even for those
> sanitizers that do need to see every memory access it still represents
> a much lower maintenance burden if we just have to handle the unusual
> syscalls specially.
>
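
Since the log is best-effort, a consumer presumably wants to know when
the buffer may have filled up. Going by the size check in
uaccess_buffer_log() in the patch, the kernel stops logging once fewer
than sizeof(entry) bytes remain, so a rough test could look like the
sketch below. The helper name and the entry-size macro are hypothetical,
and a true result only means that accesses may have been dropped, not
that they were.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Three u64 fields per entry (addr, size, flags), as in the patch. */
#define UACCESS_ENTRY_SIZE (3 * sizeof(uint64_t))

/*
 * Hypothetical helper: returns true when the space left after log_end is
 * too small for another entry, in which case later uaccesses would not
 * have been logged.
 */
static bool uaccess_log_may_be_truncated(uint64_t buf_start, uint64_t buf_size,
					 uint64_t log_end)
{
	assert(log_end >= buf_start);
	assert(log_end - buf_start <= buf_size);

	uint64_t remaining = buf_size - (log_end - buf_start);

	return remaining < UACCESS_ENTRY_SIZE;
}
```

A sanitizer that needs every access could fall back to its existing
syscall annotations whenever this returns true.
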
> Because we don't have a common kernel entry/exit code path that is used
> on all architectures, uaccess logging is only implemented for arm64 and
> architectures that use CONFIG_GENERIC_ENTRY, i.e. x86 and s390.
>
> One downside of this ABI is that it involves making two syscalls per
> "real" syscall, which can harm performance. One possible way to avoid
> this may be to have the prctl() register the uaccess buffer location
> once at thread startup and use the same location for all syscalls in
> the thread. However, because the program may be making syscalls very
> early, before TLS is available, this may not always work. Furthermore,
> because of the same asynchronous signal concerns that prompted temporarily
> masking signals after the prctl(), the syscall stub would need to be made
> reentrant, and it is unclear whether this is feasible without manually
> masking asynchronous signals using rt_sigprocmask(2) while reading the
> uaccess buffer, defeating the purpose of avoiding the extra syscall.
>
> One idea that we considered involved using the stack pointer address as
> a unique identifier for the syscall, but this would need to be
> arch-specific as we currently do not appear to have an arch-generic way
> of retrieving the stack pointer; the userspace side would also need some
> arch-specific code for this to work. It's also possible that a longjmp()
> past the signal handler would make the stack pointer address not unique
> enough for this purpose.
>
> On the other hand, by allocating the uaccess log on the stack and blocking
> asynchronous signals for the interval between the prctl() and the "real"
> syscall, we can avoid any reentrancy and TLS concerns.
>
> Another way to avoid the overhead may be to use an architecture-specific
> calling convention to pass the address of the uaccess buffer to the kernel
> at syscall time in registers currently unused for syscall arguments. For
> example, one arm64-specific scheme that was used in a previous iteration
> of the patch was:
>
> - Bit 0 of the immediate argument to the SVC instruction must be set.
> - Register X6 contains the address of the access buffer.
> - Register X7 contains the size of the access buffer in bytes.
> - On return, X6 will contain the address of the memory location following
> any access buffer entries written by the kernel.
>
> However, this would need to be implemented separately for each
> architecture (and some of them don't have enough registers anyway),
> whereas the prctl() is (at least in theory) architecture-generic.
>
> We also evaluated implementing this on top of the existing tracepoint
> facility, but concluded that it is not suitable for this purpose:
>
> - Tracepoints have a per-task granularity at best, whereas we really want
> to trace per-syscall. This is so that we can exclude syscalls that
> should not be traced, such as syscalls that make up part of the
> sanitizer implementation (to avoid infinite recursion when e.g. printing
> an error report).
>
> - Tracing would need to be synchronous in order to produce useful
> stack traces. For example this could be achieved using the new SIGTRAP
> on perf events mechanism. However, this would require logging each
> access to the stack (in the form of a sigcontext) and this is more
> likely to overflow the stack due to being much larger than a uaccess
> buffer entry as well as being unbounded, in contrast to the bounded
> buffer size passed to prctl(). An approach based on signal handlers is
> also likely to fall foul of the asynchronous signal issues mentioned
> previously, together with needing sigreturn to be handled specially
> (because it copies a sigcontext from userspace) otherwise we could
> never return from the signal handler. Furthermore, arguments to the
> trace events are not available to SIGTRAP. (This on its own wouldn't
> be insurmountable though -- we could add the arguments as fields
> to siginfo.)
>
> - The API in https://www.kernel.org/doc/Documentation/trace/ftrace.txt
> -- e.g. trace_pipe_raw gives access to the internal ring buffer, but
> I don't think it's useable because it's per-CPU and not per-task.
>
> - Tracepoints can be used by eBPF programs, but eBPF programs may
> only be loaded as root, among other potential headaches.
>
> Link: https://linux-review.googlesource.com/id/I6581765646501a5631b281d670903945ebadc57d
> Signed-off-by: Peter Collingbourne <pcc@xxxxxxxxxx>
> ---
> arch/Kconfig | 6 ++
> arch/arm64/Kconfig | 1 +
> arch/arm64/kernel/syscall.c | 2 +
> include/linux/instrumented.h | 5 +-
> include/linux/sched.h | 3 +
> include/linux/uaccess_buffer.h | 43 ++++++++++
> include/linux/uaccess_buffer_info.h | 23 ++++++
> include/uapi/linux/prctl.h | 9 +++
> kernel/Makefile | 1 +
> kernel/entry/common.c | 3 +
> kernel/sys.c | 6 ++
> kernel/uaccess_buffer.c | 118 ++++++++++++++++++++++++++++
> 12 files changed, 219 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/uaccess_buffer.h
> create mode 100644 include/linux/uaccess_buffer_info.h
> create mode 100644 kernel/uaccess_buffer.c
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 8df1c7102643..a427f6440cc9 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -31,6 +31,7 @@ config HOTPLUG_SMT
> bool
>
> config GENERIC_ENTRY
> + select UACCESS_BUFFER
> bool
>
> config KPROBES
> @@ -1288,6 +1289,11 @@ config ARCH_HAS_ELFCORE_COMPAT
> config ARCH_HAS_PARANOID_L1D_FLUSH
> bool
>
> +config UACCESS_BUFFER
> + bool
> + help
> + Select if the architecture's syscall entry/exit code supports uaccess buffers.
> +
> source "kernel/gcov/Kconfig"
>
> source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 5c7ae4c3954b..4764e5fd7ba9 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -221,6 +221,7 @@ config ARM64
> select THREAD_INFO_IN_TASK
> select HAVE_ARCH_USERFAULTFD_MINOR if USERFAULTFD
> select TRACE_IRQFLAGS_SUPPORT
> + select UACCESS_BUFFER
> help
> ARM 64-bit (AArch64) Linux support.
>
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index 50a0f1a38e84..c3f8652d84a5 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -139,7 +139,9 @@ static void el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
> goto trace_exit;
> }
>
> + uaccess_buffer_syscall_entry();
> invoke_syscall(regs, scno, sc_nr, syscall_table);
> + uaccess_buffer_syscall_exit();
>
> /*
> * The tracing status may have changed under our feet, so we have to
> diff --git a/include/linux/instrumented.h b/include/linux/instrumented.h
> index 42faebbaa202..9144936edcb1 100644
> --- a/include/linux/instrumented.h
> +++ b/include/linux/instrumented.h
> @@ -2,7 +2,7 @@
>
> /*
> * This header provides generic wrappers for memory access instrumentation that
> - * the compiler cannot emit for: KASAN, KCSAN.
> + * the compiler cannot emit for: KASAN, KCSAN, access buffers.
> */
> #ifndef _LINUX_INSTRUMENTED_H
> #define _LINUX_INSTRUMENTED_H
> @@ -11,6 +11,7 @@
> #include <linux/kasan-checks.h>
> #include <linux/kcsan-checks.h>
> #include <linux/types.h>
> +#include <linux/uaccess_buffer.h>
>
> /**
> * instrument_read - instrument regular read access
> @@ -117,6 +118,7 @@ instrument_copy_to_user(void __user *to, const void *from, unsigned long n)
> {
> kasan_check_read(from, n);
> kcsan_check_read(from, n);
> + uaccess_buffer_log_write(to, n);
> }
>
> /**
> @@ -134,6 +136,7 @@ instrument_copy_from_user(const void *to, const void __user *from, unsigned long
> {
> kasan_check_write(to, n);
> kcsan_check_write(to, n);
> + uaccess_buffer_log_read(from, n);
> }
>
> #endif /* _LINUX_INSTRUMENTED_H */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e12b524426b0..3fecb0487b97 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -34,6 +34,7 @@
> #include <linux/rseq.h>
> #include <linux/seqlock.h>
> #include <linux/kcsan.h>
> +#include <linux/uaccess_buffer_info.h>
> #include <asm/kmap_size.h>
>
> /* task_struct member predeclarations (sorted alphabetically): */
> @@ -1487,6 +1488,8 @@ struct task_struct {
> struct callback_head l1d_flush_kill;
> #endif
>
> + struct uaccess_buffer_info uaccess_buffer;
> +
> /*
> * New fields for task_struct should be added above here, so that
> * they are included in the randomized portion of task_struct.
> diff --git a/include/linux/uaccess_buffer.h b/include/linux/uaccess_buffer.h
> new file mode 100644
> index 000000000000..3b81f2a192a4
> --- /dev/null
> +++ b/include/linux/uaccess_buffer.h
> @@ -0,0 +1,43 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_H
> +#define _LINUX_ACCESS_BUFFER_H
> +
> +#include <asm-generic/errno-base.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n);
> +void uaccess_buffer_log_write(void __user *to, unsigned long n);
> +
> +void uaccess_buffer_syscall_entry(void);
> +void uaccess_buffer_syscall_exit(void);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> + unsigned long store_end_addr);
> +
> +#else
> +
> +static inline void uaccess_buffer_log_read(const void __user *from,
> + unsigned long n)
> +{
> +}
> +static inline void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> +}
> +
> +static inline void uaccess_buffer_syscall_entry(void)
> +{
> +}
> +static inline void uaccess_buffer_syscall_exit(void)
> +{
> +}
> +
> +static inline int uaccess_buffer_set_logging(unsigned long addr,
> + unsigned long size,
> + unsigned long store_end_addr)
> +{
> + return -EINVAL;
> +}
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_H */
> diff --git a/include/linux/uaccess_buffer_info.h b/include/linux/uaccess_buffer_info.h
> new file mode 100644
> index 000000000000..a6cefe6e73b5
> --- /dev/null
> +++ b/include/linux/uaccess_buffer_info.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_ACCESS_BUFFER_INFO_H
> +#define _LINUX_ACCESS_BUFFER_INFO_H
> +
> +#include <uapi/asm/signal.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +struct uaccess_buffer_info {
> + unsigned long addr, size;
> + unsigned long store_end_addr;
> + sigset_t saved_sigmask;
> + u8 state;
> +};
> +
> +#else
> +
> +struct uaccess_buffer_info {
> +};
> +
> +#endif
> +
> +#endif /* _LINUX_ACCESS_BUFFER_INFO_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 43bd7f713c39..d8baacaef800 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -269,4 +269,13 @@ struct prctl_mm_map {
> # define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */
> # define PR_SCHED_CORE_MAX 4
>
> +/* Log uaccesses to a user-provided buffer */
> +#define PR_LOG_UACCESS 63
> +
> +/* Format of the entries in the uaccess log. */
> +struct uaccess_buffer_entry {
> + __u64 addr, size, flags;
> +};
> +# define UACCESS_BUFFER_FLAG_WRITE 1 /* access was a write */
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 4df609be42d0..75a5d95ce9c3 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -115,6 +115,7 @@ obj-$(CONFIG_KCSAN) += kcsan/
> obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
> obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
> obj-$(CONFIG_CFI_CLANG) += cfi.o
> +obj-$(CONFIG_UACCESS_BUFFER) += uaccess_buffer.o
>
> obj-$(CONFIG_PERF_EVENTS) += events/
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..c7e7ff8cbab3 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -89,6 +89,8 @@ __syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
> if (work & SYSCALL_WORK_ENTER)
> syscall = syscall_trace_enter(regs, syscall, work);
>
> + uaccess_buffer_syscall_entry();
> +
> return syscall;
> }
>
> @@ -273,6 +275,7 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
> local_irq_enable();
> }
>
> + uaccess_buffer_syscall_exit();
> rseq_syscall(regs);
>
> /*
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 8fdac0d90504..df487600773c 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -42,6 +42,7 @@
> #include <linux/version.h>
> #include <linux/ctype.h>
> #include <linux/syscall_user_dispatch.h>
> +#include <linux/uaccess_buffer.h>
>
> #include <linux/compat.h>
> #include <linux/syscalls.h>
> @@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> error = sched_core_share_pid(arg2, arg3, arg4, arg5);
> break;
> #endif
> + case PR_LOG_UACCESS:
> + if (arg5)
> + return -EINVAL;
> + error = uaccess_buffer_set_logging(arg2, arg3, arg4);
> + break;
> default:
> error = -EINVAL;
> break;
> diff --git a/kernel/uaccess_buffer.c b/kernel/uaccess_buffer.c
> new file mode 100644
> index 000000000000..b9da89887c4b
> --- /dev/null
> +++ b/kernel/uaccess_buffer.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/compat.h>
> +#include <linux/prctl.h>
> +#include <linux/sched.h>
> +#include <linux/signal.h>
> +#include <linux/uaccess.h>
> +#include <linux/uaccess_buffer.h>
> +#include <linux/uaccess_buffer_info.h>
> +
> +#ifdef CONFIG_UACCESS_BUFFER
> +
> +/*
> + * We use a separate implementation of copy_to_user() that avoids the call
> + * to instrument_copy_to_user() as this would otherwise lead to infinite
> + * recursion.
> + */
> +static unsigned long
> +uaccess_buffer_copy_to_user(void __user *to, const void *from, unsigned long n)
> +{
> + if (!access_ok(to, n))
> + return n;
> + return raw_copy_to_user(to, from, n);
> +}
> +
> +static void uaccess_buffer_log(unsigned long addr, unsigned long size,
> + unsigned long flags)
> +{
> + struct uaccess_buffer_entry entry;
> +
> + if (current->uaccess_buffer.size < sizeof(entry) ||
> + unlikely(uaccess_kernel()))
> + return;
> + entry.addr = addr;
> + entry.size = size;
> + entry.flags = flags;
> +
> + /*
> + * If our uaccess fails, abort the log so that the end address writeback
> + * does not occur and userspace sees zero accesses.
> + */
> + if (uaccess_buffer_copy_to_user(
> + (void __user *)current->uaccess_buffer.addr, &entry,
> + sizeof(entry))) {
> + current->uaccess_buffer.state = 0;
> + current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> + return;
> + }
> + current->uaccess_buffer.addr += sizeof(entry);
> + current->uaccess_buffer.size -= sizeof(entry);
> +}
> +
> +void uaccess_buffer_log_read(const void __user *from, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)from, n, 0);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_read);
> +
> +void uaccess_buffer_log_write(void __user *to, unsigned long n)
> +{
> + uaccess_buffer_log((unsigned long)to, n, UACCESS_BUFFER_FLAG_WRITE);
> +}
> +EXPORT_SYMBOL(uaccess_buffer_log_write);
> +
> +int uaccess_buffer_set_logging(unsigned long addr, unsigned long size,
> + unsigned long store_end_addr)
> +{
> + sigset_t temp_sigmask;
> +
> + current->uaccess_buffer.addr = addr;
> + current->uaccess_buffer.size = size;
> + current->uaccess_buffer.store_end_addr = store_end_addr;
> +
> + /*
> + * Allow 2 syscalls before resetting the state: the current one (i.e.
> + * prctl) and the next one, whose accesses we want to log.
> + */
> + current->uaccess_buffer.state = 2;
> +
> + /*
> + * Temporarily mask signals so that an intervening asynchronous signal
> + * will not interfere with the logging.
> + */
> + current->uaccess_buffer.saved_sigmask = current->blocked;
> + sigfillset(&temp_sigmask);
> + sigdelsetmask(&temp_sigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> + __set_current_blocked(&temp_sigmask);
> +
> + return 0;
> +}
> +
> +void uaccess_buffer_syscall_entry(void)
> +{
> + /*
> + * The current syscall may be e.g. rt_sigprocmask, and therefore we want
> + * to reset the mask before the syscall and not after, so that our
> + * temporary mask is unobservable.
> + */
> + if (current->uaccess_buffer.state == 1)
> + __set_current_blocked(&current->uaccess_buffer.saved_sigmask);
> +}
> +
> +void uaccess_buffer_syscall_exit(void)
> +{
> + if (current->uaccess_buffer.state > 0) {
> + --current->uaccess_buffer.state;
> + if (current->uaccess_buffer.state == 0) {
> + u64 addr64 = current->uaccess_buffer.addr;
> +
> + uaccess_buffer_copy_to_user(
> + (void __user *)
> + current->uaccess_buffer.store_end_addr,
> + &addr64, sizeof(addr64));
> + current->uaccess_buffer.addr = current->uaccess_buffer.size = 0;
> + }
> + }
> +}
> +
> +#endif
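
To check my reading of the state counter, here is a small userspace model
of the entry and exit hooks as I understand them from the patch. It is a
sketch, not kernel code: the struct and the model_* names are
hypothetical, the signal mask is reduced to a boolean, and the write to
store_end_addr is reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the per-task bookkeeping in the patch. */
struct model_task {
	uint8_t state;     /* syscall exits left before the end address is stored */
	bool mask_applied; /* stands in for the temporary signal mask */
	bool end_stored;   /* stands in for the write to store_end_addr */
};

/* Models uaccess_buffer_set_logging(): arm logging and mask signals. */
static void model_set_logging(struct model_task *t)
{
	t->state = 2;
	t->mask_applied = true;
	t->end_stored = false;
}

/* Models uaccess_buffer_syscall_entry(): unmask just before the logged syscall. */
static void model_syscall_entry(struct model_task *t)
{
	if (t->state == 1)
		t->mask_applied = false;
}

/* Models uaccess_buffer_syscall_exit(): count down, store the end at zero. */
static void model_syscall_exit(struct model_task *t)
{
	if (t->state > 0 && --t->state == 0)
		t->end_stored = true;
}

/* Walk through prctl() followed by the one logged syscall. */
static void model_walkthrough(void)
{
	struct model_task t = { 0 };

	model_syscall_entry(&t);  /* entering prctl() */
	model_set_logging(&t);    /* prctl() handler */
	model_syscall_exit(&t);   /* leaving prctl() */
	assert(t.state == 1 && t.mask_applied && !t.end_stored);

	model_syscall_entry(&t);  /* entering the logged syscall */
	assert(!t.mask_applied);
	model_syscall_exit(&t);   /* leaving the logged syscall */
	assert(t.state == 0 && t.end_stored);
}
```

In this model the mask is in place only between the return from prctl()
and the entry to the following syscall, and the end address is stored
exactly once, on the exit from that following syscall.
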