Re: [PATCH] kernel: introduce prctl(PR_LOG_UACCESS)

From: Peter Collingbourne
Date: Fri Sep 24 2021 - 17:50:21 EST

On Wed, Sep 22, 2021 at 8:59 AM Jann Horn <jannh@xxxxxxxxxx> wrote:
> On Wed, Sep 22, 2021 at 5:30 PM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
> > On Wed, Sep 22, 2021 at 09:23:10AM -0500, Eric W. Biederman wrote:
> > > Peter Collingbourne <pcc@xxxxxxxxxx> writes:
> > >
> > > > This patch introduces a kernel feature known as uaccess logging.
> > > > With uaccess logging, the userspace program passes the address and size
> > > > of a so-called uaccess buffer to the kernel via a prctl(). The prctl()
> > > > is a request for the kernel to log any uaccesses made during the next
> > > > syscall to the uaccess buffer. When the next syscall returns, the address
> > > > one past the end of the logged uaccess buffer entries is written to the
> > > > location specified by the third argument to the prctl(). In this way,
> > > > the userspace program may enumerate the uaccesses logged to the access
> > > > buffer to determine which accesses occurred.
> > > > [...]
> > > > 3) Kernel fuzzing. We may use the list of reported kernel accesses to
> > > > guide a kernel fuzzing tool such as syzkaller (so that it knows which
> > > > parts of user memory to fuzz), as an alternative to providing the tool
> > > > with a list of syscalls and their uaccesses (which again thanks to
> > > > (2) may not be accurate).
> > >
> > > How is logging the kernel's activity like this not a significant
> > > information leak? How is this safe for unprivileged users?
> >
> > This does result in userspace being able to "watch" the kernel progress
> > through a syscall. I'd say it's less dangerous than userfaultfd, but
> > still worrisome. (And userfaultfd is normally disabled[1] for unprivileged
> > users trying to interpose the kernel accessing user memory.)
> >
> > Regardless, this is a pretty useful tool for this kind of fuzzing.
> > Perhaps the timing exposure could be mitigated by having the kernel
> > collect the record in a separate kernel-allocated buffer and flush the
> > results to userspace at syscall exit? (This would solve the
> > copy_to_user() recursion issue too.)

Seems reasonable. I suppose that in terms of timing information we're
already (unavoidably) exposing how long the syscall took overall, and
we probably shouldn't deliberately expose more than that.

That being said, I'm wondering if that has security implications on
its own if it's then possible for userspace to manipulate the kernel
into allocating a large buffer (either at prctl() time or as a result
of getting the kernel to do a large number of uaccesses). Perhaps it
can be mitigated by limiting the size of the uaccess buffer provided
at prctl() time.

> Other than what Kees has already said, the only security concern I
> have with that patch should be trivial to fix: If the ->uaccess_buffer
> machinery writes to current's memory, it must be reset during
> execve(), before switching to the new mm, to prevent the old task from
> causing the kernel to scribble into the new mm.

Yes, that's a good point. I'll fix that in the next version.

> One aspect that might benefit from some clarification on intended
> behavior is: what should happen if there are BPF tracing programs
> running (possibly as part of some kind of system-wide profiling or
> such) that poke around in userspace memory with BPF's uaccess helpers
> (especially "bpf_copy_from_user")?

I think we should probably be ignoring those accesses, since we cannot
know a priori whether the accesses are directly associated with the
syscall or not, and this is after all a best-effort mechanism.

> > I'm pondering what else might be getting exposed by creating this level
> > of probing... kernel addresses would already be getting rejected, so
> > they wouldn't show up in the buffer. Hmm. Jann, any thoughts here?
> >
> >
> > Some other thoughts:
> >
> >
> > Instead of reimplementing copy_*_user() with a new wrapper that
> > bypasses some checks and adds others and has to stay in sync, etc,
> > how about just adding a "recursion" flag? Something like:
> >
> > copy_from_user(...)
> > instrument_copy_from_user(...)
> > uaccess_buffer_log_read(...)
> > if (current->uaccess_buffer.writing)
> > return;
> > uaccess_buffer_log(...)
> > current->uaccess_buffer.writing = true;
> > copy_to_user(...)
> > current->uaccess_buffer.writing = false;
> >
> >
> > How about using this via seccomp instead of a per-syscall prctl? This
> > would mean you would have very specific control over which syscalls
> > should get the uaccess tracing, and wouldn't need to deal with
> > the signal mask (I think). I would imagine something similar to
> > add a new top-level seccomp command, (like SECCOMP_GET_NOTIF_SIZES)
> >
> > This would likely only make sense for SECCOMP_RET_TRACE or _TRAP if the
> > program wants to collect the results after every syscall. And maybe this
> > won't make any sense across exec (losing the mm that was used during
> And then I guess your plan would be that userspace would be expected
> to use the userspace instruction pointer
> (seccomp_data::instruction_pointer) to indicate instructions that
> should be traced?
> Or instead of seccomp, you could do it kinda like
> , with a prctl that specifies a specific instruction pointer?

Given a choice between these two options, I would prefer the prctl()
because userspace programs may already be using seccomp filters and
sanitizers shouldn't interfere with it.

However, in either the seccomp filter or prctl() case, you still have
the problem of deciding where to log to. Keep in mind that you would
need to prevent intervening async signals (that occur between when the
syscall happens and when we read the log) from triggering additional
syscalls that may overwrite the log (as a result of using the same
syscall wrapper). This implies that the log location would need to be
per-syscall, so the location would need to be on the stack (or
equivalent, like a heap-allocated buffer with lifetime tied to that of
the syscall wrapper function). So then, how do you notify the kernel
of that location? One possibility is arch-specific augmentation of the
syscall calling convention, as I mentioned in the initial message. But
then if you're doing that, you might as well dispense with the seccomp
filter or prctl() entirely and have the request for a log be
communicated purely via the calling convention.

It seems like any approach which avoids the multiple syscalls is going
to be arch-specific to some degree. So I think I would prefer to start
with a "slow but simple" API, and let architectures provide their own
arch-specific mechanisms if they wish to optimize it.