[RFC][CFT][PATCHSET v1] uaccess unification

From: Al Viro
Date: Wed Mar 29 2017 - 01:57:53 EST


We have several primitives for bulk kernel<->userland copying.
That stuff lives in various asm/uaccess.h, with serious code duplication
_and_ seriously inconsistent semantics.

That code has grown a lot of cruft and more than a few bugs.
Some got caught and fixed last year, but some fairly unpleasant ones
still remain. A large part of problem was that a lot of code used to
include <asm/uaccess.h> directly, so we had no single place to work
with. That got finally fixed in 4.10-rc1, when everything had been
forcibly switched to including <linux/uaccess.h>. At that point it
became possible to start getting rid of boilerplate; I hoped to deal
with that by 4.11-rc1, but the things didn't work out and that work
has slipped to this cycle.

The patchset currently in vfs.git#work.uaccess is the result;
there's more work to do, but it takes care of a large part of the
problems. About 2.8KLoc removed, a lot of cruft is gone and semantics
is hopefully in sync now. All but two architectures (ia64 and metag)
had been switched to new mechanism; for these two I'm afraid that I'll
need serious help from maintainers.

Currently we have 8 primitives - 6 on every architecture and 2 more
on biarch ones. All of them have the same calling conventions: arguments
are the same as for memcpy() (void *to, const void *from, unsigned long size)
and the same rules for return value.
If all loads and stores succeed, everything is obvious - the
'size' bytes starting at 'to' become equal to 'size' bytes starting at 'from'
and zero is returned. If some loads or stores fail, non-zero value should
be returned. If any of those primitives returns a positive value N,
* N should be no greater than size
* the values fetched out of from[0..size-N-1] should be stored into the
corresponding bytes of to[0..size-N-1]
* N should not be equal to size unless not a single byte could have
been fetched or stored. As long as that restriction is satisfied, these
primitives are not required to squeeze every possible byte in case some
loads or stores fail.

1) copy_from_user() - 'to' points to kernel memory, 'from' is
normally a userland pointer. This is used for copying structures from
userland in all kinds of ioctls, etc. No faults on access to destination are
allowed, faults on access to source lead to zero-padding the rest of
destination. Note that for architectures with the same address space split
between the kernel and userland (i.e. the ones that have non-trivial
access_ok()) passing a kernel address instead of a userland one should be
treated as 'every access would fail'. In such cases the entire destination
should be zeroed (failure to do so was a fairly common bug).
Note that all these functions, including copy_from_user(), are
affected by set_fs() - when called under set_fs(KERNEL_DS), they expect
kernel pointers where normally a userland one would be given.

2) copy_to_user() - 'from' points to kernel memory, 'to' is
a userland pointer (subject to set_fs() effects, as usual). Again.
this is used by all kinds of code in all kinds of drivers, syscalls, etc.
No faults on access to source, fault on access to destination terminates
copying. No zero-padding, of course - the faults are going to be on store
here. Does not assume that access_ok() had been checked by caller;
given 'to'/'size' that fails access_ok() returns "nothing copied".

3) copy_in_user() - both 'from' and 'to' are in userland. Used
only by compat code that needs to repack 32bit data structures into native
64bit counterparts. As the result, provided only by biarch architectures.
Subject to set_fs(), but really should not be (and AFAICS isn't) used that way.
Some architectures tried to zero-pad, but did it inconsistently and it's
pointless anyway - destination is in userland memory, so no infoleaks would
happen.

4) __copy_from_user_inatomic() - similar to copy_from_user(),
except that
* the caller is presumed to have verified that the source range passes
access_ok() [note that this is does not guarantee the lack of faults]
* most importantly, zero-padding SHOULD NOT happen on short copy.
If implementation fails to guarantee that, it's a bug and potentially
bad one[1].
* it may be called with pagefaults disabled (of course, in
that case any pagefault results in a short copy). That's what 'inatomic'
in the name refers to. Note that actually disabling pagefaults is
up to the caller; blindly calling it e.g. from under a spinlock will just
get you a deadlock. Even more precautions are needed to call it from
something like an interrupt handler - you must do that under set_fs(),
etc. It's not "this variant is safe to call from atomic contexts", it's
"I know what I'm doing, don't worry if you see it in an atomic context".

5) __copy_to_user_inatomic(). A counterpart of
__copy_from_user_inatomic(), except for the direction of copying.

6) __copy_from_user(). Essentially the only difference from
__copy_from_user_inatomic() is that one isn't supposed to call it from
atomic contexts. It may be marginally faster than copy_from_user() (due
to skipped access_ok()), but these days the main costs are not in doing
fairly light arithmetics. In theory, you might do a single access_ok()
covering a large structure and then proceed to call __copy_from_user()
on various parts of that. In practice doing many calls of that thing on
small chunks of data is going to cost a lot on current x86 boxen due to
STAC/CLAC pair inside each call. Has fewer call sites than copy_from_user()
- copy_from_user() is in thousands, while this one has only 40 callers
outside of arch/, some fairly dubious. In arch there's about 170 callers
total, mostly in sigreturn instances.

7) __copy_to_user(). A counterpart of __copy_from_user(), with
pretty much the same considerations applied.

8) __copy_in_user(). Basically, copy_in_user() sans access_ok().
Biarch-only, with the grand total of 6 callers...


What this series does is:

* convert architectures to fewer primitives (raw_copy_{to,from,in}_user(),
the last one only on biarch ones), switching to generic implementations
of the 8 primitives aboves via raw_... ones. Those generic implementations
are in linux/uaccess.h (and lib/usercopy.c). Architecture provides
raw_... ones, selects ARCH_HAS_RAW_COPY_USER and it's done.

* all object size check, kasan, etc. instrumentation is taken care of
in linux/uaccess.h; no need to touch it in arch/*

* consistent semantics wrt zero-padding - none of the raw_... do any of
that, copy_from_user() does (outside of fast path).

At the moment I have that conversion done for everything except ia64 and
metag. Once everything is converted, I'll remove ARCH_HAS_RAW_COPY_USER
and make generic stuff unconditional; at the same point
HAVE_ARCH_HARDENED_USERCOPY will be gone (becoming unconditionally true).

The series lives in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
in #work.uaccess. It's based at 4.11-rc1. Infrastructure is in
#uaccess.stem, then it splits into per-architecture branches (uaccess.<arch>),
eventually merged into #work.uaccess. Some stuff (including a cherry-picked
mips build fix) is in #uaccess.misc, also merged into the final.

I hope that infrastructure part is stable enough to put it into never-rebased
state. Some of per-architecture branches might be even done right; however,
most of them got no testing whatsoever, so any help with testing (as well
as "Al, for fuck sake, dump that garbage of yours, here's the correct patch"
from maintainers) would be very welcome. So would the review, of course.

In particular, the fix in uaccess.parisc should be replaced with the stuff
Helge posted on parisc list, probably along with the get_user/put_user
patches. I've put my variant of fix there as a stopgap; switch of pa_memcpy()
to assembler is clearly the right way to solve it and I'll be happy to
switch to that as soon as parisc folks settle on the final version of that
stuff.

For most of the oddball architectures I have no way to test that stuff, so
please treat the asm-affecting patches in there as a starting point for
doing it right. Some might even work as is - stranger things had happened,
but don't count ont it.

And again, metag and ia64 parts are simply not there - both architectures
zero-pad in __copy_from_user_inatomic() and that really needs fixing.
In case of metag there's __copy_to_user() breakage as well, AFAICS, and
I've been unable to find any documentation describing the architecture
wrt exceptions, and that part is apparently fairly weird. In case of
ia64... I can test mckinley side of things, but not the generic __copy_user()
and ia64 is about as weird as it gets. With no reliable emulator, at that...
So these two are up to respective maintainers.

Other things not there:
* unification of strncpy_from_user() and friends. Probably next
cycle.
* anything to do with uaccess_begin/unsafe accesses/uaccess_end
stuff. Definitely next cycle.

I'm not sure if mailbombing linux-arch would be a good idea; there are
90 patches in that pile, with total size nearly half a megabyte. If anyone
wants that posted, I'll do so, but it might be more convenient to just
use git.

Comments, review, testing, replacement patches, etc. are very welcome.

Al "hates assembers, dozens of them" Viro


[1] Nick Piggin has spotted that bug back in early 2000s, fixed it for
i386 and hadn't bothered to do anything about other architectures (including
amd64, for crying out loud!). Since then we had inconsistent behaviour
between the architectures. Results of those bugs range from transient bogus
values observed in mmap() if you get memory pressure combined with bad timing
to outright fs corruption, if the timing is *really* bad. All architectures
used to have it, hopefully this series will take care of the last stragglers.