Re: [PATCH 0/4] riscv: uaccess: optimizations

From: Linus Torvalds
Date: Sun Jun 30 2024 - 13:00:10 EST


On Tue, 25 Jun 2024 at 11:12, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> But yes, it would be lovely if we did things as "implement the
> low-level accessor functions and we'll wrap them for the generic case"
> rather than have every architecture have to do the wrapping..

Btw, to do that _well_, we need to expand on the user access functions
a bit more.

In particular, we can already implement "get_user()" on top of
user_access_begin() and friends something like this:

#define get_user(x,ptr) ({ \
	__label__ Efault_read; \
	long __err; \
	__typeof__(ptr) __ptr = (ptr); \
	if (likely(user_access_begin(__ptr, sizeof(x)))) { \
		unsafe_get_user(x, __ptr, Efault_read); \
		user_access_end(); \
		__err = 0; \
	} else { \
		if (0) { \
	Efault_read: user_access_end(); \
		} \
		x = (__typeof__(x))0; \
		__err = -EFAULT; \
	} \
	__err; })

and it doesn't generate _horrible_ code. The macro itself looks pretty
bad, but all the error handling goes out-of-line, so on x86-64 (without
debug options) it generates code something like this:

test %rdi,%rdi
js <cap_validate_magic+0x98>
stac
lfence
mov (%rdi),%ecx
clac

which is certainly not horrid. But it's not great, because that lfence
ends up really screwing up the pipeline.

The manually coded out-of-line code generates this instead:

mov %rax,%rdx
sar $0x3f,%rdx
or %rdx,%rax
stac
movzbl (%rax),%edx
xor %eax,%eax
clac
ret

because it doesn't do a conditional branch (with the required spectre
mitigation), but instead turns the address check into a data dependency:
if the address was negative, the masking forces it to "all bits set",
which is guaranteed to cause a page fault.
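
As a tiny standalone illustration of that trick (plain userspace C, not
the kernel code; it uses the same expression as the mask_user_address()
macro in the patch below), sign-extending the top bit leaves a valid
user address alone and forces a negative kernel address to all ones:

#include <stdint.h>
#include <stdio.h>

/* Same idea as the kernel macro: addr | ((long)addr >> 63).
 * Relies on arithmetic right shift of the signed value. */
static uint64_t mask_user_address(uint64_t addr)
{
	return addr | (uint64_t)((int64_t)addr >> 63);
}

int main(void)
{
	/* user address: comes through unchanged */
	printf("%#llx\n", (unsigned long long)mask_user_address(0x00007fffdeadbeefULL));
	/* kernel address: forced to all ones, guaranteed to fault */
	printf("%#llx\n", (unsigned long long)mask_user_address(0xffff888012345678ULL));
	return 0;
}

so the "check" becomes pure data flow, with no conditional branch to
speculate past.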

But we *can* get the user accesses to do the same with a slight
expansion of user_access_begin():

stac
mov %rdi,%rax
sar $0x3f,%rax
or %rdi,%rax
mov (%rax),%eax
clac

by just introducing a notion of "masked_user_access". The get_user()
macro part would look like this:

	__typeof__(ptr) __ptr; \
	__ptr = masked_user_access_begin(ptr); \
	if (1) { \
		unsafe_get_user(x, __ptr, Efault_read); \
		user_access_end(); \

and the patch to implement this is attached. I've had it around for a
while, but I don't know how many architectures can do this.
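
To make the shape concrete, slotting that into the earlier macro would
give roughly this (a sketch of the idea only; the attached patch doesn't
touch get_user() itself, it just converts a couple of library helpers):

#define get_user(x,ptr) ({ \
	__label__ Efault_read; \
	long __err; \
	__typeof__(ptr) __ptr; \
	__ptr = masked_user_access_begin(ptr); \
	if (1) { \
		unsafe_get_user(x, __ptr, Efault_read); \
		user_access_end(); \
		__err = 0; \
	} else { \
		if (0) { \
	Efault_read: user_access_end(); \
		} \
		x = (__typeof__(x))0; \
		__err = -EFAULT; \
	} \
	__err; })

The Efault_read label is still needed because unsafe_get_user() can
fault at runtime; only the up-front access_ok()-style conditional goes
away.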

Note this part of the commit message:

This model only works for dense accesses that start at 'src' and on
architectures that have a guard region that is guaranteed to fault in
between the user space and the kernel space area.

which is true on x86-64, but that "guard region" thing might not be
true everywhere.

Will/Catalin - would that

src = masked_user_access_begin(src);

work on arm64? The code already does something like that with
__uaccess_mask_ptr(), but at least currently it doesn't do the
"avoid the conditional entirely" part; the masking is just done in
_addition_ to the access_ok().
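
(Spelling out the distinction in rough pseudo-code, not the actual
arm64 code: the current model is

	if (!access_ok(ptr, len))
		return -EFAULT;
	ptr = __uaccess_mask_ptr(ptr);	/* masking on top of the check */

while the model above is

	ptr = masked_user_access_begin(ptr);	/* masking instead of the check */

and the second form is only safe if a masked bad pointer is guaranteed
to fault.)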

Linus
From 6b2c9a69efc21b9e6e0497a5661273f6fbe204b2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Mon, 8 Apr 2024 20:04:58 -0700
Subject: [PATCH] x86: support user address masking instead of non-speculative
conditional

The Spectre-v1 mitigations made "access_ok()" much more expensive, since
it has to serialize execution with the test for a valid user address.

All the normal user copy routines avoid this by just masking the user
address with a data-dependent mask instead, but the fast
"unsafe_user_read()" kind of patterns that were supposed to be a fast
case got slowed down.

This introduces a notion of using

src = masked_user_access_begin(src);

to do the user address sanity check using a data-dependent mask instead of the
more traditional conditional

if (user_read_access_begin(src, len)) {

model.

This model only works for dense accesses that start at 'src' and on
architectures that have a guard region that is guaranteed to fault in
between the user space and the kernel space area.

With this, the user access doesn't need to be manually checked, because
a bad address is guaranteed to fault (by some architecture masking
trick: on x86-64 this involves just turning an invalid user address into
all ones, since we don't map the top of address space).

This only converts a couple of examples for now. Example x86-64 code
generation for loading two words from user space:

stac
mov %rax,%rcx
sar $0x3f,%rcx
or %rax,%rcx
mov (%rcx),%r13
mov 0x8(%rcx),%r14
clac

where all the error handling and -EFAULT is now purely handled out of
line by the exception path.

Of course, if the micro-architecture does badly at 'clac' and 'stac',
the above is still pitifully slow. But at least we did as well as we
could.

Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
---
 arch/x86/include/asm/uaccess_64.h | 8 ++++++++
 fs/select.c                       | 4 +++-
 include/linux/uaccess.h           | 7 +++++++
 lib/strncpy_from_user.c           | 9 +++++++++
 lib/strnlen_user.c                | 9 +++++++++
 5 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 04789f45ab2b..a10149a96d9e 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -53,6 +53,14 @@ static inline unsigned long __untagged_addr_remote(struct mm_struct *mm,
  */
 #define valid_user_address(x) ((__force long)(x) >= 0)
 
+/*
+ * Masking the user address is an alternative to a conditional
+ * user_access_begin that can avoid the fencing. This only works
+ * for dense accesses starting at the address.
+ */
+#define mask_user_address(x) ((typeof(x))((long)(x)|((long)(x)>>63)))
+#define masked_user_access_begin(x) ({ __uaccess_begin(); mask_user_address(x); })
+
 /*
  * User pointers can have tag bits on x86-64. This scheme tolerates
  * arbitrary values in those bits rather then masking them off.
diff --git a/fs/select.c b/fs/select.c
index 9515c3fa1a03..bc185d111436 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -780,7 +780,9 @@ static inline int get_sigset_argpack(struct sigset_argpack *to,
 {
 	// the path is hot enough for overhead of copy_from_user() to matter
 	if (from) {
-		if (!user_read_access_begin(from, sizeof(*from)))
+		if (can_do_masked_user_access())
+			from = masked_user_access_begin(from);
+		else if (!user_read_access_begin(from, sizeof(*from)))
 			return -EFAULT;
 		unsafe_get_user(to->p, &from->p, Efault);
 		unsafe_get_user(to->size, &from->size, Efault);
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 3064314f4832..f18371f6cf36 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -32,6 +32,13 @@
 })
 #endif
 
+#ifdef masked_user_access_begin
+ #define can_do_masked_user_access() 1
+#else
+ #define can_do_masked_user_access() 0
+ #define masked_user_access_begin(src) NULL
+#endif
+
 /*
  * Architectures should provide two primitives (raw_copy_{to,from}_user())
  * and get rid of their private instances of copy_{to,from}_user() and
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 6432b8c3e431..989a12a67872 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -120,6 +120,15 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
 	if (unlikely(count <= 0))
 		return 0;
 
+	if (can_do_masked_user_access()) {
+		long retval;
+
+		src = masked_user_access_begin(src);
+		retval = do_strncpy_from_user(dst, src, count, count);
+		user_read_access_end();
+		return retval;
+	}
+
 	max_addr = TASK_SIZE_MAX;
 	src_addr = (unsigned long)untagged_addr(src);
 	if (likely(src_addr < max_addr)) {
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index feeb935a2299..6e489f9e90f1 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -96,6 +96,15 @@ long strnlen_user(const char __user *str, long count)
 	if (unlikely(count <= 0))
 		return 0;
 
+	if (can_do_masked_user_access()) {
+		long retval;
+
+		str = masked_user_access_begin(str);
+		retval = do_strnlen_user(str, count, count);
+		user_read_access_end();
+		return retval;
+	}
+
 	max_addr = TASK_SIZE_MAX;
 	src_addr = (unsigned long)untagged_addr(str);
 	if (likely(src_addr < max_addr)) {