[PATCH v2 3/4] rseq: Make rseq work with protection keys

From: Dmitry Vyukov
Date: Tue Feb 18 2025 - 02:47:31 EST


If an application registers rseq and later switches to a pkey
protection under which struct rseq becomes inaccessible, then any
context switch causes a failure in __rseq_handle_notify_resume() when
it attempts to read/write struct rseq and/or rseq_cs. Since context
switches are asynchronous and outside the application's control
(not part of the restricted code scope), temporarily switch to a
permissive pkey register to read/write rseq/rseq_cs, similar to how
signal delivery accesses the altstack.

Signed-off-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
Cc: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: Aruna Ramakrishna <aruna.ramakrishna@xxxxxxxxxx>
Cc: x86@xxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx

---
Changes in v2:
- fixed typos and reworded the comment
---
kernel/rseq.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
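
For readers who want to see the control flow in isolation, here is a
minimal userspace sketch of the "try, switch to permissive, retry once,
restore" pattern that the diff below adds. It is only a model under
stated assumptions: every sim_* name is hypothetical and does not exist
in the kernel, the pkey register is reduced to one deny bit per key
(real hardware encodings differ), and the boolean "mapped" stands in
for any failure that is unrelated to pkeys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model only: none of these names exist in the kernel. */
typedef uint32_t sim_pkey_reg_t;

#define SIM_PERMISSIVE 0u		/* no key is denied */

static sim_pkey_reg_t sim_reg;		/* models the per-thread pkey register */
static int sim_access_attempts;		/* counts simulated rseq accesses */

/* Assumption: bit k set in sim_reg means accesses under key k are denied. */
static bool sim_access_rseq(unsigned int rseq_key, bool mapped)
{
	assert(rseq_key < 32);
	sim_access_attempts++;
	if (!mapped)
		return false;	/* fails regardless of the pkey register */
	return (sim_reg & (1u << rseq_key)) == 0;
}

/* Returns 0 on success, -1 where the real code would force SIGSEGV. */
static int sim_handle_notify_resume(unsigned int rseq_key, bool mapped)
{
	sim_pkey_reg_t saved = 0;
	bool switched = false;

retry:
	if (sim_access_rseq(rseq_key, mapped)) {
		if (switched)
			sim_reg = saved;	/* restore the caller's view */
		return 0;
	}
	if (!switched) {
		/* First failure: retry once with a permissive register. */
		switched = true;
		saved = sim_reg;
		sim_reg = SIM_PERMISSIVE;
		goto retry;
	}
	/* Second failure: not a pkey problem, restore and report. */
	sim_reg = saved;
	return -1;
}
```

In this model the retry costs one extra access attempt and happens only
on the failure path, and the register is restored on every exit.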

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 442aba29bc4cf..6fc9f799720cd 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -10,6 +10,7 @@

 #include <linux/sched.h>
 #include <linux/uaccess.h>
+#include <linux/pkeys.h>
 #include <linux/syscalls.h>
 #include <linux/rseq.h>
 #include <linux/types.h>
@@ -403,10 +404,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 {
 	struct task_struct *t = current;
 	int ret, sig;
+	pkey_reg_t saved;
+	bool switched_pkey_reg = false;

 	if (unlikely(t->flags & PF_EXITING))
 		return;

+retry:
 	/*
 	 * regs is NULL if and only if the caller is in a syscall path. Skip
 	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
@@ -419,9 +423,43 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	}
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
+	if (switched_pkey_reg)
+		write_pkey_reg(saved);
 	return;

 error:
+	/*
+	 * If the application registers rseq, and ever switches to another
+	 * pkey protection (such that the rseq becomes inaccessible), then
+	 * any context switch will cause failure here attempting to read/write
+	 * struct rseq and/or rseq_cs. Since context switches are
+	 * asynchronous and are outside the application's control
+	 * (not part of the restricted code scope), temporarily switch
+	 * to a permissive pkey register to read/write rseq/rseq_cs,
+	 * similar to how signal delivery accesses the altstack.
+	 *
+	 * Don't bother to check if the failure really happened due to
+	 * pkeys or not, since it does not matter (performance-wise and
+	 * otherwise).
+	 *
+	 * Note that if code has write access to struct rseq, it may install
+	 * rseq_cs that is not accessible to it due to pkeys. Still let this
+	 * function read such rseq_cs on behalf of the code circumventing
+	 * pkeys protection. It's unclear what benefits the restricted code
+	 * gets by doing this (it presumably has already hijacked control
+	 * flow at this point, or has an arbitrary write primitive to write
+	 * arbitrary values to struct rseq). A sane sandbox should prohibit
+	 * restricted code from accessing struct rseq. Disabling pkeys
+	 * protection is still better than terminating the app
+	 * unconditionally.
+	 */
+	if (!switched_pkey_reg) {
+		switched_pkey_reg = true;
+		saved = switch_to_permissive_pkey_reg();
+		goto retry;
+	} else {
+		write_pkey_reg(saved);
+	}
 	sig = ksig ? ksig->sig : 0;
 	force_sigsegv(sig);
 }
--
2.48.1.601.g30ceb7b040-goog