[PATCH RFC v2] Add support for core-wide protection of IRQ and softirq

From: Joel Fernandes (Google)
Date: Wed May 20 2020 - 18:37:36 EST


With the current core scheduling patchset, non-threaded IRQ and softirq
victims can leak data from their hyperthread to a sibling hyperthread
running an attacker.

For MDS, it is possible for the IRQ and softirq handlers to leak data to
either host or guest attackers. For L1TF, it is possible to leak to
guest attackers. No mitigation based on flushing buffers can avoid
this, since the attacker and the victims execute concurrently on 2 or
more HTs.

The solution in this patch is to monitor the outer-most core-wide
irq_enter() and irq_exit() executed by any sibling. In between these
two, we mark the core to be in a special core-wide IRQ state.

In the IRQ entry, if we detect that the sibling is running untrusted
code, we send a reschedule IPI so that the sibling transitions through
the sibling's irq_exit() to do any waiting there, till the IRQ being
protected finishes.
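
The per-sibling decision made at IRQ entry can be sketched in plain C as
below. This is only an illustrative userspace model: struct model_sibling,
its fields and model_should_ipi() are hypothetical names, not kernel API.
It mirrors the checks in the sibling loop of sched_core_irq_enter() in
this patch (skip offline, idle, kernel-thread and untagged siblings, and
send at most one IPI while a previous one is pending).

```c
#include <stdbool.h>

/* Hypothetical snapshot of one sibling's state; not a kernel structure. */
struct model_sibling {
	bool online;		/* sibling CPU is online */
	bool running_idle;	/* sibling is running the idle task */
	bool has_mm;		/* current task is a user task (has an mm) */
	bool tagged;		/* current task has a cookie, or a core pick exists */
	bool pause_pending;	/* a reschedule IPI was already sent */
};

/* Returns true if a reschedule IPI should be sent, and marks it pending. */
static bool model_should_ipi(struct model_sibling *s)
{
	if (!s->online)
		return false;
	/* Idle and kernel threads are treated as trusted. */
	if (s->running_idle || !s->has_mm)
		return false;
	/* Untagged siblings are not paused. */
	if (!s->tagged)
		return false;
	/* IPI only if a previous IPI is not still pending. */
	if (s->pause_pending)
		return false;
	s->pause_pending = true;
	return true;
}
```

Calling model_should_ipi() twice on the same tagged user-task sibling
returns true and then false, which is the "no useless IPIs" property the
patch aims for.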

We also monitor the per-CPU outer-most irq_exit(). If during the per-cpu
outer-most irq_exit(), the core is still in the special core-wide IRQ
state, we perform a busy-wait till the core exits this state. This
combination of per-cpu and core-wide IRQ states helps to handle any
combination of irq_enter()s and irq_exit()s happening on all of the
siblings of the core in any order.
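
The two-level counting described above can be modelled in userspace as
follows. This is a simplified sketch, not the patch's code: struct
model_core and the model_*() helpers are hypothetical names, and C11
atomics stand in for the rq lock that the patch takes around the
core-wide counter. The patch's exit path additionally looks at the core
cookie and at need_resched before deciding where to wait; that is left
out here.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_SIBLINGS 2

/* Hypothetical model of one core; names are illustrative only. */
struct model_core {
	atomic_uint core_irq_nest;			/* siblings inside an outermost IRQ */
	unsigned int this_irq_nest[MODEL_NR_SIBLINGS];	/* per-CPU nesting depth */
};

/* Returns true if this call moved the core into the core-wide IRQ state. */
static bool model_irq_enter(struct model_core *c, int cpu)
{
	if (++c->this_irq_nest[cpu] != 1)
		return false;	/* nested entry on this CPU, nothing to do */

	/* fetch_add returns the old value; 0 means we are first on the core. */
	return atomic_fetch_add_explicit(&c->core_irq_nest, 1,
					 memory_order_relaxed) == 0;
}

/* Returns true if the caller must wait because another sibling is still in IRQ. */
static bool model_irq_exit(struct model_core *c, int cpu)
{
	if (--c->this_irq_nest[cpu] != 0)
		return false;	/* inner exit on this CPU, nothing to do */

	/* An old value above 1 means some other sibling has not exited yet. */
	return atomic_fetch_sub_explicit(&c->core_irq_nest, 1,
					 memory_order_release) > 1;
}

/* True while any sibling keeps the core in the core-wide IRQ state. */
static bool model_core_in_irq(struct model_core *c)
{
	return atomic_load_explicit(&c->core_irq_nest,
				    memory_order_acquire) > 0;
}
```

A waiter would spin on model_core_in_irq() until it returns false, which
corresponds to the busy-wait performed in the outermost per-CPU irq_exit().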

Lastly, we also check in the schedule loop if we are about to schedule
an untrusted process while the core is in such a state. This is possible
if a trusted thread enters the scheduler by way of yielding CPU. This
would involve no transitions through the irq_exit() point to do any
waiting, so we have to explicitly do the waiting there.
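
As a rough illustration of the scheduler-side check, the predicate below
models when __schedule() would have to wait. struct model_task and both
helpers are hypothetical names for this sketch. "Untrusted" is
approximated the same way the patch does it: not the idle task, has an mm
(so not a kernel thread), and carries a core cookie.

```c
#include <stdbool.h>

/* Hypothetical, minimal view of a task; not the kernel's task_struct. */
struct model_task {
	bool is_idle;			/* this is the idle task */
	bool has_mm;			/* user task (kernel threads have no mm) */
	unsigned long core_cookie;	/* non-zero when the task is tagged */
};

/* A tagged user task is treated as untrusted in this model. */
static bool model_next_is_untrusted(const struct model_task *next)
{
	return !next->is_idle && next->has_mm && next->core_cookie != 0;
}

/*
 * The scheduler has to wait only when it is about to run untrusted code
 * while some sibling still holds the core in the core-wide IRQ state.
 */
static bool model_sched_must_wait(const struct model_task *next,
				  unsigned int core_irq_nest)
{
	return model_next_is_untrusted(next) && core_irq_nest > 0;
}
```

With core_irq_nest at 0 the predicate is always false, so the common case
of switching tasks while no sibling is in an IRQ costs no busy-wait.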

Every attempt is made to avoid unnecessary busy-waiting, and in
testing on real-world ChromeOS use cases, it has not shown a performance
drop. In ChromeOS, with this and the rest of the core scheduling
patchset, we see around a 3x improvement in key press latencies into
Google Docs when camera streaming is running simultaneously (90th
percentile latency of ~150ms drops to ~50ms).

Cc: Julien Desfossez <jdesfossez@xxxxxxxxxxxxxxxx>
Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Cc: Aaron Lu <aaron.lwe@xxxxxxxxx>
Cc: Aubrey Li <aubrey.li@xxxxxxxxxxxxxxx>
Cc: Tim Chen <tim.c.chen@xxxxxxxxx>
Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
Co-developed-by: Vineeth Pillai <vpillai@xxxxxxxxxxxxxxxx>
Signed-off-by: Vineeth Pillai <vpillai@xxxxxxxxxxxxxxxx>
Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>

---
If you would like to see some pictures of the cases handled by this
patch, please see the OSPM slide deck (the link below jumps straight to
the relevant slides, about 6-7 of them in total): https://bit.ly/2zvzxWk

v1->v2:
Fixed a bug where softirq was causing deadlock (thanks Vineeth/Julien)

The issue was because of the following flow:

On CPU0:
local_bh_enable()
-> Enter softirq
-> Softirq takes a lock.
-> <new Interrupt received during softirq>
-> New interrupt's irq_exit() : Wait since it is not outermost
core-wide irq_exit().

On CPU1:
<interrupt received>
irq_enter() -> Enter the core wide IRQ state.
<ISR raises a softirq which will run from irq_exit()>
irq_exit() ->
-> enters softirq
-> softirq tries to take a lock and blocks.

So it is an A->B and B->A deadlock.
A = Enter the core-wide IRQ state or wait for it to end.
B = Acquire a lock during softirq or wait for it to be released.

The fix is to enter the core-wide IRQ state even when entering through
the local_bh_enable() -> softirq path (when there is no hardirq
context), which basically becomes:

On CPU0:
local_bh_enable()
(Fix: Call sched_core_irq_enter() --> similar to irq_enter()).
-> Enter softirq
-> Softirq takes a lock.
-> <new Interrupt received during softirq> -> irq_enter()
-> New interrupt's irq_exit() (Will not wait since we are inner
irq_exit()).
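
The effect of the v2 fix on CPU0 can be replayed with a small userspace
model of the per-CPU nesting counter. The function name and its parameter
are hypothetical and exist only for this sketch. It returns whether the
new interrupt's irq_exit() is the outermost exit on CPU0, which is the
only place where CPU0 could busy-wait while the softirq still holds its
lock.

```c
#include <stdbool.h>

/*
 * Replays the CPU0 sequence from the changelog: a softirq starts from
 * local_bh_enable() and a hard interrupt arrives while it runs.
 *
 * softirq_enters_core_irq_state == false models v1,
 * softirq_enters_core_irq_state == true models v2.
 */
static bool model_nested_irq_exit_is_outermost(bool softirq_enters_core_irq_state)
{
	unsigned int this_irq_nest = 0;
	bool outermost;

	if (softirq_enters_core_irq_state)
		this_irq_nest++;	/* v2: sched_core_irq_enter() at softirq entry */

	this_irq_nest++;		/* irq_enter() of the new interrupt */
	this_irq_nest--;		/* irq_exit() of the new interrupt */
	outermost = (this_irq_nest == 0);

	if (softirq_enters_core_irq_state)
		this_irq_nest--;	/* v2: sched_core_irq_exit() at softirq exit */

	return outermost;
}
```

In this model the v1 sequence reports an outermost exit (so CPU0 may wait
with the lock held), while the v2 sequence reports an inner exit, so any
waiting is deferred until the softirq has finished and released its lock.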

include/linux/sched.h | 8 +++
kernel/sched/core.c | 159 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 +
kernel/softirq.c | 12 ++++
4 files changed, 182 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 710e9a8956007..fe6ae59fcadbe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2018,4 +2018,12 @@ int sched_trace_rq_cpu(struct rq *rq);

const struct cpumask *sched_trace_rd_span(struct root_domain *rd);

+#ifdef CONFIG_SCHED_CORE
+void sched_core_irq_enter(void);
+void sched_core_irq_exit(void);
+#else
+static inline void sched_core_irq_enter(void) { }
+static inline void sched_core_irq_exit(void) { }
+#endif
+
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 21c640170323b..684359ff357e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4391,6 +4391,153 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}

+/*
+ * Helper function to pause the caller's hyperthread until the core exits the
+ * core-wide IRQ state. Obviously the CPU calling this function should not be
+ * responsible for the core being in the core-wide IRQ state otherwise it will
+ * deadlock. This function should be called from irq_exit() and from schedule().
+ * It is up to the callers to decide if calling here is necessary.
+ */
+static inline void sched_core_sibling_irq_pause(struct rq *rq)
+{
+ /*
+ * Wait till the core of this HT is not in a core-wide IRQ state.
+ *
+ * Pair with smp_store_release() in sched_core_irq_exit().
+ */
+ while (smp_load_acquire(&rq->core->core_irq_nest) > 0)
+ cpu_relax();
+}
+
+/*
+ * Enter the core-wide IRQ state. Sibling will be paused if it is running
+ * 'untrusted' code, until sched_core_irq_exit() is called. Every attempt to
+ * avoid sending useless IPIs is made. Must be called with IRQs disabled, from
+ * hard IRQ entry or from softirq entry.
+ */
+void sched_core_irq_enter(void)
+{
+ int i, cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ const struct cpumask *smt_mask;
+
+ if (!sched_core_enabled(rq))
+ return;
+
+ /* Count irq_enter() calls received without irq_exit() on this CPU. */
+ rq->core_this_irq_nest++;
+
+ /* If not outermost irq_enter(), do nothing. */
+ if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) ||
+ rq->core_this_irq_nest != 1)
+ return;
+
+ raw_spin_lock(rq_lockp(rq));
+ smt_mask = cpu_smt_mask(cpu);
+
+ /* Contribute this CPU's irq_enter() to core-wide irq_enter() count. */
+ WRITE_ONCE(rq->core->core_irq_nest, rq->core->core_irq_nest + 1);
+ if (WARN_ON_ONCE(rq->core->core_irq_nest == UINT_MAX))
+ goto unlock;
+
+ if (rq->core_pause_pending) {
+ /*
+ * Do nothing more since we are in a 'reschedule IPI' sent from
+ * another sibling. That sibling would have sent IPIs to all of
+ * the HTs.
+ */
+ goto unlock;
+ }
+
+ /*
+ * If we are not the first ones on the core to enter core-wide IRQ
+ * state, do nothing.
+ */
+ if (rq->core->core_irq_nest > 1)
+ goto unlock;
+
+ /* Do nothing more if the core is not tagged. */
+ if (!rq->core->core_cookie)
+ goto unlock;
+
+ for_each_cpu(i, smt_mask) {
+ struct rq *srq = cpu_rq(i);
+
+ if (i == cpu || cpu_is_offline(i))
+ continue;
+
+ if (!srq->curr->mm || is_idle_task(srq->curr))
+ continue;
+
+ /* Skip if HT is not running a tagged task. */
+ if (!srq->curr->core_cookie && !srq->core_pick)
+ continue;
+
+ /* IPI only if previous IPI was not pending. */
+ if (!srq->core_pause_pending) {
+ srq->core_pause_pending = 1;
+ smp_send_reschedule(i);
+ }
+ }
+unlock:
+ raw_spin_unlock(rq_lockp(rq));
+}
+
+/*
+ * Process any work needed for either exiting the core-wide IRQ state, or for
+ * waiting on this hyperthread if the core is still in this state.
+ */
+void sched_core_irq_exit(void)
+{
+ int cpu = smp_processor_id();
+ struct rq *rq = cpu_rq(cpu);
+ bool wait_here = false;
+ unsigned int nest;
+
+ /* Do nothing if core-sched disabled. */
+ if (!sched_core_enabled(rq))
+ return;
+
+ rq->core_this_irq_nest--;
+
+ /* If not outermost on this CPU, do nothing. */
+ if (WARN_ON_ONCE(rq->core_this_irq_nest == UINT_MAX) ||
+ rq->core_this_irq_nest > 0)
+ return;
+
+ raw_spin_lock(rq_lockp(rq));
+ /*
+ * Core-wide nesting counter can never be 0 because we are
+ * still in it on this CPU.
+ */
+ nest = rq->core->core_irq_nest;
+ WARN_ON_ONCE(!nest);
+
+ /*
+ * If we still have other CPUs in IRQs, we have to wait for them.
+ * Either here, or in the scheduler.
+ */
+ if (rq->core->core_cookie && nest > 1) {
+ /*
+ * If we are entering the scheduler anyway, we can just wait
+ * there for ->core_irq_nest to reach 0. If not, just wait here.
+ */
+ if (!tif_need_resched()) {
+ wait_here = true;
+ }
+ }
+
+ if (rq->core_pause_pending)
+ rq->core_pause_pending = 0;
+
+ /* Pair with smp_load_acquire() in sched_core_sibling_irq_pause(). */
+ smp_store_release(&rq->core->core_irq_nest, nest - 1);
+ raw_spin_unlock(rq_lockp(rq));
+
+ if (wait_here)
+ sched_core_sibling_irq_pause(rq);
+}
+
// XXX fairness/fwd progress conditions
/*
* Returns
@@ -4910,6 +5057,18 @@ static void __sched notrace __schedule(bool preempt)
rq_unlock_irq(rq, &rf);
}

+#ifdef CONFIG_SCHED_CORE
+ /*
+ * If a CPU that was running a trusted task entered the scheduler, and
+ * the next task is untrusted, then check if waiting for core-wide IRQ
+ * state to cease is needed since we would not have been able to get
+ * the services of irq_exit() to do that waiting.
+ */
+ if (sched_core_enabled(rq) &&
+ !is_idle_task(next) && next->mm && next->core_cookie)
+ sched_core_sibling_irq_pause(rq);
+#endif
+
balance_callback(rq);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a7d9f156242e2..3a065d133ef51 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1018,11 +1018,14 @@ struct rq {
unsigned int core_sched_seq;
struct rb_root core_tree;
unsigned char core_forceidle;
+ unsigned char core_pause_pending;
+ unsigned int core_this_irq_nest;

/* shared state */
unsigned int core_task_seq;
unsigned int core_pick_seq;
unsigned long core_cookie;
+ unsigned int core_irq_nest;
#endif
};

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 0427a86743a46..147abd6d82599 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -273,6 +273,13 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);

+ /*
+ * Core scheduling mitigations require entry into softirq to send stall
+ * IPIs to sibling hyperthreads if needed (e.g., sibling is running an
+ * untrusted task). If we are here from irq_exit(), no IPIs are sent.
+ */
+ sched_core_irq_enter();
+
local_irq_enable();

h = softirq_vec;
@@ -305,6 +312,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
rcu_softirq_qs();
local_irq_disable();

+ /* Inform the scheduler about exit from softirq. */
+ sched_core_irq_exit();
+
pending = local_softirq_pending();
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
@@ -345,6 +355,7 @@ asmlinkage __visible void do_softirq(void)
void irq_enter(void)
{
rcu_irq_enter();
+ sched_core_irq_enter();
if (is_idle_task(current) && !in_interrupt()) {
/*
* Prevent raise_softirq from needlessly waking up ksoftirqd
@@ -413,6 +424,7 @@ void irq_exit(void)
invoke_softirq();

tick_irq_exit();
+ sched_core_irq_exit();
rcu_irq_exit();
trace_hardirq_exit(); /* must be last! */
}
--
2.26.2.761.g0e0b3e54be-goog