[PATCH v2 2/2] x86/resctrl: Don't workqueue local event counter reads
From: Peter Newman
Date: Wed Nov 06 2024 - 10:44:07 EST
Performance-conscious users may use threads bound to CPUs within a
specific monitoring domain to ensure that all bandwidth counters can be
read efficiently. The hardware counters are only accessible to CPUs
within the domain, so requests from CPUs outside the domain are
forwarded to a kernel worker or IPI handler, incurring a substantial
performance penalty on each read. Recently, this penalty was observed
to be paid by local reads as well.
To support blocking implementations of resctrl_arch_rmid_read(),
mon_event_read() switched to smp_call_on_cpu() in most cases to read
event counters using a kernel worker thread. Unlike
smp_call_function_any(), which optimizes to a local function call when
the calling CPU is in the target cpumask, smp_call_on_cpu() queues the
work unconditionally.
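For illustration, a simplified sketch (an approximation, not the actual
kernel code) of the local fast path that smp_call_function_any() provides
and smp_call_on_cpu() lacks:

  #include <linux/cpumask.h>
  #include <linux/smp.h>

  /*
   * Approximation only: when the calling CPU is in @mask, the function is
   * invoked directly with IRQs disabled; no IPI and no worker thread.
   */
  static int call_any_sketch(const struct cpumask *mask, smp_call_func_t func,
                             void *info, int wait)
  {
          int ret = 0, cpu = get_cpu();   /* disable preemption */

          if (cpumask_test_cpu(cpu, mask)) {
                  local_irq_disable();
                  func(info);             /* direct local call */
                  local_irq_enable();
          } else {
                  /* Cross-CPU case: IPI one CPU in the mask, optionally wait. */
                  ret = smp_call_function_single(cpumask_any(mask), func,
                                                 info, wait);
          }
          put_cpu();
          return ret;
  }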
Introduce resctrl_arch_event_read_blocks() to allow the implementation
to indicate whether reading a particular event counter blocks. Use this
to limit the usage of smp_call_on_cpu() to only the counters where it is
actually needed. This reverts to the previous behavior of always using
smp_call_function_any() for all x86 implementations.
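A non-x86 resctrl_arch_rmid_read() that can sleep would instead return true
for the affected events. A hypothetical blocking implementation (not part of
this patch) might look like:

  static inline bool resctrl_arch_event_read_blocks(struct rdt_resource *r,
                                                    int evtid)
  {
          /* Hypothetical: reading the MBM counters requires sleeping. */
          return evtid == QOS_L3_MBM_TOTAL_EVENT_ID ||
                 evtid == QOS_L3_MBM_LOCAL_EVENT_ID;
  }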
This is significant when supporting configurations such as a dual-socket
AMD Zen2, with 32 L3 monitoring domains and 256 RMIDs. To read both MBM
counters for all groups on all domains requires 32768 (32*256*2) counter
reads. The resolution of global, per-group MBM data that can be
provided is therefore sensitive to the cost of each counter read.
Furthermore, redirecting this much work to IPI handlers or worker
threads at a regular interval is disruptive to the present workload.
The test program fastcat, which was introduced in an earlier patch, was
used to simulate the impact of this change on an optimized event
counter-reading procedure. The goal is to maximize the frequency at
which MBM counters can be dumped, so the benchmark determines the cost
of an additional global MBM counter sample.
The total number of cycles needed to read all local and total MBM
counters for a large number of monitoring groups was collected using the
perf tool. The average over 100 iterations is given, with a 1-second
sleep between iterations to better represent the intended use case. The
test was run bound to the CPUs of a single MBM domain, once targeting
counters in the local domain and again for counters in a remote domain.
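As a rough illustration (not the actual fastcat program; the group names and
domain id below are examples), the timed operation amounts to dumping both
MBM counters of every group for one domain:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Read one mon_data file; the value itself is not needed here. */
  static void read_counter(const char *path)
  {
          char buf[64];
          int fd = open(path, O_RDONLY);

          if (fd >= 0) {
                  if (read(fd, buf, sizeof(buf)) < 0)
                          perror(path);
                  close(fd);
          }
  }

  int main(void)
  {
          char path[256];
          int grp;

          /* One mbm_local_bytes + one mbm_total_bytes read per group. */
          for (grp = 0; grp < 250; grp++) {
                  snprintf(path, sizeof(path),
                           "/sys/fs/resctrl/mon_groups/g%d/mon_data/mon_L3_00/mbm_local_bytes",
                           grp);
                  read_counter(path);
                  snprintf(path, sizeof(path),
                           "/sys/fs/resctrl/mon_groups/g%d/mon_data/mon_L3_00/mbm_total_bytes",
                           grp);
                  read_counter(path);
          }
          return 0;
  }

The cycles spent in this loop were counted with perf; targeting the mon_L3_*
directories of the domain the test is bound to gives the Local Domain case,
and targeting another domain's gives the Remote Domain case.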
AMD EPYC 7B12 64-Core Processor (250 mon groups)
Local Domain: 5.72M -> 1.22M (-78.7%)
Remote Domain: 5.89M -> 5.20M (-11.8%)
Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz (220 mon groups)
Local Domain: 3.37M -> 2.52M (-25.4%)
Remote Domain: 5.16M -> 5.79M (+12.0%)
The slowdown for remote domain reads on Intel is worrying, but since
this change is effectively a revert to old behavior on x86, this
shouldn't be anything new.
Also note that the Remote Domain results and the baseline Local Domain
results only measure cycles in the test program. Because all counter
reading work was carried out in kernel worker threads or IPI handlers,
the total system cost of the operation is greater.
Fixes: 09909e098113 ("x86/resctrl: Queue mon_event_read() instead of sending an IPI")
Signed-off-by: Peter Newman <peternewman@xxxxxxxxxx>
---
v1: https://lore.kernel.org/lkml/20241031142553.3963058-2-peternewman@xxxxxxxxxx/
---
arch/x86/include/asm/resctrl.h | 7 +++++++
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 8 +++++++-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 8b1b6ce1e51b2..8696c0c0e1df4 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -178,6 +178,13 @@ static inline void *resctrl_arch_mon_ctx_alloc(struct rdt_resource *r, int evtid
static inline void resctrl_arch_mon_ctx_free(struct rdt_resource *r, int evtid,
void *ctx) { };
+static inline bool resctrl_arch_event_read_blocks(struct rdt_resource *r,
+ int evtid)
+{
+ /* all events can be read without blocking */
+ return false;
+}
+
void resctrl_cpu_detect(struct cpuinfo_x86 *c);
#else
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 200d89a640270..395bcc5362f4e 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -548,8 +548,14 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
* are all the CPUs nohz_full? If yes, pick a CPU to IPI.
* MPAM's resctrl_arch_rmid_read() is unable to read the
* counters on some platforms if its called in IRQ context.
+ *
+ * smp_call_on_cpu() dispatches to a kernel worker
+ * unconditionally, even when the event can be read much more
+ * efficiently on the current CPU, so only use it when
+ * blocking is required.
*/
- if (tick_nohz_full_cpu(cpu))
+ if (tick_nohz_full_cpu(cpu) ||
+ !resctrl_arch_event_read_blocks(r, evtid))
smp_call_function_any(cpumask, mon_event_count, rr, 1);
else
smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
--
2.47.0.199.ga7371fff76-goog