[PATCH 2/2] x86/resctrl: Don't workqueue local event counter reads
From: Peter Newman
Date: Thu Oct 31 2024 - 10:26:35 EST
Performance-conscious users may use threads bound to CPUs within a
specific monitoring domain to ensure that all bandwidth counters can be
read efficiently. The hardware counters are only accessible to CPUs
within the domain, so requests from CPUs outside the domain are
forwarded to a kernel worker or IPI handler, incurring a substantial
performance penalty on each read. Recently, this penalty was observed
to be paid by local reads as well.
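For illustration only (not part of this series), a minimal user-space sketch
of that usage pattern: bind the reading thread to a CPU assumed to belong to
the target L3 monitoring domain, then read that domain's counter file
directly. The CPU number and mon_data path below follow the standard resctrl
layout but are otherwise assumptions.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;
	char buf[64];
	FILE *f;

	/* Assumption: CPU 0 belongs to the L3 monitoring domain of interest. */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		return 1;

	/* Read the default group's total MBM counter for domain 0. */
	f = fopen("/sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes", "r");
	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("mbm_total_bytes: %s", buf);
	fclose(f);
	return 0;
}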
To support blocking implementations of resctrl_arch_rmid_read(),
mon_event_read() switched to smp_call_on_cpu() in most cases to read
event counters using a kernel worker thread. Unlike
smp_call_function_any(), which optimizes to a local function call when
the calling CPU is in the target cpumask, smp_call_on_cpu() queues the
work unconditionally.
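For reference, a simplified sketch of that difference (this is not the actual
kernel/smp.c or resctrl code; read_counter_on_domain() is a made-up helper):
run the callback directly when the calling CPU is already in the target mask,
and only fall back to queueing work on another CPU otherwise.

#include <linux/cpumask.h>
#include <linux/smp.h>

static int read_counter_on_domain(const struct cpumask *mask,
				  int (*func)(void *), void *info)
{
	int ret;
	int cpu = get_cpu();	/* disable preemption while checking */

	if (cpumask_test_cpu(cpu, mask)) {
		ret = func(info);	/* local call: no IPI, no worker */
		put_cpu();
		return ret;
	}
	put_cpu();

	/* Slow path: queue the read on a worker thread in the domain. */
	return smp_call_on_cpu(cpumask_any(mask), func, info, false);
}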
Add a fast path so that requests issued from a CPU within the target
monitoring domain read the counters with a direct call into
mon_event_count(), regardless of whether all CPUs in the domain are nohz_full.
This is significant when supporting configurations such as a dual-socket
AMD Zen2 system with 32 L3 monitoring domains and 256 RMIDs. Reading both MBM
counters for all groups on all domains requires 16384 (32*256*2) counter
reads. The resolution at which global, per-group MBM data can be provided is
therefore sensitive to the cost of each counter read.
Furthermore, redirecting this much work to IPI handlers or worker
threads at a regular interval is disruptive to the workload running at the time.
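To make the scale concrete, a hypothetical user-space sketch of what one
global MBM sample involves: walking each group's mon_data directory and
reading both MBM event files for every L3 domain. The glob pattern assumes
the standard resctrl layout; nested monitoring groups and the default group's
own mon_data directory would be walked the same way.

#include <glob.h>
#include <stdio.h>

int main(void)
{
	glob_t g;
	char buf[64];
	size_t i;

	if (glob("/sys/fs/resctrl/*/mon_data/mon_L3_*/mbm_*_bytes", 0, NULL, &g))
		return 1;

	/* One read per domain per group per event: domains * groups * 2. */
	for (i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "r");

		if (f && fgets(buf, sizeof(buf), f))
			printf("%s: %s", g.gl_pathv[i], buf);
		if (f)
			fclose(f);
	}
	globfree(&g);
	return 0;
}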
The test program fastcat, which was introduced in an earlier patch, was
used to simulate the impact of this change on an optimized event
counter-reading procedure. The goal is to maximize the frequency at
which MBM counters can be dumped, so the benchmark determines the cost
of an additional global MBM counter sample.
The total number of cycles needed to read all local and total MBM
counters for a large number of monitoring groups was collected using the
perf tool. The test was run bound to a single CPU: once targeting
counters in the local domain and again for counters in a remote domain.
The cost of a dry-run reading no counters was subtracted from the total
of each run to remove one-time setup costs.
AMD EPYC 7B12 64-Core Processor (250 mon groups)
Local Domain: 3.25M -> 1.22M (-62.5%)
Remote Domain: 7.91M -> 8.05M (+2.9%)
Intel(R) Xeon(R) Gold 6268CL CPU @ 2.80GHz (190 mon groups)
Local Domain: 2.98M -> 2.21M (-25.8%)
Remote Domain: 4.49M -> 4.62M (+3.1%)
Note that there is a small increase in overhead for remote domains,
which results from the introduction of a put_cpu() call to reenable
preemption after determining whether the fast path can be used. Users
sensitive to this cost should consider avoiding the remote counter read
penalty entirely by binding their reader threads to CPUs within the target domain.
Also note that the Remote Domain results and the baseline Local Domain
results only measure cycles in the test program. Because all counter
reading work was carried out in kernel worker threads, the total system
cost of the operation is greater.
Fixes: 09909e098113 ("x86/resctrl: Queue mon_event_read() instead of sending an IPI")
Signed-off-by: Peter Newman <peternewman@xxxxxxxxxx>
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 28 +++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 200d89a640270..daaff1cfd3f24 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -541,6 +541,31 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 		return;
 	}
 
+	/*
+	 * If a performance-conscious caller has gone to the trouble of binding
+	 * their thread to the monitoring domain of the event counter, ensure
+	 * that the counters are read directly. smp_call_on_cpu()
+	 * unconditionally uses a work queue to read the counter, substantially
+	 * increasing the cost of the read.
+	 *
+	 * Preemption must be disabled to prevent a migration out of the domain
+	 * after the CPU is checked, which would result in reading the wrong
+	 * counters. Note that this makes the (slow) remote path a little slower
+	 * by requiring preemption to be reenabled when redirecting the request
+	 * to another domain was in fact necessary.
+	 *
+	 * In the case where all eligible target CPUs are nohz_full and
+	 * smp_call_function_any() is used, keep preemption disabled to avoid
+	 * the cost of reenabling it twice in the same read.
+	 */
+	cpu = get_cpu();
+	if (cpumask_test_cpu(cpu, cpumask)) {
+		mon_event_count(rr);
+		resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
+		put_cpu();
+		return;
+	}
+
 	cpu = cpumask_any_housekeeping(cpumask, RESCTRL_PICK_ANY_CPU);
 
 	/*
@@ -554,6 +579,9 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
 	else
 		smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
 
+	/* If smp_call_function_any() was used, preemption is reenabled here. */
+	put_cpu();
+
 	resctrl_arch_mon_ctx_free(r, evtid, rr->arch_mon_ctx);
 }
--
2.47.0.199.ga7371fff76-goog