[PATCH 2/2] x86/resctrl: Add tracepoint for llc_occupancy tracking

From: Haifeng Xu
Date: Thu Feb 29 2024 - 02:12:20 EST


In our production environment, after monitor groups are removed, their
unused RMIDs stay on the limbo list forever because their llc_occupancy is
always larger than the threshold. The unused RMIDs can, however, be freed
by raising the threshold.

To determine what the threshold should be, perf can be used to read the
llc_occupancy of RMIDs in each rdt domain.

Instead of using the perf tool to track llc_occupancy and filtering the log
manually, it is more convenient for users to use a tracepoint. So add a new
tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo
list.
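
As an illustration of how the new event could be consumed, here is a minimal
user-space sketch that parses lines in the format emitted by the TP_printk()
of mon_llc_occupancy_limbo and reports the largest llc_occupancy seen. The
struct and the helper names (limbo_event, parse_limbo_line,
max_limbo_occupancy) are hypothetical and not part of this patch, and the
sketch assumes that each trace line contains the event fields exactly as
printed by the tracepoint, possibly preceded by other text.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One parsed mon_llc_occupancy_limbo event; fields follow TP_printk(). */
struct limbo_event {
	uint32_t ctrl_hw_id;
	uint32_t mon_hw_id;
	int domain;
	uint64_t llc_occupancy;
};

/*
 * Parse one trace line (hypothetical helper, not part of the patch).
 * Any text before "ctrl_hw_id=" is skipped. Returns 1 if all four
 * fields were read, 0 otherwise.
 */
int parse_limbo_line(const char *line, struct limbo_event *ev)
{
	const char *p = strstr(line, "ctrl_hw_id=");

	if (!p)
		return 0;

	return sscanf(p, "ctrl_hw_id=%" SCNu32 " mon_hw_id=%" SCNu32
		      " domain=%d llc_occupancy=%" SCNu64,
		      &ev->ctrl_hw_id, &ev->mon_hw_id,
		      &ev->domain, &ev->llc_occupancy) == 4;
}

/*
 * Return the largest llc_occupancy among the lines that parse,
 * or 0 if no line contains a complete event.
 */
uint64_t max_limbo_occupancy(const char *const *lines, size_t nlines)
{
	struct limbo_event ev;
	uint64_t max = 0;
	size_t i;

	for (i = 0; i < nlines; i++) {
		if (parse_limbo_line(lines[i], &ev) && ev.llc_occupancy > max)
			max = ev.llc_occupancy;
	}

	return max;
}
```

A threshold above the value returned by max_limbo_occupancy() would let
every RMID observed in the parsed lines leave the limbo list, which is the
number that previously had to be extracted from the perf log by hand.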

Signed-off-by: Haifeng Xu <haifeng.xu@xxxxxxxxxx>
Suggested-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 8 ++++++++
arch/x86/kernel/cpu/resctrl/trace.h | 15 +++++++++++++++
2 files changed, 23 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c34a35ec0f03..ada392ca75b2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -24,6 +24,7 @@
#include <asm/resctrl.h>

#include "internal.h"
+#include "trace.h"

/**
* struct rmid_entry - dirty tracking for all RMID.
@@ -362,6 +363,13 @@ void __check_limbo(struct rdt_domain *d, bool force_free)
limbo_release_entry(entry);
}
cur_idx = idx + 1;
+
+ /* x86's CLOSID and RMID are independent numbers, so the entry's
+  * closid is an invalid CLOSID. But on arm64, the RMID value isn't
+  * a unique number for each CLOSID. It's necessary to track both
+  * CLOSID and RMID because they may depend on each other on some
+  * architectures */
+ trace_mon_llc_occupancy_limbo(entry->closid, entry->rmid, d->id, val);
}

resctrl_arch_mon_ctx_free(r, QOS_L3_OCCUP_EVENT_ID, arch_mon_ctx);
diff --git a/arch/x86/kernel/cpu/resctrl/trace.h b/arch/x86/kernel/cpu/resctrl/trace.h
index 495fb90c8572..35149a75c951 100644
--- a/arch/x86/kernel/cpu/resctrl/trace.h
+++ b/arch/x86/kernel/cpu/resctrl/trace.h
@@ -35,6 +35,21 @@ TRACE_EVENT(pseudo_lock_l3,
TP_printk("hits=%llu miss=%llu",
__entry->l3_hits, __entry->l3_miss));

+TRACE_EVENT(mon_llc_occupancy_limbo,
+ TP_PROTO(u32 ctrl_hw_id, u32 mon_hw_id, int id, u64 occupancy),
+ TP_ARGS(ctrl_hw_id, mon_hw_id, id, occupancy),
+ TP_STRUCT__entry(__field(u32, ctrl_hw_id)
+ __field(u32, mon_hw_id)
+ __field(int, id)
+ __field(u64, occupancy)),
+ TP_fast_assign(__entry->ctrl_hw_id = ctrl_hw_id;
+ __entry->mon_hw_id = mon_hw_id;
+ __entry->id = id;
+ __entry->occupancy = occupancy;),
+ TP_printk("ctrl_hw_id=%u mon_hw_id=%u domain=%d llc_occupancy=%llu",
+ __entry->ctrl_hw_id, __entry->mon_hw_id, __entry->id, __entry->occupancy)
+ );
+
#endif /* _TRACE_RESCTRL_H */

#undef TRACE_INCLUDE_PATH
--
2.25.1