Re: [PATCH v2 14/23] x86/resctrl: Calculate bandwidth from the previous __mon_event_count() chunks

From: Reinette Chatre
Date: Wed Oct 27 2021 - 16:41:58 EST


Hi James,

On 10/27/2021 9:50 AM, James Morse wrote:
On 15/10/2021 23:28, Reinette Chatre wrote:
On 10/1/2021 9:02 AM, James Morse wrote:
mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
second. It reads the hardware register, calculates the bandwidth and
updates m->prev_bw_msr which is used to hold the previous hardware register
value.

Operating directly on hardware register values makes it difficult to make
this code architecture independent, so that it can be moved to /fs/,
making the mba_sc feature something resctrl supports with no additional
support from the architecture.
Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
register using __mon_event_count().

Looking back I think 06c5fe9b12dd ("x86/resctrl: Fix incorrect local bandwidth when mba_sc
is enabled") may explain how the code ended up the way it is.

Change mbm_bw_count() to use the current chunks value from
__mon_event_count() to calculate bandwidth. This means it no longer
operates on hardware register values.

ok ... so could the patch just do this as it is stated here? The way it is implemented is
very complicated and hard (for me) to verify the correctness (more below).

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 6c8226987dd6..a1232462db14 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c

  static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
  {
      struct rdt_hw_resource *hw_res = resctrl_to_arch_res(rr->r);
      struct mbm_state *m = &rr->d->mbm_local[rmid];
-    u64 tval, cur_bw, chunks;
+    u64 cur_bw, chunks, cur_chunks;
  -    tval = __rmid_read(rmid, rr->evtid);
-    if (tval & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-        return;
+    cur_chunks = rr->val;
+    chunks = cur_chunks - m->prev_bw_chunks;
+    m->prev_bw_chunks = cur_chunks;
  -    chunks = mbm_overflow_count(m->prev_bw_msr, tval, hw_res->mbm_width);
-    cur_bw = (get_corrected_mbm_count(rmid, chunks) * hw_res->mon_scale) >> 20;
+    cur_bw = (chunks * hw_res->mon_scale) >> 20;

I find this quite confusing. What if a new m->prev_chunks is introduced instead and
initialized in __mon_event_count() to the value of chunks, and then here in mbm_bw_count
it could just retrieve it (chunks = m->prev_chunks).

(I agree the diff is noisy, it may be easier as a new function as this is a replacement
not a transform of the existing function)

__mon_event_count() can also be triggered by user-space reading the file, so any of its
'prev' values should be ignored, as they aren't updated on the 1-second timer needed to
get this in MB/s.

The resource group's mutex is taken before __mon_event_count() is called via user-space or via the overflow handler so I think that mbm_bw_count() can assume that the prev values are from the __mon_event_count() called just before it.

__mon_event_count()'s chunks values hasn't been through get_corrected_mbm_count(), so it
would need to be rr->val, which is what this code starts with for the "number of chunks
ever read by this counter".

The value could be corrected in mbm_bw_count(), no?



The variable 'chunks' has been used too much here, so its lost its meaning. How about:
| current_chunk_count = rr->val;
| delta_counter = current_chunk_count - m->prev_chunk_count;
| cur_bw = (delta_counter * hw_res->mon_scale) >> 20;
|
| m->prev_chunk_count = current_chunk_count;


The 'delta_counter' step was previously hidden in mbm_overflow_count(), which also had to
do with overflow of the hardware counter. Because a previously sanitised value is being
used, the hardware counters resolution doesn't need to be considered.
(which helps make mba_sc architecture independent)

This is the part that is not obvious to me: is the difference between the two individually sanitized measurements the same as sanitizing the difference between the two measurements?

Reinette