Re: [RFC PATCH 2/2] resctrl2: Arch x86 modules for most of the legacy control/monitor functions

From: Tony Luck
Date: Mon Jul 10 2023 - 19:36:01 EST


On Thu, Jul 06, 2023 at 12:22:03PM +0200, Peter Newman wrote:
> Hi Tony,
>
> On Wed, Jul 5, 2023 at 6:46 AM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
> > The mbm_poll() code that makes sure that counters don't wrap is
> > doing all the expensive wrmsr(QM_EVTSEL);rdmsr(QM_COUNT)
> > once per second to give you the data you want.
>
> I was doing that in the soft RMID series I posted earlier because it
> simplified things, but then I had some realizations about how much
> error +/- 1 second on the sampling point could result in[1]. We
> usually measure the bandwidth rate with a 5-second window, so a
> reading that's up to one second old would mean a 20% error in the
> bandwidth calculation.

I just pushed the latest version of the resctrl2 patches to the
resctrl2_v65rc1 branch of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git

As well as locking, bug fixes, and general updates, it includes an
experimental feature that provides summary MBM information for each
node, with both "total" and "local" rates (example below). Note
that you have to load the rdt_mbm_local_bytes and
rdt_mbm_total_bytes modules so that the MBM overflow threads are
running. I should fix the code to print "n/a" instead of
"0" if they are not.

$ cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_summary
3638 3638 /g2
3087 3087 /g2/m2
3267 3267 /g2/m1
3443 3443 /g1
3629 3629 /g1/m2
3588 3587 /g1/m1
3999 3993 /
3370 3369 /m2
3432 3432 /m1
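
A monitoring tool could consume these lines with something like the
sketch below. It assumes each line is "<total> <local> <group path>"
in that column order (the output above does not label the columns),
and both the struct and the function name are hypothetical, not part
of the patches.

```c
#include <stdio.h>

/* Hypothetical record for one line of mbm_summary. */
struct mbm_summary_entry {
	unsigned long long total;	/* assumed: first column is the "total" rate */
	unsigned long long local;	/* assumed: second column is the "local" rate */
	char group[256];		/* group path, e.g. "/g1/m1" */
};

/* Parse one line; returns 0 on success, -1 if the line does not match. */
int parse_mbm_summary_line(const char *line, struct mbm_summary_entry *e)
{
	int n = sscanf(line, "%llu %llu %255s", &e->total, &e->local, e->group);

	return n == 3 ? 0 : -1;
}
```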

The rates are produced once per second by the MBM overflow
code, which computes MBytes/sec as "chunks since last poll"
divided by (now - then). I'm using jiffies for the times,
which may be good enough. "now - then" is nominally one second
(maybe more if the kernel thread doing the MBM polling is delayed
from running).
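
As a rough illustration of that arithmetic (not the code in the
branch), the sketch below assumes the chunk delta has already been
converted to bytes, that one MByte is 2^20 bytes, and that the
elapsed time is given in ticks with a known tick rate. The function
name and parameters are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bandwidth in MBytes/sec from a byte delta and the
 * elapsed time in ticks. Dividing by the measured interval, rather than
 * assuming exactly one second, keeps the rate right when the poll is late.
 */
uint64_t mbm_rate_mbytes(uint64_t delta_bytes, unsigned long elapsed_ticks,
			 unsigned long ticks_per_sec)
{
	if (elapsed_ticks == 0)
		return 0;	/* no time has passed, no rate to report */

	/* bytes/sec first, then scale down to MBytes (2^20 bytes, assumed) */
	return delta_bytes * ticks_per_sec / elapsed_ticks / (1024 * 1024);
}
```

With a tick rate of 1000, a poll that runs 250 ticks late divides the
same byte delta by 1250 ticks instead of 1000, so the reported rate
drops accordingly instead of being overstated.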

I should fix the summarization code to work the same as the
regular MBM files (i.e. make the parent control directory
report the sum of all its children).

The code also attempts (but fails) to make these mbm_summary
files poll(2)-able, with the wakeup dependent on the aggregate
measured bandwidth compared against a configurable threshold:

$ cat /sys/fs/resctrl/info/L3_MON/mbm_poll_threshold
10000000

There's something wrong though. poll(2) always says there is
data to be read. I only see one other piece of kernel code
implementing poll on kernfs (in the cgroup code). Perhaps
my problem is an inability to write an application that uses
poll(2) correctly.
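
One thing worth checking on the application side (an assumption, not
verified against this branch): with the default kernfs poll handler a
file always reports the ordinary readable/writable bits, and a
kernfs_notify() event shows up as POLLPRI|POLLERR. A test program
would then need to read the file once and wait only for POLLPRI. A
minimal userspace sketch, where wait_for_update() is a hypothetical
helper name:

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait for a notification on "path", then re-read the file into buf.
 * Returns the number of bytes read, 0 on timeout, -1 on error.
 * (An empty file after a wakeup also yields 0; a real tool would
 * distinguish the two cases.)
 */
ssize_t wait_for_update(const char *path, int timeout_ms, char *buf, size_t len)
{
	struct pollfd pfd;
	ssize_t n = -1;
	int ret;

	if (len == 0)
		return -1;

	pfd.fd = open(path, O_RDONLY);
	if (pfd.fd < 0)
		return -1;

	/* Read first so that only events after this point wake us up. */
	if (read(pfd.fd, buf, len - 1) < 0)
		goto out;

	/* Ask for POLLPRI only; POLLIN is expected to be set all the time. */
	pfd.events = POLLPRI;
	pfd.revents = 0;
	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		goto out;
	if (ret == 0) {
		n = 0;		/* timed out without a notification */
		goto out;
	}

	/* Rewind and fetch the updated contents. */
	if (lseek(pfd.fd, 0, SEEK_SET) < 0)
		goto out;
	n = read(pfd.fd, buf, len - 1);
	if (n >= 0)
		buf[n] = '\0';
out:
	close(pfd.fd);
	return n;
}
```

If the file still wakes up immediately with only POLLPRI requested,
the problem is more likely on the kernel side than in the test
application.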

Let me know if this all seems like a useful direction. Maybe
the polling part is overkill and it is sufficient to just
have a cheap way to get all the bandwidths even if the values
seen might be up to one second old.

-Tony