Re: [PATCH] x86/resctrl: Fix event counts regression in reused RMIDs

From: Reinette Chatre
Date: Wed Dec 14 2022 - 14:17:59 EST


Hi Peter,

On 12/14/2022 6:21 AM, Peter Newman wrote:
> On Thu, Dec 8, 2022 at 7:31 PM Reinette Chatre
> <reinette.chatre@xxxxxxxxx> wrote:
>>
>> I think this can be cleaned up to make the code more clear. Notice the
>> duplication of following snippet in __mon_event_count():
>> rr->val += tval;
>> return 0;
>>
>> I do not see any need to check the event id before doing the above. That
>> leaves the bulk of the switch just needed for the rr->first handling that
>> can be moved to resctrl_arch_reset_rmid().
>>
>> Something like:
>>
>> void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_domain *d, ...
>> {
>> 	...
>> 	struct arch_mbm_state *am;
>> 	struct mbm_state *m;
>> 	u64 val = 0;
>> 	int ret;
>>
>> 	m = get_mbm_state(d, rmid, eventid); /* get_mbm_state() to be created */
>
> Good call. When prototyping another change, I quickly found the need to
> create this myself.
>
>> 	if (m)
>> 		memset(m, 0, sizeof(*m));
>
> mbm_state is arch-independent, so I think putting it here would require
> the MPAM version to copy this and for get_mbm_state() to be exported.

You are correct, it is arch-independent ... so every arch is expected to
have it.
I peeked at your series and that looks good too. Having the cleanup done in
a central place helps to avoid future mistakes.

>> 	am = get_arch_mbm_state(hw_dom, rmid, eventid);
>> 	if (am) {
>> 		memset(am, 0, sizeof(*am));
>> 		/* Record any initial, non-zero count value. */
>> 		ret = __rmid_read(rmid, eventid, &val);
>> 		if (!ret)
>> 			am->prev_msr = val;
>> 	}
>>
>> }
>>
>> Having this would be helpful as reference to Babu's usage.
>
> His usage looks a little different.
>
> According to the comment in Babu's patch:
>
> https://lore.kernel.org/lkml/166990903030.17806.5106229901730558377.stgit@bmoger-ubuntu/
>
> + /*
> + * When an Event Configuration is changed, the bandwidth counters
> + * for all RMIDs and Events will be cleared by the hardware. The
> + * hardware also sets MSR_IA32_QM_CTR.Unavailable (bit 62) for
> + * every RMID on the next read to any event for every RMID.
> + * Subsequent reads will have MSR_IA32_QM_CTR.Unavailable (bit 62)
> + * cleared while it is tracked by the hardware. Clear the
> + * mbm_local and mbm_total counts for all the RMIDs.
> + */
> + resctrl_arch_reset_rmid_all(r, d);
>
> If all the hardware counters are zeroed as the comment suggests, then
> leaving am->prev_msr zero seems correct. __rmid_read() would likely
> return an error anyways. The bug I was addressing was one of reusing
> an RMID which had not been reset.

You are correct, but there are two things to keep in mind:
* the change from which you copied the above snippet introduces a new
_generic_ utility far away from this call site. It is thus reasonable to
assume that this utility should work for all use cases, not just the one
for which it is created. Since there are no other use cases at this time,
this may be ok, but I think at minimum the utility would benefit from
a comment describing the caveats of its use as a heads-up to any future users.
* the utility does not clear struct mbm_state contents. Again, this is ok
for this usage since AMD does not support the software controller but
as far as a generic utility goes the usage should be clear to avoid
traps for future changes.
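
To make the two points above concrete, here is a minimal userspace sketch of
a reset helper that clears both the arch-independent and the arch-specific
state and then records the current counter value as the new baseline. It is
only an illustration: the struct layouts, NUM_RMIDS, the enum values,
read_counter(), and the fake hw_count/hw_unavailable fields are all
assumptions made for this example and are not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_RMIDS 4	/* hypothetical, kept small for illustration */

/* Hypothetical event ids, not the kernel's enum. */
enum mon_event { EVT_LLC_OCCUPANCY = 1, EVT_MBM_TOTAL, EVT_MBM_LOCAL };

/* Simplified stand-ins for the arch-independent and arch-specific state. */
struct mbm_state { uint64_t prev_bw_bytes; uint32_t prev_bw; uint32_t delta_bw; };
struct arch_mbm_state { uint64_t chunks; uint64_t prev_msr; };

struct domain {
	struct mbm_state mbm_total[NUM_RMIDS];
	struct mbm_state mbm_local[NUM_RMIDS];
	struct arch_mbm_state arch_total[NUM_RMIDS];
	struct arch_mbm_state arch_local[NUM_RMIDS];
	uint64_t hw_count[NUM_RMIDS];	/* fake hardware counter per RMID */
	int hw_unavailable;		/* fake "Unavailable" condition */
};

/* Return the arch-independent state for an MBM event, else NULL. */
static struct mbm_state *get_mbm_state(struct domain *d, uint32_t rmid,
				       enum mon_event evt)
{
	if (rmid >= NUM_RMIDS)
		return NULL;
	switch (evt) {
	case EVT_MBM_TOTAL: return &d->mbm_total[rmid];
	case EVT_MBM_LOCAL: return &d->mbm_local[rmid];
	default:            return NULL;
	}
}

/* Return the arch-specific state for an MBM event, else NULL. */
static struct arch_mbm_state *get_arch_mbm_state(struct domain *d,
						 uint32_t rmid,
						 enum mon_event evt)
{
	if (rmid >= NUM_RMIDS)
		return NULL;
	switch (evt) {
	case EVT_MBM_TOTAL: return &d->arch_total[rmid];
	case EVT_MBM_LOCAL: return &d->arch_local[rmid];
	default:            return NULL;
	}
}

/* Stand-in for a counter read; fails while the counter is unavailable. */
static int read_counter(const struct domain *d, uint32_t rmid, uint64_t *val)
{
	if (d->hw_unavailable)
		return -1;
	*val = d->hw_count[rmid];
	return 0;
}

/* Clear both kinds of state; keep any non-zero count as the baseline. */
static void reset_rmid(struct domain *d, uint32_t rmid, enum mon_event evt)
{
	struct mbm_state *m = get_mbm_state(d, rmid, evt);
	struct arch_mbm_state *am = get_arch_mbm_state(d, rmid, evt);
	uint64_t val = 0;

	if (m)
		memset(m, 0, sizeof(*m));
	if (am) {
		memset(am, 0, sizeof(*am));
		if (!read_counter(d, rmid, &val))
			am->prev_msr = val;
	}
}
```

In this sketch a failed read simply leaves prev_msr at zero, which matches
the case where the hardware has already cleared the counters.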

Reinette