Re: [PATCH v5 31/41] arm_mpam: resctrl: Add resctrl_arch_rmid_read() and resctrl_arch_reset_rmid()

From: Ben Horgan

Date: Mon Mar 09 2026 - 12:39:29 EST


Hi Zeng,

On 3/7/26 09:29, Zeng Heng wrote:
> Hi Ben,
>
> On 2026/2/25 1:57, Ben Horgan wrote:
>> From: James Morse <james.morse@xxxxxxx>
>>
>> resctrl uses resctrl_arch_rmid_read() to read counters. CDP emulation
>> means
>> the counter may need reading in three different ways. The same goes for
>> reset.
>>
>> The helpers behind the resctrl_arch_ functions will be re-used for the
>> ABMC
>> equivalent functions.
>>
>> Add the rounding helper for checking monitor values while we're here.
>>
>> Tested-by: Gavin Shan <gshan@xxxxxxxxxx>
>> Tested-by: Shaopeng Tan <tan.shaopeng@xxxxxxxxxxxxxx>
>> Tested-by: Peter Newman <peternewman@xxxxxxxxxx>
>> Tested-by: Zeng Heng <zengheng4@xxxxxxxxxx>
>> Reviewed-by: Shaopeng Tan <tan.shaopeng@xxxxxxxxxxxxxx>
>> Reviewed-by: Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>
>> Signed-off-by: James Morse <james.morse@xxxxxxx>
>> Signed-off-by: Ben Horgan <ben.horgan@xxxxxxx>
>> ---
>
> [...]
>
>> +
>> +static int read_mon_cdp_safe(struct mpam_resctrl_mon *mon, struct
>> mpam_component *mon_comp,
>> +                 enum mpam_device_features mon_type,
>> +                 int mon_idx, u32 closid, u32 rmid, u64 *val)
>> +{
>> +    if (cdp_enabled) {
>
> While reviewing the resctrl limbo handling code, I noticed a issue in
> __check_limbo() that could lead to premature RMID release when CDP is
> enabled.
>
> In __check_limbo(), RMIDs in limbo state undergo L3 occupancy checks
> before being released. This check is performed via
> resctrl_arch_rmid_read(), on arm64 MPAM, which relies on the cdp_enabled
> state to determine to check which PARTID.
>
> The concern arises in the following scenario: Filesystem is mounted with
> CDP enabled. During normal operation, some RMIDs enter limbo. On umount,
> cdp_enabled is reset to false. __check_limbo() may then run and perform
> L3 checks with cdp_enabled = false. This could cause RMIDs to be
> incorrectly released from limbo while still effectively busy after
> remount.

I think a stale limbo list cause more problems than that. If you mount
with cdp disabled, cause some rmids to be dirty, unmount and then
remount with cdp enabled then you may have some of the entries in upper
half marked as busy but when the limbo code checks them it ends up using
an out of range partid and may trigger an mpam error interrupt.

To avoid a stale list we could disable the limbo checking at unmount and
at remount remake the bitmap. This would involve some resctrl changes
which I will have a further look into. For now, to avoid the dependency
without a lot of patch churn in this series I think we can hide the cdp
enablement behind CONFIG_EXPERT. Does that sound ok to you?

Thanks,

Ben

>
> Apologies for not providing a ready-made fix in this email. However,
> I would appreciate to hear the community's thoughts on this issue.
>
>
>> +        u64 code_val = 0, data_val = 0;
>> +        int err;
>> +
>> +        err = __read_mon(mon, mon_comp, mon_type, mon_idx,
>> +                 CDP_CODE, closid, rmid, &code_val);
>> +        if (err)
>> +            return err;
>> +
>> +        err = __read_mon(mon, mon_comp, mon_type, mon_idx,
>> +                 CDP_DATA, closid, rmid, &data_val);
>> +        if (err)
>> +            return err;
>> +
>> +        *val += code_val + data_val;
>> +        return 0;
>> +    }
>> +
>> +    return __read_mon(mon, mon_comp, mon_type, mon_idx,
>> +              CDP_NONE, closid, rmid, val);
>> +}
>> +
>> +/* MBWU when not in ABMC mode, and CSU counters. */
>> +int resctrl_arch_rmid_read(struct rdt_resource *r, struct
>> rdt_domain_hdr *hdr,
>> +               u32 closid, u32 rmid, enum resctrl_event_id eventid,
>> +               void *arch_priv, u64 *val, void *arch_mon_ctx)
>> +{
>> +    struct mpam_resctrl_dom *l3_dom;
>> +    struct mpam_component *mon_comp;
>> +    u32 mon_idx = *(u32 *)arch_mon_ctx;
>> +    enum mpam_device_features mon_type;
>> +    struct mpam_resctrl_mon *mon = &mpam_resctrl_counters[eventid];
>> +
>> +    resctrl_arch_rmid_read_context_check();
>> +
>> +    if (eventid >= QOS_NUM_EVENTS || !mon->class)
>> +        return -EINVAL;
>> +
>> +    l3_dom = container_of(hdr, struct mpam_resctrl_dom,
>> resctrl_mon_dom.hdr);
>> +    mon_comp = l3_dom->mon_comp[eventid];
>> +
>> +    switch (eventid) {
>> +    case QOS_L3_OCCUP_EVENT_ID:
>> +        mon_type = mpam_feat_msmon_csu;
>> +        break;
>> +    case QOS_L3_MBM_LOCAL_EVENT_ID:
>> +    case QOS_L3_MBM_TOTAL_EVENT_ID:
>> +        mon_type = mpam_feat_msmon_mbwu;
>> +        break;
>> +    default:
>> +        return -EINVAL;
>> +    }
>> +
>> +    return read_mon_cdp_safe(mon, mon_comp, mon_type, mon_idx,
>> +                 closid, rmid, val);
>> +}
>> +
>
>
> Best regards,
> Zeng Heng