Re: [PATCH] x86/resctrl: Clear the staged configs when destroying schemata list
From: Reinette Chatre
Date: Tue Sep 27 2022 - 17:21:45 EST
Hi James and Shawn,
On 9/27/2022 6:06 AM, James Morse wrote:
> On 27/09/2022 03:54, Shawn Wang wrote:
>> Array staged_config in struct rdt_domain still maintains the original value when
>> resctrl is unmounted. If resctrl is mounted with cdp option and then remounted
>> without cdp option, field have_new_ctrl in staged_config[CDP_CODE] and
>> staged_config[CDP_DATA] will still be true.
>
> staged_config[CDP_DATA] is an array - its always 'true'. I think you mean
> staged_config[CDP_DATA].have_new_ctrl, which will still be true because it is only
> memset() when the schemata file is written to.
>
>
>> Since resctrl_arch_update_domains()
>> traverses all resctrl_conf_type, it will continue to update CDP_CODE and
>> CDP_DATA configurations, which can cause overflow problem.
>
> Only if its called with a stale staged config, and it should only be called when the
> schemata file is written to, which would memset() the staged config first.
>
>
>> The problem can be reproduced by the following commands:
>> # A system with 16 usable closids and mba disabled
>> mount -t resctrl resctrl -o cdp /sys/fs/resctrl
>> mkdir /sys/fs/resctrl/p{1..7}
>> umount /sys/fs/resctrl/
>> mount -t resctrl resctrl /sys/fs/resctrl
>> mkdir /sys/fs/resctrl/p{1..8}
>
> Thanks for the reproducer - but I don't see what could set have_new_ctrl in this sequence.
> You can't call apply_config() to set CPUs in the mask without that being set.
>
> Creating a new control group, (your mkdir step) shouldn't touch the hardware at all, as it
> should be left in its reset state from the last umount(), or setup.
There is an attempt to configure the hardware in the mkdir path:
rdtgroup_mkdir()->rdtgroup_mkdir_ctrl_mon()->rdtgroup_init_alloc()->resctrl_arch_update_domains()
This is required in support of the different resource group modes. When a new
resource group is created via mkdir the configuration should respect any
exclusive resource groups that exist at that point.
>
> I can't reproduce this on v6.0-rc7.
> Even if I dirty the configuration by writing to the schemata file, I can't reproduce this.
>
>From what I can tell the reproducer is dependent on (a) whether hardware
supports CDP, and (b) the number of CLOSIDs supported by the system. The reproducer
works for hardware that has 16 CLOSIDs (see later).
> (I have mba enabled, but all this should affect is the number of closid available)
>
>
>> dmesg will generate the following error:
>
> Which kernel version is this?
>
>> [ 6180.939345] unchecked MSR access error: WRMSR to 0xca0 (tried to write
>> 0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60)
>
> Is 0x7ff the default CBM bitmap for this CPU? Or was it written in a step missing from the
> reproducer above?
The value of interest here is the register it tries to write to ... 0xca0.
On a system with 16 CLOSIDs the range of registers available to set the CBM
would be 0xc90 to 0xc9f that corresponds to CLOSID 0 to CLOSID 15. The error is
an attempt to write to an unsupported register - there appears to have been an
attempt to configure non-existent CLOSID 16.
As Shawn already root-caused, this is because the staged_config contains data from
the previous run when CDP was enabled and it is never cleared before the resource group
creation flow (triggered by mkdir).
CDP enabled:
mkdir <resource group>
domainX.staged_config[CDP_NONE].have_new_ctrl = false
domainX.staged_config[CDP_NONE].new_ctrl = 0
domainX.staged_config[CDP_CODE].have_new_ctrl = true
domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff
domainX.staged_config[CDP_DATA].have_new_ctrl = true
domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff
unmount/remount resctrl (CDP disabled):
mkdir <resource group>
domainX.staged_config[CDP_NONE].have_new_ctrl = true
domainX.staged_config[CDP_NONE].new_ctrl = 0x7ff
domainX.staged_config[CDP_CODE].have_new_ctrl = true /* stale */
domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff /* stale */
domainX.staged_config[CDP_DATA].have_new_ctrl = true /* stale */
domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff /* stale */
Above becomes an issue when the resource group being created is
for a CLOSID # that is more than half of the CLOSIDs supported.
In the reproducer the issue was encountered when creating resource
group for CLOSID 8 on a system that supports 16 CLOSIDs.
In this case get_config_index() called from
resctrl_arch_update_domains() will return 16 and 17 when processing
this resource group and that translated to an invalid register - 0xca0 in this
scenario.
> The rest of this splat isn't helpful as its the result of an IPI...
>
>> [ 6180.951983] Call Trace:
>> [ 6180.954516] <IRQ>
>> [ 6180.956619] __flush_smp_call_function_queue+0x11d/0x170
>> [ 6180.962028] __sysvec_call_function+0x24/0xd0
>> [ 6180.966485] sysvec_call_function+0x89/0xc0
>> [ 6180.970760] </IRQ>
>> [ 6180.972947] <TASK>
>> [ 6180.975131] asm_sysvec_call_function+0x16/0x20
>> [ 6180.979757] RIP: 0010:cpuidle_enter_state+0xcd/0x400
>> [ 6180.984821] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 1e e5 77 ff 45 84
>> ff 74 12 9c 58 f6 c4 02 0f 85 13 03 00 00 31 ff e8 67 70 7d ff fb 45 85
>> f6 <0f> 88 75 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
>> [ 6181.003710] RSP: 0018:ffffffff83a03e48 EFLAGS: 00000202
>> [ 6181.009028] RAX: ffff943400800000 RBX: 0000000000000001 RCX: 000000000000001f
>> [ 6181.016261] RDX: 0000000000000000 RSI: ffffffff83795059 RDI: ffffffff837c101e
>> [ 6181.023490] RBP: ffff9434c9352000 R08: 0000059f1cb1a05e R09: 0000000000000008
>> [ 6181.030717] R10: 0000000000000001 R11: 0000000000005c66 R12: ffffffff83bbf3a0
>> [ 6181.037944] R13: 0000059f1cb1a05e R14: 0000000000000001 R15: 0000000000000000
>> [ 6181.045202] ? cpuidle_enter_state+0xb2/0x400
>> [ 6181.049678] cpuidle_enter+0x24/0x40
>> [ 6181.053370] do_idle+0x1dd/0x260
>> [ 6181.056713] cpu_startup_entry+0x14/0x20
>> [ 6181.060753] rest_init+0xbb/0xc0
>> [ 6181.064097] arch_call_rest_init+0x5/0xa
>> [ 6181.068137] start_kernel+0x668/0x691
>> [ 6181.071914] secondary_startup_64_no_verify+0xe0/0xeb
>> [ 6181.077086] </TASK>
>
> It would be good to know what triggered this IPI. It may not have been
> resctrl_arch_update_domains(). This pattern also happens from reset_all_ctrls() which
I believe this is indeed from resctrl_arch_update_domains() since it calls
smp_call_function_many() to configure all the domains.
> happens during umount(). (and that would write the default CBM bitmap)
>
> If you can reproduce this easily, could you add dump_stack() to update_config() to see if
> any path is setting have_new_ctrl. You aren't writing to the schemata file in your reproducer.
>
>
>> We fix this issue by clearing the staged configs when destroying schemata list.
>
>
>> Signed-off-by: Shawn Wang <shawnwang@xxxxxxxxxxxxxxxxx>
>> Suggested-by: Xin Hao <xhao@xxxxxxxxxxxxxxxxx>
>
> If we can work out why you are seeing this, it would need a Fixes tag.
>
> Otherwise I agree it makes sense to make this more robust, but it would need a different
> commit message.
What do you think about clearing the staged config within resctrl_arch_update_domains()
after the configuration is complete and there is no more need for it? That may reduce
complexity where each caller no longer need to remember to do so.
I see "staged_config" as a temporary storage and it my help to understand the code better
if it is treated as such.
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index f276aff521e8..b4a817ae83ab 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -2127,8 +2127,15 @@ static int schemata_list_create(void)
>> static void schemata_list_destroy(void)
>> {
>> struct resctrl_schema *s, *tmp;
>> + struct rdt_domain *dom;
>>
>> list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) {
>> + /*
>> + * Clear staged_config on each domain before schemata list is
>> + * destroyed.
>> + */
>> + list_for_each_entry(dom, &s->res->domains, list)
>> + memset(dom->staged_config, 0, sizeof(dom->staged_config));
>> list_del(&s->list);
>> kfree(s);
>> }
>
Reinette