Re: [PATCH v3 15/16] arch_topology: Set cluster identifier in each core/thread from /cpu-map
From: Dietmar Eggemann
Date: Mon Jun 13 2022 - 05:20:13 EST
On 10/06/2022 12:27, Sudeep Holla wrote:
> On Fri, Jun 10, 2022 at 12:08:44PM +0200, Vincent Guittot wrote:
>> On Mon, 6 Jun 2022 at 12:22, Sudeep Holla <sudeep.holla@xxxxxxx> wrote:
>>>
>
> [...]
>
>>> Why ? Are you suggesting that we shouldn't present the hardware cluster
>>> to the topology because of the above reason ? If so, sorry that is not a
>>> valid reason. We could add login to return NULL or appropriate value
>>> needed in cpu_clustergroup_mask id it matches MC level mask if we can't
>>> deal that in generic scheduler code. But the topology code can't be
>>> compromised for that reason as it is user visible.
>>
>> I tend to agree with Dietmar. The legacy use of cluster node in DT
>> refers to the dynamiQ or legacy b.L cluster which is also aligned to
>> the LLC and the MC scheduling level. The new cluster level that has
>> been introduced recently does not target this level but some
>> intermediate levels either inside like for the kupeng920 or the v9
>> complex or outside like for the ampere altra. So I would say that
>> there is one cluster node level in DT that refers to the same MC/LLC
>> level and only an additional child/parent cluster node should be used
>> to fill the clustergroup_mask.
>>
>
> Again I completely disagree. Let us look at the problems separately.
> The hardware topology that some of the tools like lscpu and lstopo expects
> what the hardware looks like and not the scheduler's view of the hardware.
> So the topology masks that gets exposed to the user-space needs fixing
> even today. I have reports from various tooling people about the same.
> E.g. Juno getting exposed as dual socket system is utter non-sense.
>
> Yes scheduler uses most of the topology masks as is but that is not a must.
> There are these *group_mask functions that can implement what scheduler
> needs to be fed.
>
> I am not sure why the 2 issues are getting mixed up and that is the main
> reason why I jumped into this to make sure the topology masks are
> not tampered based on the way it needs to be used for scheduler.
I'm all in favor of not mixing up those 2 issues. But I don't understand
why you have to glue them together.
(1) DT systems broken in userspace (lstopo shows Juno with 2 Packages)
(2) Introduce CONFIG_SCHED_CLUSTER for DT systems
(1) This can be solved with your patch-set w/o setting `(1. level)
cpu-map cluster nodes`. The `socket nodes` taking over the
functionality of the `cluster nodes` sorts out the `Juno is seen as
having 2 packages`.
This will make core_sibling not suitable for cpu_coregroup_mask()
anymore. But this is OK since llc from cacheinfo (i.e. llc_sibling)
takes over here.
There is no need to involve `cluster nodes` anymore.
(2) This will only make sense for Armv9 L2 complexes if we connect `2.
level cpu-map cluster nodes` with cluster_id and cluster_sibling.
And only then clusters would mean the same thing in ACPI and DT.
I guess this was mentioned already a couple of times.
> Both ACPI and DT on a platform must present exact same hardware topology
> to the user-space, there is no space for argument there.
>
>> IIUC, we don't describe the dynamiQ level in ACPI which uses cache
>> topology instead to define cpu_coregroup_mask whereas DT described the
>> dynamiQ instead of using cache topology. If you use cache topology
>> now, then you should skip the dynamiQ
>>
>
> Yes, unless someone can work out a binding to represent that and convince
> DT maintainers ;).
>
>> Finally, even if CLS and MC have the same scheduling behavior for now,
>> they might ends up with different scheduling properties which would
>> mean that replacing MC level by CLS one for current SoC would become
>> wrong
>>
>
> Again as I mentioned to Dietmar, that is something we can and must deal with
> in those *group_mask and not expect topology mask to be altered to meet
> CLS/MC or whatever sched domains needs. Sorry, that is my strong opinion
> as the topology is already user-space visible and (tooling) people are
> complaining that DT systems are broken and doesn't match ACPI systems.
>
> So unless someone gives me non-scheduler and topology specific reasons
> to change that, sorry but my opinion on this matter is not going to change ;).
`lstopo` is fine with a now correct /sys/.../topology/package_cpus (or
core_siblings (old filename). It's not reading
/sys/.../topology/cluster_cpus (yet) so why set it (wrongly) to 0x39 for
CPU0 on Juno when it can stay 0x01?
> You will get this view of topology, find a way to manage with all those
> *group_mask functions. By the way it is already handled for ACPI systems,
> so if you are not happy with that, then that needs fixing as this change
> set just aligns the behaviour on similar ACPI system. So the Juno example
> is incorrect for the reason that the behaviour of scheduler there is different
> with DT and ACPI.
[...]