Re: [RFC PATCH 2/3] sched/topology: Rework CPU capacity asymmetry detection
From: Beata Michalska
Date: Thu Apr 22 2021 - 06:27:36 EST
On Wed, Apr 21, 2021 at 08:58:01PM +0100, Valentin Schneider wrote:
Hi Valentin,
>
> Hi Beata,
>
> On 16/04/21 14:01, Beata Michalska wrote:
> > Currently the CPU capacity asymmetry detection, performed through
> > asym_cpu_capacity_level, tries to identify the lowest topology level
> > at which the highest CPU capacity is being observed, not necessarily
> > finding the level at which all possible capacity values are visible
> > to all CPUs
>
> Despite the latter being what it says on the tin! (see comment atop
> asym_cpu_capacity_level())
There are some labels on the tins out there that can be deceiving ...
>
> > , which might be a bit problematic for some possible/valid
> > asymmetric topologies i.e.:
> >
> > DIE      [                                ]
> > MC       [                       ][       ]
> >
> > CPU       [0] [1] [2] [3] [4] [5]  [6] [7]
> > Capacity  |.....| |.....| |.....|  |.....|
> >              L       M       B        B
> >
> > Where:
> > arch_scale_cpu_capacity(L) = 512
> > arch_scale_cpu_capacity(M) = 871
> > arch_scale_cpu_capacity(B) = 1024
> >
> > In this particular case, the asymmetric topology level will point
> > at MC, as all possible CPU masks for that level do cover the CPU
> > with the highest capacity. It will work just fine for the first
> > cluster, not so much for the second one though (consider
> > find_energy_efficient_cpu, which might end up attempting an
> > energy-aware wake-up for a domain that does not see any asymmetry at all)
> >
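Spelling out the per-domain view for the example above (with MC spanning
[0-5] and [6-7]): MC[0-5] sees the full set of capacities {512, 871, 1024},
MC[6-7] sees {1024} only, so no asymmetry at all, while DIE sees all three.
With sd_asym_cpucapacity pointing at MC for every CPU, find_energy_efficient_cpu()
on CPUs 6-7 ends up working on a domain with no visible asymmetry.
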
>
> Another problematic topology is something the likes of
>
> DIE [         ]
> MC  [   ][    ]
>      L M  B B
>
> Because here the asymmetric tl will be DIE, so we won't properly recognize
> that MC domain with L+M as having CPU asymmetry. That means no misfit
> upmigration from L to M, for one.
>
> The Exynos-based Galaxy S10 *almost* matches that topology - from what I've
> been able to scrounge, all CPUs are hooked up to the same LLC *but* the big
> CPUs have exclusive access to some part of it. From the devicetree files
> I've been able to see, the big cores are actually described as having their
> own LLC.
>
That's also an example of the wake-up path being modified to work around the
current problem (looking at the available kernel sources for that platform).
> Regardless, the topology you describe in this changelog is something that's
> achievable by cobbling two DynamIQ clusters (each with their own LLC) to an
> interconnect, which the architecture supports (IIRC up to something like 32
> clusters).
>
> > Rework the way the capacity asymmetry levels are being detected,
> > to point to the lowest topology level (for a given CPU) where the
> > full range of available CPU capacities is visible to all CPUs within
> > a given domain. As a result, the per-cpu sd_asym_cpucapacity might
> > differ across the domains. This will have an impact on EAS wake-up
> > placement, in that a different range of CPUs might be considered,
> > depending on the given current and target CPUs.
> >
> > Additionally, the levels where any asymmetry (not necessarily the
> > full range) is detected will get identified as well.
> > The selected asymmetric topology level will be denoted by the
> > SD_ASYM_CPUCAPACITY_FULL sched domain flag, whereas the 'sub-levels'
> > will receive the already existing SD_ASYM_CPUCAPACITY flag. This allows
> > maintaining the current behaviour for asymmetric topologies, with
> > misfit migration operating correctly on lower levels, if applicable,
> > as any asymmetry is enough to trigger the misfit migration.
> > The logic there relies on the SD_ASYM_CPUCAPACITY flag and does not
> > relate to the full asymmetry level denoted by the sd_asym_cpucapacity
> > pointer.
> >
> > Signed-off-by: Beata Michalska <beata.michalska@xxxxxxx>
>
> Most of this looks OK to me, I have a few comments below but nothing
> major. FWIW I've appended a diff at the tail of this email which covers
> most of said comments.
>
> > ---
> > kernel/sched/topology.c | 339 ++++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 299 insertions(+), 40 deletions(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 09d3504..9dfa66b 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -675,7 +675,7 @@ static void update_top_cache_domain(int cpu)
> > sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> > rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> >
> > - sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
> > + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> > rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> > }
> >
> > @@ -1958,65 +1958,322 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
> >
> > return true;
> > }
> > +/**
> > + * Asym capacity bits
> > + */
> > +
> > +/**
> > + * Cached cpu masks for those sched domains, at a given topology level,
> > + * that do represent CPUs with asymmetric capacities.
> > + *
> > + * Each topology level will get the cached data assigned,
> > + * with asym cap sched_flags (SD_ASYM_CPUCAPACITY and SD_ASYM_CPUCAPACITY_FULL
> > + * accordingly) and the corresponding cpumask for:
> > + * - domains that do span CPUs with different capacities
> > + * - domains where all CPU capacities are visible for all CPUs within
> > + * the domain
> > + *
> > + * Within a single topology level there might be domains
> > + * with different scope of asymmetry:
> > + * none -> .
> > + * partial -> SD_ASYM_CPUCAPACITY
> > + * full -> SD_ASYM_CPUCAPACITY|SD_ASYM_CPUCAPACITY_FULL
> > + */
> > +struct asym_cache_data {
> > +
> > + struct sched_domain_topology_level *tl;
> > + unsigned int sched_flags;
> > + struct cpumask *asym_mask;
> > + struct cpumask *asym_full_mask;
> > +};
> > +
>
> I'll dump this here because I think it can be useful to whoever else will
> stare at this:
>
> This is pretty much an extension of struct sched_domain_topology_level,
> providing a new / modified tl->sd_flags() output. Unfortunately, said
> output requires either a cpumask per flag per topology level or a flag
> accumulator per topology level per CPU.
>
> In light of this, it's preferable to keep this extra data outside of the
> sched_domain_topology_level struct and have its lifespan limited to the
> domain build, which is what's being done here.
>
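To make this more concrete: for the changelog example, the MC level entry
would roughly end up with sched_flags = SD_ASYM_CPUCAPACITY |
SD_ASYM_CPUCAPACITY_FULL, asym_full_mask = [0-5] and an empty asym_mask,
while the DIE level entry carries both flags with no masks cached at all
(the maskless case mentioned below).
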
> > +*asym_cpu_capacity_scan(const struct cpumask *cpu_map)
> > {
> > - int i, j, asym_level = 0;
> > - bool asym = false;
> > + /*
> > + * Simple data structure to record all available CPU capacities.
> > + * Additional scan level allows tracking unique capacities per each
> > + * topology level and each separate topology level CPU mask.
> > + * During each scan phase, the scan level will allow to determine,
> > + * whether given capacity has been already accounted for, by syncing
> > + * it with the scan stage id.
> > + */
> > + struct capacity_entry {
> > + struct list_head link;
> > + unsigned long capacity;
> > + unsigned int scan_level;
> > + };
> > +
> > struct sched_domain_topology_level *tl, *asym_tl = NULL;
> > - unsigned long cap;
> > + struct asym_cache_data *scan_data = NULL;
> > + struct capacity_entry *entry = NULL, *tmp;
> > + unsigned int level_count = 0;
> > + unsigned int cap_count = 0;
> > + unsigned int scan_id = 0;
> > + LIST_HEAD(capacity_set);
> > + unsigned long capacity;
> > + cpumask_var_t cpu_mask;
> > + int cpu;
> >
> > - /* Is there any asymmetry? */
> > - cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));
> > + /* Build-up a list of all CPU capacities, verifying on the way
> > + * if there is any asymmetry at all
>
> That's a wrong comment style.
>
> > /*
> > - * Examine topology from all CPU's point of views to detect the lowest
> > - * sched_domain_topology_level where a highest capacity CPU is visible
> > - * to everyone.
> > + * Allocate an array to store cached data
> > + * per each topology level + sentinel
> > */
> > - for_each_cpu(i, cpu_map) {
> > - unsigned long max_capacity = arch_scale_cpu_capacity(i);
> > - int tl_id = 0;
> > + scan_data = kcalloc(level_count + 1, sizeof(*scan_data), GFP_KERNEL);
>
> Given we have one cache per tl, do we need the sentinel? I gave that a shot
> and it didn't explode, also further simplified asym_cpu_capacity_verify(),
> see appended diff.
>
It won't explode now indeed, though the sentinel here is to make it
bulletproof against potential future changes to the code around it.
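To illustrate the point with a (hypothetical, simplified) helper rather than
the actual patch code - with the zero-filled sentinel, any walker can stop on
data->tl == NULL without having the level count at hand:

        static struct asym_cache_data *
        asym_cache_data_find(struct asym_cache_data *data,
                             struct sched_domain_topology_level *tl)
        {
                /* kcalloc leaves one zeroed entry past the last level */
                while (data->tl) {
                        if (data->tl == tl)
                                return data;
                        ++data;
                }
                return NULL;    /* hit the sentinel: level not cached */
        }

Dropping the sentinel means each walker has to either use
for_each_sd_topology() or carry an index around, which is what your diff
ends up doing.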
> > + if (!scan_data) {
> > + free_cpumask_var(cpu_mask);
> > + goto leave;
> > + }
> [...]
> > + for_each_cpu_and(i, mask, cpu_map) {
> > +
> > + capacity = arch_scale_cpu_capacity(i);
> >
> > - max_capacity = capacity;
> > - asym_level = tl_id;
> > - asym_tl = tl;
> > + list_for_each_entry(entry, &capacity_set, link) {
> > + if (entry->capacity == capacity
> > + && entry->scan_level < scan_id) {
> ^^
> Operand should be at EOL.
>
> > + entry->scan_level = scan_id;
> > + ++local_cap_count;
> > + }
> > + }
> > + __cpumask_clear_cpu(i, cpu_mask);
> > + }
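For anyone else reading along: cap_count holds the total number of unique
capacity values in the system (collected in the initial pass), and for each
scanned mask local_cap_count counts how many of those values the mask can
actually see (the scan_level stamp makes sure each value is counted at most
once per scan). The classification that follows (elided from the quote) is
then conceptually along the lines of:

        if (local_cap_count == cap_count) {
                /* mask sees the full range of capacities */
                cpumask_or(data->asym_full_mask, data->asym_full_mask, mask);
        } else if (local_cap_count > 1) {
                /* mask sees some, but not all, of the capacities */
                cpumask_or(data->asym_mask, data->asym_mask, mask);
        }
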
> > @@ -2049,9 +2306,10 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >
> > sd = NULL;
> > for_each_sd_topology(tl) {
> > - if (tl == tl_asym) {
> > - dflags |= SD_ASYM_CPUCAPACITY;
> > - has_asym = true;
> > + if (!(dflags & SD_ASYM_CPUCAPACITY_FULL)) {
> > + dflags |= asym_cpu_capacity_verify(asym_scan_data,
> > + tl, i);
> > + has_asym = dflags & SD_ASYM_CPUCAPACITY;
> > }
>
> Given this dflags & SD_ASYM_CPUCAPACITY_FULL check, is the maskless
> optimization thing actually required?
>
> AIUI, for any CPU, the first topology level where we'll set
> SD_ASYM_CPUCAPACITY_FULL should have a matching
> asym_scan_data[tlid]->asym_full_mask, and all subsequent tl's will see that
> in dflags and not call into asym_cpu_capacity_verify().
>
That's the point here, yes.
It doesn't bring a huge benefit, but I still prefer to skip unnecessary
iterations through the cache table where possible. Although with the proposed
changes that might go away.
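For the changelog example that works out roughly as follows: for CPU0, the MC
level verify() already returns SD_ASYM_CPUCAPACITY|SD_ASYM_CPUCAPACITY_FULL
(CPU0 is in MC's asym_full_mask), so the DIE iteration skips the call
altogether; for CPU6, MC returns 0 and DIE returns both flags. So it is
merely a handful of table walks saved per CPU.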
> >
> > if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
> > @@ -2068,6 +2326,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > }
> > }
> >
> > + asym_cpu_capacity_release_data(asym_scan_data);
> > /* Build the groups for the domains */
> > for_each_cpu(i, cpu_map) {
> > for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > --
> > 2.7.4
>
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 1f965293cc7e..31d89868f208 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2011,102 +2011,80 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
> * full -> SD_ASYM_CPUCAPACITY|SD_ASYM_CPUCAPACITY_FULL
> */
> struct asym_cache_data {
> -
> - struct sched_domain_topology_level *tl;
> unsigned int sched_flags;
> struct cpumask *asym_mask;
> struct cpumask *asym_full_mask;
> };
>
> -static inline int asym_cpu_capacity_verify(struct asym_cache_data *data,
> - struct sched_domain_topology_level *tl, int cpu)
> +static inline int asym_cpu_capacity_verify(struct asym_cache_data *data, int cpu)
> {
> - int flags = 0;
> -
> if (!data)
> goto leave;
>
> - while (data->tl) {
> - if (data->tl != tl) {
> - ++data;
> - continue;
> - }
> -
> - if (!data->sched_flags)
> - break;
> - /*
> - * For topology levels above one, where all CPUs observe
> - * all available capacities, CPUs mask is not being
> - * cached for optimization reasons, assuming, that at this
> - * point, all possible CPUs are being concerned.
> - * Those levels will have both:
> - * SD_ASYM_CPUCAPACITY and SD_ASYM_CPUCAPACITY_FULL
> - * flags set.
> - */
> - if (data->sched_flags & SD_ASYM_CPUCAPACITY_FULL &&
> - !data->asym_full_mask) {
> - flags = data->sched_flags;
> - break;
> - }
> -
> - if (data->asym_full_mask &&
> - cpumask_test_cpu(cpu, data->asym_full_mask)) {
> - flags = data->sched_flags;
> - break;
> - }
> - /*
> - * A given topology level might be marked with
> - * SD_ASYM_CPUCAPACITY_FULL mask but only for a certain subset
> - * of CPUs.
> - * Consider the following:
> - * #1
> - *
> - * DIE [ ]
> - * MC [ ][ ]
> - * [0] [1] [2] [3] [4] [5] [6] [7]
> - * |.....| |.....| |.....| |.....|
> - * L M B B
> - *
> - * where:
> - * arch_scale_cpu_capacity(L) = 512
> - * arch_scale_cpu_capacity(M) = 871
> - * arch_scale_cpu_capacity(B) = 1024
> - *
> - * MC topology level will be marked with both
> - * SD_ASYM_CPUCAPACITY flags, but the relevant masks will be:
> - * asym_full_mask = [0-5]
> - * asym_mask empty (no other asymmetry apart from
> - * already covered [0-5])
> - *
> - * #2
> - *
> - * DIE [ ]
> - * MC [ ][ ]
> - * [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
> - * |.....| |.....| |.....| |.....| |.....|
> - * L M B L B
> - *
> - * MC topology level will be marked with both
> - * SD_ASYM_CPUCAPACITY flags, but the relevant masks will be:
> - * asym_full_mask = [0-5]
> - * asym_mask = [6-9]
> - */
> - if (data->asym_mask && cpumask_test_cpu(cpu, data->asym_mask))
> - flags = SD_ASYM_CPUCAPACITY;
> - break;
> + if (!data->sched_flags)
> + goto leave;
> + /*
> + * For topology levels above one, where all CPUs observe all available
> + * capacities, CPUs mask is not being cached for optimization reasons,
> + * assuming, that at this point, all possible CPUs are being concerned.
> + * Those levels will have both: SD_ASYM_CPUCAPACITY and
> + * SD_ASYM_CPUCAPACITY_FULL flags set.
> + */
> + if (data->sched_flags & SD_ASYM_CPUCAPACITY_FULL && !data->asym_full_mask)
> + return data->sched_flags;
>
> - }
> + if (data->asym_full_mask && cpumask_test_cpu(cpu, data->asym_full_mask))
> + return data->sched_flags;
> + /*
> + * A given topology level might be marked with SD_ASYM_CPUCAPACITY_FULL
> + * mask but only for a certain subset of CPUs.
> + * Consider the following:
> + * #1
> + *
> + * DIE [ ]
> + * MC [ ][ ]
> + * [0] [1] [2] [3] [4] [5] [6] [7]
> + * |.....| |.....| |.....| |.....|
> + * L M B B
> + *
> + * where:
> + * arch_scale_cpu_capacity(L) = 512
> + * arch_scale_cpu_capacity(M) = 871
> + * arch_scale_cpu_capacity(B) = 1024
> + *
> + * MC topology level will be marked with both SD_ASYM_CPUCAPACITY flags,
> + * but the relevant masks will be:
> + * asym_full_mask = [0-5]
> + * asym_mask empty (no other asymmetry apart from
> + * already covered [0-5])
> + *
> + * #2
> + *
> + * DIE [ ]
> + * MC [ ][ ]
> + * [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
> + * |.....| |.....| |.....| |.....| |.....|
> + * L M B L B
> + *
> + * MC topology level will be marked with both SD_ASYM_CPUCAPACITY flags,
> + * but the relevant masks will be:
> + * asym_full_mask = [0-5]
> + * asym_mask = [6-9]
> + */
> + if (data->asym_mask && cpumask_test_cpu(cpu, data->asym_mask))
> + return SD_ASYM_CPUCAPACITY;
> leave:
> - return flags;
> + return 0;
> }
>
>
> static inline void asym_cpu_capacity_release_data(struct asym_cache_data *data)
> {
> + struct sched_domain_topology_level *tl;
> struct asym_cache_data *__data = data;
>
> if (data) {
> - while (data->tl) {
> + for_each_sd_topology(tl) {
> if (!data->sched_flags)
> goto next;
> if (data->sched_flags & SD_ASYM_CPUCAPACITY_FULL)
> @@ -2114,7 +2092,7 @@ static inline void asym_cpu_capacity_release_data(struct asym_cache_data *data)
> kfree(data->asym_mask);
> next:
> ++data;
> - };
> + }
> kfree(__data);
> }
> }
> @@ -2168,7 +2146,8 @@ static struct asym_cache_data
> cpumask_var_t cpu_mask;
> int cpu;
>
> - /* Build-up a list of all CPU capacities, verifying on the way
> + /*
> + * Build-up a list of all CPU capacities, verifying on the way
> * if there is any asymmetry at all
> */
> for_each_cpu(cpu, cpu_map) {
> @@ -2201,11 +2180,8 @@ static struct asym_cache_data
>
> /* Get the number of topology levels */
> for_each_sd_topology(tl) level_count++;
> - /*
> - * Allocate an array to store cached data
> - * per each topology level + sentinel
> - */
> - scan_data = kcalloc(level_count + 1, sizeof(*scan_data), GFP_KERNEL);
> + /* Allocate an array to store cached data per each topology level */
> + scan_data = kcalloc(level_count, sizeof(*scan_data), GFP_KERNEL);
> if (!scan_data) {
> free_cpumask_var(cpu_mask);
> goto leave;
> @@ -2221,19 +2197,16 @@ static struct asym_cache_data
>
> #ifdef CONFIG_NUMA
> /*
> - * For NUMA we might end-up in a sched domain
> - * that spans numa nodes with cpus with
> - * different capacities which would not be caught
> - * by the above scan as those will have
> - * separate cpumasks - subject to numa level
> + * For NUMA we might end-up in a sched domain that spans numa
> + * nodes with cpus with different capacities which would not be
> + * caught by the above scan as those will have separate cpumasks
> + * - subject to numa level
> * @see: sched_domains_curr_level & sd_numa_mask
> * Considered to be a no-go
> */
> if (WARN_ON_ONCE(tl->numa_level && !full_asym))
> goto leave;
> #endif
> - data->tl = tl;
> -
> if (asym_tl) {
> data->sched_flags = SD_ASYM_CPUCAPACITY |
> SD_ASYM_CPUCAPACITY_FULL;
> @@ -2247,10 +2220,10 @@ static struct asym_cache_data
> int i;
>
> /*
> - * Tracking each CPU capacity 'scan' id
> - * to distinguish discovered capacity sets
> - * between different CPU masks at each topology level:
> - * capturing unique capacity values at each scan stage
> + * Tracking each CPU capacity 'scan' id to distinguish
> + * discovered capacity sets between different CPU masks
> + * at each topology level: capturing unique capacity
> + * values at each scan stage
> */
> ++scan_id;
> local_cap_count = 0;
> @@ -2261,8 +2234,8 @@ static struct asym_cache_data
> capacity = arch_scale_cpu_capacity(i);
>
> list_for_each_entry(entry, &capacity_set, link) {
> - if (entry->capacity == capacity
> - && entry->scan_level < scan_id) {
> + if (entry->capacity == capacity &&
> + entry->scan_level < scan_id) {
> entry->scan_level = scan_id;
> ++local_cap_count;
> }
> @@ -2334,12 +2307,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> for_each_cpu(i, cpu_map) {
> struct sched_domain_topology_level *tl;
> int dflags = 0;
> + int tlid = 0;
>
> sd = NULL;
> for_each_sd_topology(tl) {
> if (!(dflags & SD_ASYM_CPUCAPACITY_FULL)) {
> - dflags |= asym_cpu_capacity_verify(asym_scan_data,
> - tl, i);
> + dflags |= asym_cpu_capacity_verify(&asym_scan_data[tlid], i);
> has_asym = dflags & SD_ASYM_CPUCAPACITY;
> }
>
> @@ -2354,6 +2327,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> sd->flags |= SD_OVERLAP;
> if (cpumask_equal(cpu_map, sched_domain_span(sd)))
> break;
> +
> + tlid++;
> }
> }
>
This might indeed simplify things a bit.
I've tried to keep the changes self-contained and not 'expose' the internals
of the caching mechanism to its users, but I might drop that idea in favour
of making the code less cumbersome.
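I.e. the difference boils down to callers doing

        dflags |= asym_cpu_capacity_verify(asym_scan_data, tl, i);

with the per-tl lookup hidden inside, versus

        dflags |= asym_cpu_capacity_verify(&asym_scan_data[tlid], i);

where build_sched_domains() needs to know the cache is indexed by topology
level.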
I will let it sit for a day or two before sending off v2, hoping for more input
on the changes.
Thanks for your feedback.
---
BR
B.