Re: [PATCH v1 1/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

From: Peter Zijlstra
Date: Wed Oct 09 2024 - 05:08:06 EST


On Thu, Oct 03, 2024 at 08:44:38PM -0400, Mathieu Desnoyers wrote:
> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
> a reference to the concurrency id allocated for each CPU. This reference
> expires shortly after a 100ms delay.
>
> These per-CPU references keep the per-mm-cid data cache-local in
> situations where threads are running at least once on each CPU within
> each 100ms window, thus keeping the per-cpu reference alive.
>
> However, intermittent workloads behaving in bursts spaced by more than
> 100ms on each CPU exhibit bad cache locality and degraded performance
> compared to purely per-cpu data indexing, because concurrency IDs are
> allocated over various CPUs and cores, therefore losing cache locality
> of the associated data.
>
> Introduce the following changes to improve per-mm-cid cache locality:
>
> - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
> track of which mm_cid value was last used, and use it as a hint to
> attempt re-allocating the same concurrency ID the next time this
> mm/cpu needs to allocate a concurrency ID,
>
> - Add a per-mm CPUs allowed mask, which keeps track of the union of
> CPUs allowed for all threads belonging to this mm. This cpumask is
> only set during the lifetime of the mm, never cleared, so it
> represents the union of all the CPUs allowed since the beginning of
> the mm lifetime. (note that the mm_cpumask() is really arch-specific
> and tailored to the TLB flush needs, and is thus _not_ a viable
> approach for this)

Because my juice came with an excessive dose of pedantry this morning --
the previous and next items end with a comma, this being an enumeration;
but this one has a full stop, suggesting the iteration is at an end.


> - Add a per-mm nr_cpus_allowed to keep track of the weight of the
> per-mm CPUs allowed mask (for fast access),
>
> - Add a per-mm nr_cids_used to keep track of the highest concurrency
> ID allocated for the mm. This is used for expanding the concurrency ID
> allocation within the upper bound defined by:

The description and the naming disagree -- even though, from vague
memories, they end up being much the same thing -- and it is a stumbling
block this morning. The description suggests this should be called
max_cid or somesuch.

Also, is it actually used for anything? I found the tracking code in
__mm_cid_try_get(), but nothing seems to actually consume the value?

> min(mm->nr_cpus_allowed, mm->mm_users)
>
> When the next unused CID value reaches this threshold, stop trying
> to expand the cid allocation and use the first available cid value
> instead.
>
> Spreading allocation to use all the cid values within the range
>
> [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
>
> improves cache locality while preserving mm_cid compactness within the
> expected user limits.

This paragraph seems to rudely interrupt the iteration? Or has (Fred)
Colon gone missing again just where a new iteration should start?

(Damn, and now I need me a Nobby reference somehow)

Anyway, I have vague memories I strongly suggested keeping the CID space
dense at some point :-)

> - In __mm_cid_try_get, only return cid values within the range
> [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
> prevents allocating cids above the number of allowed cpus in
> rare scenarios where cid allocation races with a concurrent
> remote-clear of the per-mm/cpu cid. This improvement is made
> possible by the addition of the per-mm CPUs allowed mask.

and no comma to continue the iteration.

> - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
> t->nr_cpus_allowed. This criterion was really meant to compare
> the number of mm->mm_users to the number of CPUs allowed for the
> entire mm. Therefore, the prior comparison worked fine when all
> threads shared the same CPUs allowed mask, but not so much in
> scenarios where those threads have different masks (e.g. each
> thread pinned to a single CPU). This improvement is made
> possible by the addition of the per-mm CPUs allowed mask.
>

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e3bdf8e38bc..8b5a185b4d5a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -782,6 +782,7 @@ struct vm_area_struct {
> struct mm_cid {
> u64 time;
> int cid;
> + int recent_cid;
> };
> #endif
>
> @@ -852,6 +853,27 @@ struct mm_struct {
> * When the next mm_cid scan is due (in jiffies).
> */
> unsigned long mm_cid_next_scan;
> + /**
> + * @nr_cpus_allowed: Number of CPUs allowed for mm.
> + *
> + * Number of CPUs allowed in the union of all mm's
> + * threads allowed CPUs.
> + */
> + atomic_t nr_cpus_allowed;
> + /**
> + * @nr_cids_used: Number of used concurrency IDs.
> + *
> + * Track the highest concurrency ID allocated for the
> + * mm: nr_cids_used - 1.
> + */
> + atomic_t nr_cids_used;
> + /**
> + * @cpus_allowed_lock: Lock protecting mm cpus_allowed.
> + *
> + * Provide mutual exclusion for mm cpus_allowed and
> + * mm nr_cpus_allowed updates.

If the nr_cpus_allowed update is serialized by this here thing, why is it
an atomic_t? A quick search seems to suggest you're only ever using
atomic_set() / atomic_read() on it, which is a big fat clue it shouldn't
be an atomic_t.

Am I missing something?
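
Something like a plain int with READ_ONCE()/WRITE_ONCE() would be the
usual pattern here -- very much a sketch, assuming all updates really do
stay under cpus_allowed_lock and readers only ever need a single racy
load:

	/* struct mm_struct: */
	int nr_cpus_allowed;	/* updates serialized by cpus_allowed_lock */

	/* updater, with cpus_allowed_lock held: */
	WRITE_ONCE(mm->nr_cpus_allowed, cpumask_weight(mm_allowed));

	/* lockless readers: */
	nr_allowed = READ_ONCE(mm->nr_cpus_allowed);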

> + */
> + spinlock_t cpus_allowed_lock;
> #endif
> #ifdef CONFIG_MMU
> atomic_long_t pgtables_bytes; /* size of all page tables */
> @@ -1170,18 +1192,30 @@ static inline int mm_cid_clear_lazy_put(int cid)
> return cid & ~MM_CID_LAZY_PUT;
> }
>
> +/*
> + * mm_cpus_allowed: Union of all mm's threads allowed CPUs.
> + */
> +static inline cpumask_t *mm_cpus_allowed(struct mm_struct *mm)
> +{
> + unsigned long bitmap = (unsigned long)mm;
> +
> + bitmap += offsetof(struct mm_struct, cpu_bitmap);
> + /* Skip cpu_bitmap */
> + bitmap += cpumask_size();
> + return (struct cpumask *)bitmap;
> +}
> +
> /* Accessor for struct mm_struct's cidmask. */
> static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
> {
> - unsigned long cid_bitmap = (unsigned long)mm;
> + unsigned long cid_bitmap = (unsigned long)mm_cpus_allowed(mm);
>
> - cid_bitmap += offsetof(struct mm_struct, cpu_bitmap);
> - /* Skip cpu_bitmap */
> + /* Skip mm_cpus_allowed */
> cid_bitmap += cpumask_size();
> return (struct cpumask *)cid_bitmap;
> }
>
> -static inline void mm_init_cid(struct mm_struct *mm)
> +static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
> {
> int i;
>
> @@ -1189,17 +1223,22 @@ static inline void mm_init_cid(struct mm_struct *mm)
> struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
>
> pcpu_cid->cid = MM_CID_UNSET;
> + pcpu_cid->recent_cid = MM_CID_UNSET;
> pcpu_cid->time = 0;
> }
> + atomic_set(&mm->nr_cpus_allowed, p->nr_cpus_allowed);
> + atomic_set(&mm->nr_cids_used, 0);
> + spin_lock_init(&mm->cpus_allowed_lock);
> + cpumask_copy(mm_cpus_allowed(mm), p->cpus_ptr);

Should that not be using p->cpus_mask? I mean, it is unlikely this code
is run during migrate_disable(), but just in case that ever does happen,
we'd be getting a spurious single-CPU mask.
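
That is, something like the below -- sketch only; cpus_mask keeps the
full affinity, whereas cpus_ptr can temporarily point at a single-CPU
mask inside a migrate_disable() section:

static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
{
	int i;

	for_each_possible_cpu(i) {
		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);

		pcpu_cid->cid = MM_CID_UNSET;
		pcpu_cid->recent_cid = MM_CID_UNSET;
		pcpu_cid->time = 0;
	}
	atomic_set(&mm->nr_cpus_allowed, p->nr_cpus_allowed);
	atomic_set(&mm->nr_cids_used, 0);
	spin_lock_init(&mm->cpus_allowed_lock);
	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);	/* not p->cpus_ptr */
	cpumask_clear(mm_cidmask(mm));
}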

> cpumask_clear(mm_cidmask(mm));
> }
>
> -static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
> +static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
> {
> mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
> if (!mm->pcpu_cid)
> return -ENOMEM;
> - mm_init_cid(mm);
> + mm_init_cid(mm, p);
> return 0;
> }
> #define mm_alloc_cid(...) alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))
> @@ -1212,16 +1251,31 @@ static inline void mm_destroy_cid(struct mm_struct *mm)
>
> static inline unsigned int mm_cid_size(void)
> {
> - return cpumask_size();
> + return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask(). */
> +}
> +
> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
> +{
> + struct cpumask *mm_allowed = mm_cpus_allowed(mm);
> +
> + if (!mm)
> + return;
> + /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
> + spin_lock(&mm->cpus_allowed_lock);
> + cpumask_or(mm_allowed, mm_allowed, cpumask);
> + atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
> + spin_unlock(&mm->cpus_allowed_lock);

We have a problem here: you call this from __do_set_cpus_allowed(), which
is holding rq->lock, which is a raw_spinlock_t -- and cpus_allowed_lock is
a spinlock_t, which becomes a sleeping lock on PREEMPT_RT and thus must
not nest inside a raw_spinlock_t.
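
If the critical section really is as small as it looks, the obvious
escape -- sketch only, not even compile tested -- would be making it a
raw one:

	/* struct mm_struct: */
	raw_spinlock_t cpus_allowed_lock;

	/* mm_set_cpus_allowed(): */
	raw_spin_lock(&mm->cpus_allowed_lock);
	cpumask_or(mm_allowed, mm_allowed, cpumask);
	atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
	raw_spin_unlock(&mm->cpus_allowed_lock);

	/* with raw_spin_lock_init() in mm_init_cid(), obviously */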

> }
> #else /* CONFIG_SCHED_MM_CID */
> -static inline void mm_init_cid(struct mm_struct *mm) { }
> -static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
> +static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
> +static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
> static inline void mm_destroy_cid(struct mm_struct *mm) { }
> +
> static inline unsigned int mm_cid_size(void)
> {
> return 0;
> }
> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
> #endif /* CONFIG_SCHED_MM_CID */
>
> struct mmu_gather;

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 43e453ab7e20..772a3daf784a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2691,6 +2691,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
> put_prev_task(rq, p);
>
> p->sched_class->set_cpus_allowed(p, ctx);
> + mm_set_cpus_allowed(p->mm, ctx->new_mask);

This here is with p->pi_lock and rq->lock held -- both raw_spinlock_t --
so the same locking problem as above.

>
> if (queued)
> enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);