Re: [RFC PATCH v8] sched: Fix performance regression introduced by mm_cid

From: Mathieu Desnoyers
Date: Wed Apr 19 2023 - 09:11:06 EST

Next message: Greg KH: "Re: [PATCH v3] usb: typec: intel_pmc_mux: Expose IOM port status to debugfs"
Previous message: Jens Axboe: "Re: [PATCH] splice: Fix filemap of a blockdev"
In reply to: Aaron Lu: "Re: [RFC PATCH v8] sched: Fix performance regression introduced by mm_cid"
Next in thread: Aaron Lu: "Re: [RFC PATCH v8] sched: Fix performance regression introduced by mm_cid"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2023-04-19 04:06, Aaron Lu wrote:

On Tue, Apr 18, 2023 at 12:01:09PM -0400, Mathieu Desnoyers wrote:

On 2023-04-18 07:21, Aaron Lu wrote:

On Mon, Apr 17, 2023 at 11:08:31AM -0400, Mathieu Desnoyers wrote:

Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.

For postgres_sysbench on SPR:
sched_mm_cid_migrate_to() is in the range of 0.1x% - 0.4x%, mm_cid_get()
is in the range of 0.1x% - 0.3x%. Other cid functions are pretty minor.

For hackbench on SPR:
ched_mm_cid_migrate_to() is about 3%-4%, mm_cid_get() is about 7%-8%,
other cid functions are pretty minor.

It's a bit higher than I would have expected for hackbench.

Can you run with the attached schedstats patch applied and boot
with schedstats=enable ? Let me know how the counters behave please,
because I cannot reproduce anything unexpected on my machine.

Only event that stood out is mm_cid_get_cached, other events are very
few in a 5s window. I think these numbers are expected and indicate the
optimization worked well. Please see attachment for details, the number
are got by doing the following after 20s the test was started:
touch schedstat
cat /proc/schedstat >> schedstat
sleep 5
cat /proc/schedstat >> schedstat

However, mm_cid_get() still shows as having 7.x% in perf cycles.
Annotate shows the cycles are mostly spent on accessing mm->pcpu_cid:

│
│ lockdep_assert_rq_held(rq);
│ cpumask = mm_cidmask(mm);
│ pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
64.50 │ mov 0x60(%r14),%r12
│ struct rq *rq = this_rq();
9.80 │ mov %rbx,-0x30(%rbp)
│ pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
0.11 │ → call debug_smp_processor_id

^ this call to debug_smp_processor_id is odd. Do you happen to have CONFIG_DEBUG_PREEMPT=y
in your kernel config ? This likely adds unwelcome overhead in those benchmarks.

│ mov %eax,%ebx
│ cmp $0x1fff,%eax

│ raw_spin_unlock(&cid_lock);
│ mov $0xffffffff8474ca84,%rdi
│ WRITE_ONCE(use_cid_lock, 0);
│ movl $0x0,use_cid_lock
│ raw_spin_unlock(&cid_lock);
│ → call _raw_spin_unlock
│ ↑ jmp fe
│ arch_static_branch():
0.02 │236: xchg %ax,%ax
│ mm_cid_snapshot_time():
│ struct rq *rq = this_rq();
│238: → call debug_smp_processor_id
│ mov $0x33340,%r13
│ mov %eax,%ebx
│ cmp $0x1fff,%eax
│ ↓ ja 2b3
│24d: add -0x7d5ce500(,%rbx,8),%r13
│ pcpu_cid = this_cpu_ptr(mm->pcpu_cid);
21.23 │ mov 0x60(%r14),%rbx
4.11 │ → call debug_smp_processor_id
│ mov %eax,%r12d
│ cmp $0x1fff,%eax
│ ↓ ja 2c4
│ WRITE_ONCE(pcpu_cid->time, rq->clock);

Initially I thought it might be due to some false sharing between
pcpu_cid field and other fields of mm_struct, but after I placed pcpu_cid
into a separate cacheline, it still didn't make a difference.

For sched_mm_cid_migrate_to(), its cycle percent is 2.97% and most cycles
are also spent on accessing mm->pcpu_cid according to perf annotate:

│ dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
64.22 │ mov 0x60(%r15),%r13
9.53 │ movslq 0xaf0(%rdi),%r14

No idea why accesing mm->pcpu_cid is so expensive, surely after
initial allocation, nobody would change its value? Without a write
side, I don't see how this can happen...

I'm also considering using __this_cpu* operations, such as this on mm_cid_get():

static inline int mm_cid_get(struct mm_struct *mm)
{
struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
struct rq *rq = this_rq();
struct cpumask *cpumask;
int cid;

lockdep_assert_rq_held(rq);
cpumask = mm_cidmask(mm);
cid = __this_cpu_read(pcpu_cid->cid);
if (mm_cid_is_valid(cid)) {
schedstat_inc(rq->mm_cid_get_cached);
mm_cid_snapshot_time(mm);
return cid;
}
if (mm_cid_is_lazy_put(cid)) {
if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET)) {
schedstat_inc(rq->mm_cid_get_put_lazy);
__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
}
}
cid = __mm_cid_get(mm);
schedstat_inc(rq->mm_cid_get_alloc);
__this_cpu_write(pcpu_cid->cid, cid);
return cid;
}

Would this help ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Next message: Greg KH: "Re: [PATCH v3] usb: typec: intel_pmc_mux: Expose IOM port status to debugfs"
Previous message: Jens Axboe: "Re: [PATCH] splice: Fix filemap of a blockdev"
In reply to: Aaron Lu: "Re: [RFC PATCH v8] sched: Fix performance regression introduced by mm_cid"
Next in thread: Aaron Lu: "Re: [RFC PATCH v8] sched: Fix performance regression introduced by mm_cid"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]