Then we clearly have another member of mm_struct on the same cache line as
pcpu_cid which is bouncing all over the place and causing false-sharing. Any
idea which field(s) are causing this?
That was my first reaction too, but as I said in an earlier reply:
https://lore.kernel.org/lkml/20230419080606.GA4247@ziqianlu-desk2/
I tried to place pcpu_cid into a dedicated cacheline, with no other
mm_struct fields sharing that cacheline, but it didn't help...
I see two possible culprits there:
1) The mm_struct pcpu_cid field is suffering from false-sharing. I would be
interested to look at your attempt to move it to a separate cache line to
try to figure out what is going on.
Brain damaged... my mistake: I only made sure the fields following it
did not share its cacheline, but forgot to exclude the fields preceding
it, and it turns out it is one (or some) of those preceding fields that
caused the false sharing. When I did:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5eab61156f0e..a6f9d815991c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -606,6 +606,7 @@ struct mm_struct {
*/
atomic_t mm_count;
#ifdef CONFIG_SCHED_MM_CID
+ CACHELINE_PADDING(_pad1_);
/**
* @pcpu_cid: Per-cpu current cid.
*
mm_cid_get() dropped to 0.0x% when running hackbench :-)
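Just to spell out the layout idea, here is a minimal userspace sketch
(the toy struct and field names are made up, only the zero-size aligned
member mimics the kernel's CACHELINE_PADDING() helper): fully isolating
pcpu_cid means one pad before it, so the preceding counters cannot pull
it onto their line, plus one pad after it, which is what my earlier
attempt already covered. The offsets can then be checked at compile
time:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define CACHE_LINE	64
/* stand-in for the kernel's CACHELINE_PADDING(): a zero-size member
 * whose alignment pushes the next field onto a fresh cacheline */
#define PAD(name)	struct { } name __attribute__((aligned(CACHE_LINE)))

struct toy_mm {
	int users;		/* stands in for mm_users */
	int count;		/* stands in for mm_count */
	PAD(_pad1_);		/* keep the counters above off the next line */
	void *pcpu_cid;		/* the hot read-mostly pointer */
	unsigned long cid_next_scan;
	PAD(_pad2_);		/* keep whatever follows off it as well */
	long later_fields;
};

/* neither neighbour may land on pcpu_cid's cacheline */
static_assert(offsetof(struct toy_mm, count) / CACHE_LINE !=
	      offsetof(struct toy_mm, pcpu_cid) / CACHE_LINE, "");
static_assert(offsetof(struct toy_mm, later_fields) / CACHE_LINE !=
	      offsetof(struct toy_mm, pcpu_cid) / CACHE_LINE, "");

int main(void)
{
	printf("count@%zu pcpu_cid@%zu later_fields@%zu\n",
	       offsetof(struct toy_mm, count),
	       offsetof(struct toy_mm, pcpu_cid),
	       offsetof(struct toy_mm, later_fields));
	return 0;
}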
sched_mm_cid_migrate_to() is now about 4%, with most cycles spent on
accessing mm->mm_users:
│ dst_cid = READ_ONCE(dst_pcpu_cid->cid);
0.03 │ mov 0x8(%r12),%r15d
│ if (!mm_cid_is_unset(dst_cid) &&
0.07 │ cmp $0xffffffff,%r15d
│ ↓ je 87
│ arch_atomic_read():
│ {
│ /*
│ * Note for KASAN: we deliberately don't use READ_ONCE_NOCHECK() here,
│ * it's non-inlined function that increases binary size and stack usage.
│ */
│ return __READ_ONCE((v)->counter);
76.13 │ mov 0x54(%r13),%eax
│ sched_mm_cid_migrate_to():
│ cmp %eax,0x410(%rdx)
21.71 │ ↓ jle 1d8
│ atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
With this info, it should be mm_users that previously caused the false
sharing with pcpu_cid. It looks like mm_users itself is bouncing.
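In case it helps to reproduce the effect outside the kernel, below is a
small standalone toy (hypothetical names, nothing kernel-specific): one
thread keeps doing relaxed atomic increments the way mm_users gets
hammered, while another keeps reading a neighbouring read-mostly field
the way the cid paths read pcpu_cid. With the pad in place the two
fields sit on different cachelines; drop the pad and the reader's loads
start bouncing the line back and forth. Build with
gcc -O2 -pthread toy_bounce.c.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define CACHE_LINE	64
#define ITERS		(100UL * 1000 * 1000)

struct toy_shared {
	atomic_int users;	/* frequently written, like mm_users */
	char pad[CACHE_LINE - sizeof(atomic_int)];	/* remove to see the bouncing */
	int cid;		/* read-mostly, like pcpu_cid->cid */
};

static struct toy_shared s __attribute__((aligned(CACHE_LINE)));

static void *writer(void *arg)
{
	(void)arg;
	for (unsigned long i = 0; i < ITERS; i++)
		atomic_fetch_add_explicit(&s.users, 1, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)
{
	(void)arg;
	for (unsigned long i = 0; i < ITERS; i++)
		(void)*(volatile int *)&s.cid;	/* roughly READ_ONCE(s.cid) */
	return NULL;
}

int main(void)
{
	pthread_t w, r;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.2fs\n", (t1.tv_sec - t0.tv_sec) +
			  (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}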