Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

From: Christophe Leroy
Date: Fri Jun 18 2021 - 13:14:20 EST




Le 29/01/2018 à 21:20, Mathieu Desnoyers a écrit :
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but which belong to different thread
groups is a tricky case. It has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Looking at switch_mm_irqs_off(), I found it more complex than expected and found that this patch is the reason for that complexity.

Before the patch (ie in kernel 4.14), we have:

00000000 <switch_mm_irqs_off>:
0: 81 24 01 c8 lwz r9,456(r4)
4: 71 29 00 01 andi. r9,r9,1
8: 40 82 00 1c bne 24 <switch_mm_irqs_off+0x24>
c: 39 24 01 c8 addi r9,r4,456
10: 39 40 00 01 li r10,1
14: 7d 00 48 28 lwarx r8,0,r9
18: 7d 08 53 78 or r8,r8,r10
1c: 7d 00 49 2d stwcx. r8,0,r9
20: 40 c2 ff f4 bne- 14 <switch_mm_irqs_off+0x14>
24: 7c 04 18 40 cmplw r4,r3
28: 81 24 00 24 lwz r9,36(r4)
2c: 91 25 04 4c stw r9,1100(r5)
30: 4d 82 00 20 beqlr
34: 48 00 00 00 b 34 <switch_mm_irqs_off+0x34>
34: R_PPC_REL24 switch_mmu_context


After the patch (ie in 5.13-rc6), that now is:

00000000 <switch_mm_irqs_off>:
0: 81 24 02 18 lwz r9,536(r4)
4: 71 29 00 01 andi. r9,r9,1
8: 41 82 00 24 beq 2c <switch_mm_irqs_off+0x2c>
c: 7c 04 18 40 cmplw r4,r3
10: 81 24 00 24 lwz r9,36(r4)
14: 91 25 04 d0 stw r9,1232(r5)
18: 4d 82 00 20 beqlr
1c: 81 24 00 28 lwz r9,40(r4)
20: 71 29 00 0a andi. r9,r9,10
24: 40 82 00 34 bne 58 <switch_mm_irqs_off+0x58>
28: 48 00 00 00 b 28 <switch_mm_irqs_off+0x28>
28: R_PPC_REL24 switch_mmu_context
2c: 39 24 02 18 addi r9,r4,536
30: 39 40 00 01 li r10,1
34: 7d 00 48 28 lwarx r8,0,r9
38: 7d 08 53 78 or r8,r8,r10
3c: 7d 00 49 2d stwcx. r8,0,r9
40: 40 a2 ff f4 bne 34 <switch_mm_irqs_off+0x34>
44: 7c 04 18 40 cmplw r4,r3
48: 81 24 00 24 lwz r9,36(r4)
4c: 91 25 04 d0 stw r9,1232(r5)
50: 4d 82 00 20 beqlr
54: 48 00 00 00 b 54 <switch_mm_irqs_off+0x54>
54: R_PPC_REL24 switch_mmu_context
58: 2c 03 00 00 cmpwi r3,0
5c: 41 82 ff cc beq 28 <switch_mm_irqs_off+0x28>
60: 48 00 00 00 b 60 <switch_mm_irqs_off+0x60>
60: R_PPC_REL24 switch_mmu_context


Especially, the comparison of 'prev' to 0 is pointless as both cases end up with just branching to 'switch_mmu_context'

I don't understand all that complexity to just replace a simple 'smp_mb__after_unlock_lock()'.

#define smp_mb__after_unlock_lock() smp_mb()
#define smp_mb() barrier()
# define barrier() __asm__ __volatile__("": : :"memory")


Am I missing some subtility ?

Thanks
Christophe