[RFC] Potential unnecessary barrier in slow path of rt_mutex

From: Hernan Ponce de Leon
Date: Sun Jan 22 2023 - 10:06:41 EST


Hello,

We have been trying to verify that the rt_mutex patch https://lkml.org/lkml/2022/12/2/279 guarantees the intended acquire semantics and that there are no other potential problems with it. For that, we are using a verification tool as discussed in https://lkml.org/lkml/2022/8/26/597.

The tool reported a data race for which I already submitted a patch in https://lkml.org/lkml/2023/1/20/702.

During the discussion of the rt_mutex patch, Will proposed replacing the explicit barrier in 'mark_rt_mutex_waiters' with 'smp_acquire__after_ctrl_dep()'. We wanted to check if it would be possible to get rid of the barrier instead of weakening it.

While according to LKMM /tools/memory-model/linux-kernel.cat, mutual exclusion is violated if we do this, the verification tool reported that this violation is not possible according to the formal memory models of aarch64, riscv and power. Interestingly, if the data race from above is fixed, mutual exclusion is guaranteed by LKMM even if we fully remove the barrier. The reason for this is that marking the racy access introduces more ordering guarantees.

Another possibility is to keep the barrier, but revert the change WRITE_ONCE() -> xchg_acquire() in 'rt_mutex_set_owner'. Boqun suggested this one might be better because it's weird using xchg_acquire() to get an acquire store.

Either the barrier or the acquire store are needed for the correctness of the algorithm. If we relax both, the tool reports a violation.

I would like to get some feedback about the following.

(1) Since the barriers are in the slow path of the algorithm, is it worth it to try to play it smart or should we play it safe (at the cost of potentially having an extra barrier)?

(2) In which cases were the acquiring semantics missing before the rt_mutex patch? This is not clear for me from the patch and the corresponding discussion. Once I understand this, I might be able to use the rules from LKMM to give a more formal argument of why one of the barriers is not needed (or find an example that shows that indeed both barriers are necessary).

(3) If we agree that one barrier might not necessary, but we still keep both to be safe, would it make sense to add some comment in the code along the lines "Of these N barriers, you can probably get away with the following subset of them, but we leave all in place to be safe"?

The following litmus test (I hope comments are enough to trace back to the real implementation) shows all the above issues in detail and how the changes I propose impact them. It can be run using the herd7 tool http://diy.inria.fr/www/?record=linux#.
The final query asks if P0 can release the lock via the fast path even if P1 set the wait bit. The expected result is "No".

C rt_mutex

{
atomic_t owner = ATOMIC_INIT(0);
}

P0(int *owner, int *x) {
int r0 = 2; // current task
int r1 = atomic_cmpxchg_acquire(owner,0,r0); // rtlock acquire
int r2 = READ_ONCE(*x); // critical section;
WRITE_ONCE(*x, r2 + 1); // marked to avoid data races
int r3 = atomic_cmpxchg_release(owner,r0,0); // rtlock release succeeds
}

P1(int *owner, int *x) {
int r0 = 4; // current task
int r1 = atomic_cmpxchg_acquire(owner,0,r0); // rtlock acquire failed
int r2 = *owner; // mark waiters
int r3 = atomic_cmpxchg_relaxed(owner,r2,r2|1); // mark waiters succeeds
smp_mb(); // mark waiters
int r4 = xchg_acquire(owner,r0); // set owner
int r5 = READ_ONCE(*x); // critical section
}

// 0:r1 = 0: rtlock acquire succeeds
// 0:r3 = 2: rtlock release succeeds
// 1:r1 = 2: rtlock acquire failed
// 1:r3 = 1:r2: mark waiters succeeds
// 1:r5 = 0: critical section violated
exists (0:r1 = 0 /\ 0:r3 = 2 /\ 1:r1 = 2 /\ 1:r3 = 1:r2 /\ 1:r5 = 0)

herd7 reports a race related to 'mark waiters'. Below is the fix I proposed in https://lkml.org/lkml/2023/1/20/702.

P0(int *owner, int *x) {
int r0 = 2;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*x);
WRITE_ONCE(*x, r2 + 1);
int r3 = atomic_cmpxchg_release(owner,r0,0);
}

P1(int *owner, int *x) {
int r0 = 4;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*owner);
int r3 = atomic_cmpxchg_relaxed(owner,r2,r2|1);
smp_mb();
int r4 = xchg_acquire(owner,r0);
int r5 = READ_ONCE(*x);
}

This code is correct (result is "No") and there are no races, but it has an unnecessary barrier. smp_mb() can be removed without affecting the result.

P0(int *owner, int *x) {
int r0 = 2;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*x);
WRITE_ONCE(*x, r2 + 1);
int r3 = atomic_cmpxchg_release(owner,r0,0);
}

P1(int *owner, int *x) {
int r0 = 4;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*owner);
int r3 = atomic_cmpxchg_relaxed(owner,r2,r2|1);
int r4 = xchg_acquire(owner,r0);
int r5 = READ_ONCE(*x);
}

The other possibility is to keep the barrier, but relax the acquire store.

P0(int *owner, int *x) {
int r0 = 2;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*x);
WRITE_ONCE(*x, r2 + 1);
int r3 = atomic_cmpxchg_release(owner,r0,0);
}

P1(int *owner, int *x) {
int r0 = 4;
int r1 = atomic_cmpxchg_acquire(owner,0,r0);
int r2 = READ_ONCE(*owner);
int r3 = atomic_cmpxchg_relaxed(owner,r2,r2|1);
smp_mb();
WRITE_ONCE(*owner,r0);
int r5 = READ_ONCE(*x);
}

Once again the result is "No".

Regards,
Hernan