sem_lock() vs qspinlocks

From: Davidlohr Bueso
Date: Fri May 20 2016 - 01:39:51 EST


Hi,

Giovanni ran into a pretty reproducible situation in which the libmicro benchmark[1]
shows a functional regression in sysv semaphores on upstream kernels. Specifically,
the 'cascade_cond' and 'cascade_flock' programs hang in libc's semop(), blocked
waiting for zero. Alternatively, one of the following splats may appear:

[ 692.991258] BUG: unable to handle kernel NULL pointer dereference (null)
[ 692.992062] IP: [<ffffffff812a0a9f>] unmerge_queues+0x2f/0x70
[ 692.992062] PGD 862fab067 PUD 858bbc067 PMD 0
[ 692.992062] Oops: 0000 [#1] SMP
[ 692.992062] Modules linked in: ...
[ 692.992062] CPU: 18 PID: 7398 Comm: cascade_flock Tainted: G E 4.6.0-juancho2-default+ #18
[ 692.992062] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0030.R03.1405061547 05/06/2014
[ 692.992062] task: ffff88084a7e9640 ti: ffff880854748000 task.ti: ffff880854748000
[ 692.992062] RIP: 0010:[<ffffffff812a0a9f>] [<ffffffff812a0a9f>] unmerge_queues+0x2f/0x70
[ 692.992062] RSP: 0018:ffff88085474bce8 EFLAGS: 00010216
[ 692.992062] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88086cc3d0d0
[ 692.992062] RDX: ffff88086cc3d0d0 RSI: ffff88086cc3d0d0 RDI: ffff88086cc3d040
[ 692.992062] RBP: ffff88085474bce8 R08: 0000000000000007 R09: ffff88086cc3d088
[ 692.992062] R10: 0000000000000000 R11: 000000a1597ea64c R12: ffff88085474bd90
[ 692.992062] R13: ffff88086cc3d040 R14: 0000000000000000 R15: 00000000ffffffff
[ 692.992062] FS: 00007faa46a2d700(0000) GS:ffff88086e500000(0000) knlGS:0000000000000000
[ 692.992062] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 692.992062] CR2: 0000000000000000 CR3: 0000000862faa000 CR4: 00000000001406e0
[ 692.992062] Stack:
[ 692.992062] ffff88085474bf38 ffffffff812a2ac3 ffffffff810c3995 ffff88084a7e9640
[ 692.992062] 0000000000000000 ffffffff81cb1f48 0000000000000002 ffff880400038000
[ 692.992062] ffff88084a7e9640 ffff88085474bd40 fffffffffffffffc ffff88085474bd40
[ 692.992062] Call Trace:
[ 692.992062] [<ffffffff812a2ac3>] SYSC_semtimedop+0x833/0xc00
[ 692.992062] [<ffffffff810c3995>] ? __wake_up_common+0x55/0x90
[ 692.992062] [<ffffffff811f25c0>] ? kmem_cache_alloc+0x1e0/0x200
[ 692.992062] [<ffffffff81266b8b>] ? locks_alloc_lock+0x1b/0x70
[ 692.992062] [<ffffffff81266f33>] ? locks_insert_lock_ctx+0x93/0xa0
[ 692.992062] [<ffffffff81268594>] ? flock_lock_inode+0xf4/0x220
[ 692.992062] [<ffffffff81269cd7>] ? locks_lock_inode_wait+0x47/0x160
[ 692.992062] [<ffffffff811f25c0>] ? kmem_cache_alloc+0x1e0/0x200
[ 692.992062] [<ffffffff81266b8b>] ? locks_alloc_lock+0x1b/0x70
[ 692.992062] [<ffffffff81266d0f>] ? locks_free_lock+0x4f/0x60
[ 692.992062] [<ffffffff812a3340>] SyS_semop+0x10/0x20
[ 692.992062] [<ffffffff81639c32>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[ 692.992062] Code: 00 55 8b 47 7c 48 89 e5 85 c0 75 53 48 8b 4f 48 4c 8d 4f 48 4c 39 c9 48 8b 11 48 89 ce 75 08 eb 36 48 89 d1 48 89 c2 48 8b 41 28 <0f> b7 00 48 c1 e0 06 48 03 47 40 4c 8b 40 18 48 89 70 18 48 83
[ 692.992062] RIP [<ffffffff812a0a9f>] unmerge_queues+0x2f/0x70
[ 692.992062] RSP <ffff88085474bce8>
[ 692.992062] CR2: 0000000000000000
[ 693.882179] ---[ end trace 5605f108ab79cdb2 ]---

Or,

[ 463.567641] BUG: unable to handle kernel paging request at fffffffffffffffa
[ 463.576246] IP: [<ffffffff8126dcbf>] perform_atomic_semop.isra.5+0xcf/0x170
[ 463.584553] PGD 1c0d067 PUD 1c0f067 PMD 0
[ 463.590071] Oops: 0000 [#1] SMP
[ 463.594667] Modules linked in: ...
[ 463.664710] Supported: Yes
[ 463.668682] CPU: 6 PID: 2912 Comm: cascade_cond Not tainted 4.4.3-29-default #1
[ 463.677230] Hardware name: SGI.COM C2112-4GP3/X10DRT-P, BIOS 1.0b 04/07/2015
[ 463.685588] task: ffff88105dba0b40 ti: ffff8808fc7e0000 task.ti: ffff8808fc7e0000
[ 463.694366] RIP: 0010:[<ffffffff8126dcbf>] [<ffffffff8126dcbf>] perform_atomic_semop.isra.5+0xcf/0x170
[ 463.705084] RSP: 0018:ffff8808fc7e3c60 EFLAGS: 00010217
[ 463.711610] RAX: 0000000000000000 RBX: ffff88085d22dae0 RCX: 000000005732f1e7
[ 463.719952] RDX: fffffffffffffffa RSI: ffff88085d22dad0 RDI: ffff88085d22da80
[ 463.728270] RBP: 0000000000000000 R08: 00000000fffffff7 R09: 0000000000000000
[ 463.736561] R10: 0000000000000000 R11: 0000000000000206 R12: ffff88085d22da88
[ 463.745001] R13: ffff88085d22dad0 R14: ffffffffffffffc0 R15: ffff88085d22da40
[ 463.753450] FS: 00007f30fd9e5700(0000) GS:ffff88085fac0000(0000) knlGS:0000000000000000
[ 463.762684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 463.769731] CR2: fffffffffffffffa CR3: 000000017bc09000 CR4: 00000000001406e0
[ 463.778039] Stack:
[ 463.781130] ffff8808fc7e3d50 ffff88085d22dad0 ffffffff8126dfe1 ffffffff4d226800
[ 463.789704] ffff88085d22da80 0000000000000001 ffff88085d22da88 0000000000000001
[ 463.798254] 0000000000000001 ffff88085d22da40 ffff8808fc7e3d50 ffff8808fc7e3da0
[ 463.806758] Call Trace:
[ 463.810305] [<ffffffff8126dfe1>] update_queue+0xa1/0x180
[ 463.816706] [<ffffffff8126eb95>] do_smart_update+0x45/0xf0
[ 463.823276] [<ffffffff8126f141>] SYSC_semtimedop+0x3d1/0xb00
[ 463.830035] [<ffffffff815cf66e>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 463.838608] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x71
[ 463.846331]
[ 463.848747] Leftover inexact backtrace:
[ 463.848747]
[ 463.855853] Code: 80 00 00 81 f9 ff ff 00 00 0f 87 98 00 00 00 66 45 89 0b 48 83 c2 06 44 89 00 48 39 ea 72 99 48 83 ea 06 8b 4e 20 49 39 d2 77 16 <0f> b7 02 48 83 ea 06 48 c1 e0 06 48 03 07 49 39 d2 89 48 04 76
[ 463.877668] RIP [<ffffffff8126dcbf>] perform_atomic_semop.isra.5+0xcf/0x170
[ 463.885725] RSP <ffff8808fc7e3c60>
[ 463.890145] CR2: fffffffffffffffa
[ 463.894338] ---[ end trace 0b29cae12f0e401c ]---

From both splats I've reached the same conclusion: the pending operations array
(sop, struct sembuf) is getting corrupted. Taking the second splat, in perform_atomic_semop():

2b:* 0f b7 02 movzwl (%rdx),%eax <-- trapping instruction

sma->sem_base[sop->sem_num].sempid = pid;

2e: 48 83 ea 06 sub $0x6,%rdx
32: 48 c1 e0 06 shl $0x6,%rax
36: 48 03 07 add (%rdi),%rax
39: 49 39 d2 cmp %rdx,%r10
3c: 89 48 04 mov %ecx,0x4(%rax)
3f: 76 .byte 0x76
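
FWIW the strides in that dump are consistent with the sembuf walk: the $0x6
adjustments of %rdx are exactly one struct sembuf (three shorts), and the
shl $0x6 looks like sem_num scaled by sizeof(struct sem), which is cacheline
aligned. A trivial userspace check of the former (nothing below is from the
report itself, just arithmetic):

#include <assert.h>
#include <stdio.h>
#include <sys/sem.h>	/* userspace struct sembuf: sem_num, sem_op, sem_flg */

int main(void)
{
	/* the $0x6 steps of %rdx above advance/rewind one entry of this */
	static_assert(sizeof(struct sembuf) == 6, "three shorts, no padding");
	printf("sizeof(struct sembuf) = %zu\n", sizeof(struct sembuf));
	return 0;
}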

The benchmark mainly issues a fairly even mix of simple (nsops == 1) and complex
(nsops > 1) semop() calls through libc, all acting on a single semaphore set, ie:

semop(semid: 884736, tsops: 0x7fffd1567bc0, nsops: 1);
semop(semid: 884736, tsops: 0x7fffd1567bc0, nsops: 2);
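
To make the shape of that concrete, here is a rough standalone sketch (this is
not the libmicro code, just something of the same form: a few threads hammering
one set with a mix of single-op and two-op semop() calls; all names and
constants below are made up, errors are deliberately ignored, and IPC_NOWAIT
keeps the toy from blocking the way the real benchmark does):

#include <pthread.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define NSEMS		8
#define LOOPS		100000
#define NTHREADS	4

static int semid;

static void *worker(void *arg)
{
	long simple = (long)arg & 1;	/* odd threads: nsops == 1, even: nsops == 2 */
	struct sembuf ops[2];

	for (int i = 0; i < LOOPS; i++) {
		if (simple) {
			/* single op -> sem_lock() takes only the per-semaphore lock */
			ops[0] = (struct sembuf){ .sem_num = i % NSEMS, .sem_op = 1 };
			semop(semid, ops, 1);
			ops[0].sem_op = -1;
			ops[0].sem_flg = IPC_NOWAIT;
			semop(semid, ops, 1);
		} else {
			/* two ops -> sem_lock() takes the global sem_perm lock */
			ops[0] = (struct sembuf){ .sem_num = i % NSEMS, .sem_op = 1 };
			ops[1] = (struct sembuf){ .sem_num = (i + 1) % NSEMS, .sem_op = 1 };
			semop(semid, ops, 2);
			ops[0].sem_op = ops[1].sem_op = -1;
			ops[0].sem_flg = ops[1].sem_flg = IPC_NOWAIT;
			semop(semid, ops, 2);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	semid = semget(IPC_PRIVATE, NSEMS, IPC_CREAT | 0600);
	if (semid < 0) {
		perror("semget");
		return 1;
	}
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	return semctl(semid, 0, IPC_RMID) ? 1 : 0;
}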

Given that this suggests a broken sem_lock(), I gave -next's c4b7bb08c295 (ipc/sem.c: Fix
complex_count vs. simple op race) a try, but unfortunately it does not fix the issue. In
fact the regression bisects to the introduction of qspinlocks (as of v4.2), with:

1bf7067 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Considering how sem_lock() plays games with spin_is_locked() and spin_unlock_wait() to
enable finer grained locking (per semaphore, as opposed to the whole set), the regression
seems to come from the fact that spin_unlock_wait() with qspinlocks only checks the
least-significant byte, therefore ignoring pending waiters. Fair enough, given that a
plain store to that byte is all it takes to release the lock. However, this is
semantically different from what ticket locks did, where a waiter always makes itself
visible by adding itself to the tail and is thus observed by spin_unlock_wait(). For
sysvsems this means that sem_wait_array(), run while a thread is acquiring the global
lock, can miss a pending waiter on a sem->lock: the global locker iterates past that
particular lock in the set, and shortly thereafter the pending waiter takes the
already-iterated semaphore lock, breaking the exclusion the nested locking scheme
relies on.
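
To make that last point concrete outside the kernel, here is a tiny standalone
model (not kernel code; the two constants just mirror the qspinlock word layout,
locked byte in bits 0-7 and the pending bit at bit 8) of what an unlock_wait
that only watches the locked byte concludes, versus a view that keeps waiting
while anyone is still queued:

#include <stdio.h>

#define LOCKED_MASK	0x000000ffU	/* bits 0-7: locked byte */
#define PENDING_BIT	0x00000100U	/* bit 8: a contender announced itself */

int main(void)
{
	/* CPU0 holds sem->lock, CPU1 already spins with the pending bit set */
	unsigned int val = 0x01 | PENDING_BIT;

	/* CPU0 unlocks: only the locked byte is cleared */
	val &= ~LOCKED_MASK;

	/* watching only the locked byte, unlock_wait returns here and
	 * sem_wait_array() moves on to the next semaphore in the set... */
	printf("locked-byte view: %s\n",
	       (val & LOCKED_MASK) ? "still locked" : "free, move on");

	/* ...while CPU1 is still there and grabs sem->lock an instant later */
	printf("whole-word view:  %s\n",
	       val ? "waiter pending, keep waiting" : "quiescent");

	return 0;
}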

As such, the following hack restores the ticket-lock behavior, by open-coding a spin on
the whole qspinlock word, and 'fixes' (or hides?) the bug in sems. Naturally not a
correct approach:

@@ -290,7 +290,8 @@ static void sem_wait_array(struct sem_array *sma)
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
-		spin_unlock_wait(&sem->lock);
+		while (atomic_read(&sem->lock.rlock.raw_lock.val))
+			cpu_relax();
 	}
 	ipc_smp_acquire__after_spin_is_unlocked();
 }

While the differences between the two lock implementations wrt unlock_wait() are certainly
there, ultimately I agree that callers should not rely on the semantics of spinlock waiters
-- and therefore sems need fixing. Note that lockref is of course the exception to this,
in how queued_spin_value_unlocked() is implemented. In addition, this makes me wonder
whether queued_spin_is_locked() should then be:

- return atomic_read(&lock->val);
+ return atomic_read(&lock->val) & _Q_LOCKED_MASK;

And avoid considering pending waiters as locked.
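
For reference, this is the part of the value layout the mask would distinguish
(per the qspinlock_types.h comment, NR_CPUS < 16K):

/*
 *  0- 7: locked byte
 *     8: pending
 * 16-17: tail index
 * 18-31: tail cpu (+1)
 */
#define _Q_LOCKED_MASK		0x000000ffU
#define _Q_PENDING_MASK		0x00000100U
#define _Q_TAIL_MASK		0xffff0000U

ie. with the mask, the transient state where the previous owner has already
released the lock but a pending/queued contender has not yet taken it (say
val == _Q_PENDING_MASK) would no longer read as locked.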

Thoughts?

Thanks,
Davidlohr

[1] https://hg.java.net/hg/libmicro~hg-repo