Re: 2.6.35-rc3: System unresponsive under load

From: Manfred Spraul
Date: Wed Jun 30 2010 - 15:07:17 EST

Hi Luca,

On 06/26/2010 06:47 PM, Luca Tettamanti wrote:

> Confirmed here: your test program freezes the system for a while under
> 2.6.35-rc3, while vanilla 2.6.34 copes fine.
> sysrq-t was responsive during the freeze, so I took a snapshot during
> it, file is attached.

Ignore my test program: if the master thread is interrupted in the
right place, there are 400 runnable tasks in the runqueue. It seems
that the scheduler just processes these 400 tasks first instead of
keventd/ksoftirqd, which are necessary for the keyboard handling.

Attached is a new idea. Could you try it with your httpd test?

Perhaps the race is actually a race in user space:
The exit path of semtimedop() does not contain an explicit memory barrier.
For the kernel this does not matter, since it merely reads one integer value.
But if sysret is not a memory barrier either, then user space might observe
stale data.
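
To make the user-space assumption concrete, here is a minimal sketch of the
kind of program that relies on semop() acting as a memory barrier. It is not
Luca's httpd test and not my earlier test program; the names sem_change()
and run_counter_demo() are made up for illustration, and it assumes Linux
with SysV IPC available. Several processes increment a plain, non-atomic
counter in shared memory and use one semaphore as a mutex. The result is only
correct if every process sees the previous owner's store once its own semop()
returns.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* On Linux/glibc the caller has to define union semun itself. */
union semun {
	int val;
	struct semid_ds *buf;
	unsigned short *array;
};

/* Add delta to semaphore 0; blocks while the result would be negative. */
static int sem_change(int semid, int delta)
{
	struct sembuf op = { .sem_num = 0, .sem_op = (short)delta, .sem_flg = 0 };

	return semop(semid, &op, 1);
}

/*
 * Hypothetical demo: nproc children each bump a shared counter
 * `iterations` times, using the semaphore as a mutex.
 * Returns the final counter value, or -1 if the setup failed.
 */
static long run_counter_demo(int nproc, int iterations)
{
	union semun arg = { .val = 1 };	/* semaphore starts "unlocked" */
	long *counter, result;
	int semid, i, j;

	counter = mmap(NULL, sizeof(*counter), PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (counter == MAP_FAILED)
		return -1;
	*counter = 0;

	semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	if (semid < 0) {
		munmap(counter, sizeof(*counter));
		return -1;
	}
	if (semctl(semid, 0, SETVAL, arg) < 0) {
		semctl(semid, 0, IPC_RMID);
		munmap(counter, sizeof(*counter));
		return -1;
	}

	for (i = 0; i < nproc; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			for (j = 0; j < iterations; j++) {
				if (sem_change(semid, -1) < 0)
					_exit(1);
				/* Plain load/store: correctness depends on
				 * semop() ordering it against other owners. */
				(*counter)++;
				if (sem_change(semid, 1) < 0)
					_exit(1);
			}
			_exit(0);
		}
	}

	while (wait(NULL) > 0)
		;	/* reap all children */

	result = *counter;
	semctl(semid, 0, IPC_RMID);
	munmap(counter, sizeof(*counter));
	return result;
}
```

Calling run_counter_demo(4, 1000) should return 4000 on a correct kernel. A
wrong result would only hint at a problem, since a missing barrier may never
show up on a given CPU.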

Which cpu do you have? I was unable to show any misbehavior on a Phenom X4.

From: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>

The last change to improve the scalability moved the actual wake-up out of
the section that is protected by spin_lock(sma->sem_perm.lock).

This means that IN_WAKEUP can be in queue.status even when the spinlock is
acquired by the current task. Thus the same loop that is performed when
queue.status is read without the spinlock acquired must be performed when
the spinlock is acquired.
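
As an illustration only, the same wait-for-the-final-value loop can be
sketched in user space with C11 atomics and pthreads. This is not the kernel
code: struct wait_entry, get_result(), waker() and demo_wakeup() are
hypothetical names, the numeric values of IN_WAKEUP and WAITING are chosen
for the sketch, and sched_yield() merely stands in for cpu_relax().

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define IN_WAKEUP	1	/* transient marker, value assumed for this sketch */
#define WAITING		(-4)	/* stand-in for the initial -EINTR status */

struct wait_entry {
	atomic_int status;
};

/* Spin until the transient IN_WAKEUP marker is replaced by the final value. */
static int get_result(struct wait_entry *q)
{
	int status = atomic_load_explicit(&q->status, memory_order_acquire);

	while (status == IN_WAKEUP) {
		sched_yield();
		status = atomic_load_explicit(&q->status, memory_order_acquire);
	}
	return status;
}

/* The waker publishes the marker first and the final result (0) later. */
static void *waker(void *arg)
{
	struct wait_entry *q = arg;

	atomic_store_explicit(&q->status, IN_WAKEUP, memory_order_release);
	sched_yield();	/* window in which a reader can observe IN_WAKEUP */
	atomic_store_explicit(&q->status, 0, memory_order_release);
	return NULL;
}

/* Returns the final status seen by the reader, or -1 if no thread started. */
static int demo_wakeup(void)
{
	struct wait_entry q;
	pthread_t thread;
	int status;

	atomic_init(&q.status, WAITING);
	if (pthread_create(&thread, NULL, waker, &q) != 0)
		return -1;

	/* Wait until the waker has started to publish a result. */
	while (atomic_load_explicit(&q.status, memory_order_acquire) == WAITING)
		sched_yield();

	status = get_result(&q);
	pthread_join(thread, NULL);
	return status;
}
```

A reader that took q.status without the loop could return IN_WAKEUP as if it
were a result code. With the loop, demo_wakeup() only ever returns the final
value.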

In addition, user space may assume that semtimedop() is a memory
barrier(). Thus add a smp_mb() into the lockless return path - otherwise
the code would return after acquiring a semaphore without a memory
barrier.

Thanks to kamezawa.hiroyu@xxxxxxxxxxxxxx for noticing lack of the memory
barrier.

[akpm@xxxxxxxxxxxxxxxxxxxx: clean up kerneldoc, checkpatch warning and whitespace]

Signed-off-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Reported-by: Luca Tettamanti <>
Reported-by: Christoph Lameter <cl@xxxxxxxxxxxxxxxxxxxx>
Cc: Maciej Rutecki <maciej.rutecki@xxxxxxxxx>
diff --git a/ipc/sem.c b/ipc/sem.c
index 506c849..40a8f46 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -1256,6 +1256,33 @@ out:
 	return un;
 }
 
+
+/**
+ * get_queue_result - Retrieve the result code from sem_queue
+ * @q: Pointer to queue structure
+ *
+ * Retrieve the return code from the pending queue. If IN_WAKEUP is found in
+ * q->status, then we must loop until the value is replaced with the final
+ * value: This may happen if a task is woken up by an unrelated event (e.g.
+ * signal) and in parallel the task is woken up by another task because it got
+ * the requested semaphores.
+ *
+ * The function can be called with or without holding the semaphore spinlock.
+ */
+static int get_queue_result(struct sem_queue *q)
+{
+	int error;
+
+	error = q->status;
+	while (unlikely(error == IN_WAKEUP)) {
+		cpu_relax();
+		error = q->status;
+	}
+
+	return error;
+}
+
+
 SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		unsigned, nsops, const struct timespec __user *, timeout)
 {
@@ -1409,15 +1436,18 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,

-	error = queue.status;
-	while(unlikely(error == IN_WAKEUP)) {
-		cpu_relax();
-		error = queue.status;
-	}
+	error = get_queue_result(&queue);
 
 	if (error != -EINTR) {
 		/* fast path: update_queue already obtained all requested
-		 * resources */
+		 * resources.
+		 * Perform a smp_mb(): User space could assume that semop()
+		 * is a memory barrier: Without the mb(), the cpu could
+		 * speculatively read in user space stale data that was
+		 * overwritten by the previous owner of the semaphore.
+		 */
+		smp_mb();
+
 		goto out_free;
 	}

@@ -1427,10 +1457,12 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops,
 		goto out_free;
 	}
 
+	error = get_queue_result(&queue);
+
 	/*
 	 * If queue.status != -EINTR we are woken up by another process
 	 */
-	error = queue.status;
+
 	if (error != -EINTR) {
 		goto out_unlock_free;
 	}