Ingo Molnar wrote:
PREEMPT_RT on SMP systems triggered weird (very high) load average
values rather easily, which turned out to be a mainline kernel
->nr_uninterruptible handling bug in try_to_wake_up().
The following code:

	if (old_state == TASK_UNINTERRUPTIBLE) {
		old_rq->nr_uninterruptible--;

can execute with old_rq != rq, and hence update
->nr_uninterruptible without the lock held. Given a
sufficiently concurrent preemption workload the count can get out of
whack and updates might get lost, permanently skewing the global count.
Nothing except the load average uses nr_uninterruptible(), so this
condition can go unnoticed quite easily.
Hi Ingo,
Yes you're right.
I have another idea: revert to the old code, then just transfer
the nr_uninterruptible count when migrating a task. That way, the
rq's nr_uninterruptible field is always a measure of the number of
uninterruptible tasks on it. What do you think?
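
Roughly what I have in mind, as a userspace sketch rather than a patch. The struct layouts and the helper names (sleep_uninterruptible, wake_task, migrate_task) are made up for illustration and are not the scheduler's real interfaces. Locking is only described in comments: the assumption is that each helper runs with the lock of every runqueue it modifies already held.

```c
#include <assert.h>

/* Hypothetical, heavily simplified stand-ins for the scheduler's types. */
enum task_state { TASK_RUNNING, TASK_UNINTERRUPTIBLE };

struct rq {
	long nr_uninterruptible;	/* uninterruptible tasks accounted to this rq */
};

struct task {
	enum task_state state;
	struct rq *rq;			/* runqueue the task is accounted to */
};

/* Going to sleep: assumed to run with p->rq's lock held. */
static void sleep_uninterruptible(struct task *p)
{
	p->state = TASK_UNINTERRUPTIBLE;
	p->rq->nr_uninterruptible++;
}

/*
 * Waking up: only the task's current runqueue is touched, so only that
 * runqueue's lock (assumed held) is needed.
 */
static void wake_task(struct task *p)
{
	if (p->state == TASK_UNINTERRUPTIBLE)
		p->rq->nr_uninterruptible--;
	p->state = TASK_RUNNING;
}

/*
 * Migration: assumed to run with both runqueue locks held, so moving
 * the count from the source to the destination cannot race.
 */
static void migrate_task(struct task *p, struct rq *dest)
{
	struct rq *src = p->rq;

	if (p->state == TASK_UNINTERRUPTIBLE) {
		assert(src->nr_uninterruptible > 0);
		src->nr_uninterruptible--;
		dest->nr_uninterruptible++;
	}
	p->rq = dest;
}
```

With this, a task that sleeps on one runqueue, gets migrated, and is then woken decrements the counter of the runqueue it is actually on, so each per-rq count stays non-negative and the wakeup path never has to touch a remote runqueue.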