Re: debug rsdl 0.33

From: Andy Whitcroft
Date: Sun Mar 25 2007 - 08:27:52 EST


Con Kolivas wrote:
> On Saturday 24 March 2007 08:45, Con Kolivas wrote:
>> On Friday 23 March 2007 23:28, Andy Whitcroft wrote:
>>> Andy Whitcroft wrote:
>>>> Con Kolivas wrote:
>>>>> On Friday 23 March 2007 05:17, Andy Whitcroft wrote:
>>>>>> Ok, I have yet a third x86_64 machine is is blowing up with the
>>>>>> latest 2.6.21-rc4-mm1+hotfixes+rsdl-0.32 but working with
>>>>>> 2.6.21-rc4-mm1+hotfixes-RSDL. I have results on various hotfix
>>>>>> levels so I have just fired off a set of tests across the affected
>>>>>> machines on that latest hotfix stack plus the RSDL backout and the
>>>>>> results should be in in the next hour or two.
>>>>>>
>>>>>> I think there is a strong correlation between RSDL and these hangs.
>>>>>> Any suggestions as to the next step.
>>>>> Found a nasty in requeue_task
>>>>> + if (list_empty(old_array->queue + old_prio))
>>>>> + __clear_bit(old_prio, p->array->prio_bitmap);
>>>>>
>>>>> see anything wrong there? I do :P
>>>>>
>>>>> I'll queue that up with the other changes pending and hopefully that
>>>>> will fix your bug.
>>>> Tests queued with your rdsl-0.33 patch (I am assuming its in there).
>>>> Will let you know how it looks.
>>> Hmmm, this is good for the original machine (as was 0.32) but not for
>>> either of the other two. I am seeing panics as below on those two.
>> This machine seems most sensitive to it (first column):
>> elm3b6
>> amd64
>> newisys
>> 4cpu
>> config: amd64
>>
>> Can you throw this debugging patch at it please? The console output might
>> be very helpful. On top of sched-rsdl-0.33 thanks!
>
> Better yet this one which checks the expired array as well and after
> pull_task.
>
> If anyone's getting a bug they think might be due to rsdl please try this (on
> rsdl 0.33).

Ok, new round of tests across the sensitive machines with 0.33 plus the
above debug patch are in the queue. Will let you know how they pan out.

The tests with -rc4 + 0.33 are also in. Failing there also. Both out
of __sched_text_start, so I'd guess the same cause and the schedular is
fingered.

-apw
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/