Re: 2.4.23aa1 ext3 oops

From: Jamie Clark
Date: Tue Dec 16 2003 - 10:57:04 EST


Hi Marcelo,

Sorry - I had started another test run after the previous crash (to collect more forensics) and have been waiting for the oops. The last one took 4 or 5 days. Just now I see I have another crash in the same place so I will apply the reversions you suggested and start again.

My apologies - this is slow work. I'm attempting to find a new kernel to keep our busy fileserver happy and I've had a spell of bad luck for the past month or two. I'll be happy when I get two weeks of uptime with a few bonnies running.


Marcelo Tosatti wrote:

Jamie,

Did you try reverting the patch I suggested?

On Thu, 11 Dec 2003, Jamie Clark wrote:

OK, no deadlock yet with 2.4.23aa1; however, it oopsed under ext3_file_write() in __mark_inode_dirty().

Just to recap: this test box is a dual PIII running several bonnie++ loads on an ext3+noatime+quota filesystem mounted off

From the oops, the fault happens on the last instruction of:

movl $0,8(%ebx)
movl $0,4(%edx)
movl 100(%edi),%eax
movl %edx,4(%eax)    <-- here

which appears to be this code in inode.c [line 221+]

if (!(inode->i_state & (I_LOCK|I_FREEING|I_CLEAR)) &&
    !list_empty(&inode->i_hash)) {
        list_del(&inode->i_list);
        list_add(&inode->i_list, &sb->s_dirty);
}

After a quick browse of the assembler output, the zeroing appears to be part of the list_del() inline, and %edi seems to hold sb. If I've read that correctly, the oops happens at the beginning of the list_add() inline, and %eax is the head of the s_dirty list - pointing into oblivion.

__mark_inode_dirty() does not appear to take sb_lock before adding to the s_dirty list. Could that be the culprit? I'm completely unfamiliar with the Linux kernel, so I might be way off here.

-Jamie

Andrea Arcangeli wrote:

On Tue, Nov 04, 2003 at 07:52:40PM +0800, Jamie Clark wrote:

I made the quick fix (disabling rq_mergeable) and started the load test.
Will let it run for a week or so.

Does your recent email mean it deadlocked again even with this disabled?

Could you try 2.4.23aa1 again, just in case?

FYI an observation from my last test: the read latency seems to be much
improved and more consistent under this kernel (2.4.23pre6aa3, before
the oops and before this fix). The maximum latency seemed steady over
the whole test without any of the longish pauses that showed up under
2.4.19. Quite a difference.

nice to hear! thanks.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/