Re: Filesystem kernel hangup, 2.6.3 (bad: scheduling while atomic!)

From: Christoph Hellwig
Date: Mon Feb 23 2004 - 07:21:48 EST


On Sun, Feb 22, 2004 at 04:49:41PM +0100, Mikael Wahlberg wrote:
> Description:
>
> On heavy FTP Load (About 1Gbit/s) running both reads and writes on two ServeRAID6m Raid5 controllers merged together to one filesystem with Raidtools we see the error below. The filesystem gets totally hanged up. Currently with XFS, but JFS gets the same problem (Actually even more often).

What does the JFS oops look like?

> Feb 22 15:00:53 mserv1 kernel: [<c011e54a>] __wake_up_common+0x3a/0x60
> Feb 22 15:00:53 mserv1 kernel: [<c011e5af>] __wake_up+0x3f/0x70

This doesn't make a lot of sense, there's only two mrlocks in XFS, and
they're in the inodes that have well-defined and understood lifetime rules.

OTOH the previos oops might have messed quite a bit up in your system.

did you run memtest86 on the box? do you some strange patches applied or
external modules loaded? What's your .config?

> Feb 22 15:00:54 mserv1 kernel: [<c011b150>] do_page_fault+0x0/0x523
> Feb 22 15:00:54 mserv1 kernel: [<c010baf5>] error_code+0x2d/0x38
> Feb 22 15:00:54 mserv1 kernel: [<c011e5b5>] __wake_up+0x45/0x70
> Feb 22 15:00:54 mserv1 kernel: [<c011e54a>] __wake_up_common+0x3a/0x60
> Feb 22 15:00:55 mserv1 kernel: [<c011e5af>] __wake_up+0x3f/0x70
> Feb 22 15:00:55 mserv1 kernel: [<c0259e62>] mrunlock+0x82/0xb0
> Feb 22 15:00:55 mserv1 kernel: [<c0259b00>] mraccessf+0xc0/0xe0
> Feb 22 15:00:55 mserv1 kernel: [<c023038e>] xfs_iunlock+0x3e/0x80
> Feb 22 15:00:55 mserv1 kernel: [<c023727b>] xfs_iomap+0x3bb/0x540
> Feb 22 15:00:55 mserv1 kernel: [<c0163fc7>] bio_alloc+0xd7/0x1c0
> Feb 22 15:00:55 mserv1 kernel: [<c025a17a>] map_blocks+0x7a/0x170
> Feb 22 15:00:55 mserv1 kernel: [<c025b40b>] page_state_convert+0x52b/0x6d0
> Feb 22 15:00:55 mserv1 kernel: [<c0236cb9>] xfs_imap_to_bmap+0x39/0x240
> Feb 22 15:00:55 mserv1 kernel: [<c025be48>] linvfs_release_page+0xa8/0xb0
> Feb 22 15:00:55 mserv1 kernel: [<c025bce0>] linvfs_writepage+0x60/0x120
> Feb 22 15:00:55 mserv1 kernel: [<c014990c>] shrink_list+0x41c/0x710
> Feb 22 15:00:55 mserv1 kernel: [<c0149df8>] shrink_cache+0x1f8/0x3d0
> Feb 22 15:00:55 mserv1 kernel: [<c01b3a00>] journal_stop+0x220/0x330
> Feb 22 15:00:55 mserv1 kernel: [<c014a6dc>] shrink_zone+0xbc/0xc0
> Feb 22 15:00:55 mserv1 kernel: [<c014a7a5>] shrink_caches+0xc5/0xe0
> Feb 22 15:00:55 mserv1 kernel: [<c014a87c>] try_to_free_pages+0xbc/0x190
> Feb 22 15:00:55 mserv1 kernel: [<c0143043>] __alloc_pages+0x203/0x370
> Feb 22 15:00:55 mserv1 kernel: [<c01431d5>] __get_free_pages+0x25/0x40

Hmm, from the trace it looks like ->release_page was called from a context
where we can't sleep. XFS defintily doesn't handle that, so the question
is whether the kernel should do it.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/