Re: xfs problems (possibly after upgrading from linux kernel 2.6.27.10 to .14)

From: Dave Chinner
Date: Thu Feb 19 2009 - 01:19:41 EST


On Wed, Feb 18, 2009 at 10:36:59AM +0100, Carsten Aulbert wrote:
> Dave Chinner schrieb:
> > On Tue, Feb 17, 2009 at 03:49:16PM +0100, Carsten Aulbert wrote:
> >> Do you need more information or can I send these nodes into a re-install?
> >
> > More information. Can you get a machine into a state where you can
> > trigger this condition reproducibly by doing:
> >
> > mount filesystem
> > touch /mnt/filesystem/some_new_file
> >
> > If you can get it to that state, and you can provide an xfs_metadump
> > image of the filesystem when in that state, I can track down the
> > problem and fix it.
>
> I can try doing that on a few machines, would a metadump help on a
> machine where this corruption occurred some time ago and is still in
> this state?

If you unmount the filesystem, mount it again and then touch a new
file and it reports the error again, then yes, a metadump would be
great.

If the error doesn't show up after an unmount/mount, then I
can't use a metadump image to reproduce the problem.
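
The check described above can be sketched as a small shell script. The
device, mount point and dump path are assumptions (only "sda6" appears in
the logs), and the XFS_REPRO_RUN switch is made up for this sketch: unless
it is set, the commands are only printed, not executed.

```shell
#!/bin/sh
# Assumed names, adjust to the affected node.
DEV=/dev/sda6
MNT=/mnt/filesystem
DUMP=/tmp/sda6.metadump

# Dry run by default: print each command unless XFS_REPRO_RUN is set.
run() {
    if [ -n "${XFS_REPRO_RUN:-}" ]; then "$@"; else echo "+ $*"; fi
}

# Unmount, mount again, create a new file, then look at the kernel log.
repro() {
    run umount "$MNT"
    run mount "$DEV" "$MNT"
    run touch "$MNT/some_new_file"
    run sh -c 'dmesg | tail -n 50 | grep -i xfs'
}

# Only worth doing if repro shows the error again.
# xfs_metadump should be pointed at an unmounted (or read-only) filesystem.
capture() {
    run umount "$MNT"
    run xfs_metadump "$DEV" "$DUMP"
}

repro
capture
```

If the touch after a fresh mount does not bring the error back, skip the
capture step, since the image would not reproduce the problem.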

> >> Feb 16 22:01:28 n0260 kernel: [1129250.851451] Filesystem "sda6": xfs_iflush: Bad inode 1176564060 magic number 0x36b5, ptr 0xffff8801a7c06c00
> >
> > However, this implies some kind of memory corruption is occurring.
> > That is reading the inode out of the buffer before flushing the
> > in-memory state to disk. This implies someone has scribbled over
> > page cache pages.
> >
> >
> >> Feb 17 05:57:44 n0463 kernel: [1156816.912129] Filesystem "sda6": XFS internal error xfs_btree_check_sblock at line 307 of file fs/xfs/xfs_btree.c. Caller 0xffffffff802dd15b
> >
> > And that is another buffer that has been scribbled over.
> > Something is corrupting the page cache, I think. Whether the
> > original shutdown is caused by the same corruption, I don't
> > know.
> >
>
> At least on two nodes we ran memtest86+ overnight and so far no error.

I don't think it is hardware related.
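
For reference, the xfs_iflush message quoted above can be checked
mechanically: a valid on-disk XFS inode carries the magic 0x494e (ASCII
"IN"), so 0x36b5 is not a plausible inode header. A minimal sketch; the
function name bad_inode_magic is made up for illustration:

```shell
#!/bin/sh
# Expected on-disk inode magic in XFS: 0x494e, i.e. ASCII "IN".
XFS_DINODE_MAGIC=0x494e

# Read kernel log lines on stdin and print "<inode> <magic>" for every
# xfs_iflush "Bad inode" report.
bad_inode_magic() {
    sed -n 's/.*xfs_iflush: Bad inode \([0-9][0-9]*\) magic number \(0x[0-9a-fA-F]*\).*/\1 \2/p'
}

# Log line taken from the report earlier in this thread.
line='Feb 16 22:01:28 n0260 kernel: [1129250.851451] Filesystem "sda6": xfs_iflush: Bad inode 1176564060 magic number 0x36b5, ptr 0xffff8801a7c06c00'

printf '%s\n' "$line" | bad_inode_magic | while read -r ino magic; do
    if [ $(($magic)) -ne $(($XFS_DINODE_MAGIC)) ]; then
        echo "inode $ino: magic $magic, expected $XFS_DINODE_MAGIC"
    fi
done
```

Feeding a whole log through bad_inode_magic gives a quick list of the
inodes whose buffers were overwritten.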

> >> plus a few more nodes showing the same characteristics
> >
> > Hmmmm. Did this show up in 2.6.27.10? Or did it start occurring only
> > after you upgraded from .10 to .14?
>
> As far as I can see this only happened after the upgrade about 14 days
> ago. What strikes me as odd is that we only had this occurring massively on
> Monday and Tuesday this week.
>
> I don't know if a certain access pattern could trigger this somehow.

I suspect so. We've already had XFS trigger one bug in the new
lockless pagecache code, and the fix for that went in 2.6.27.11 -
between the good version and the version that you've been seeing
these memory corruptions on. I'm wondering if that fix exposed or
introduced another bug that you've hit....

Nick?

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx