[BUG] File system corruption with 4.4-rc3 and beyond

From: Steven Rostedt
Date: Tue Dec 22 2015 - 19:10:35 EST


OK, I started with 4.4-rc4 to add some urgent ftrace patches and
started testing. My tests started to fail, and then I noticed they
failed with unpatched v4.4-rc4 as well. I got strange errors. Finally, I noticed
that I was constantly getting messages like this:

ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
ata2.00: irq_stat 0x20000000, host bus error
ata2: SError: { HostInt }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata2.00: configured for UDMA/100
ata2: EH complete
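
For anyone reading the log above, the "cmd" lines can be decoded with a
small script. This is only a sketch: it assumes libata's usual taskfile
layout (command/feature:nsect:lbal:lbam:lbah, then the five "hob" high-order
bytes, then device), assumes 512-byte logical sectors, and the name
decode_ncq_cmd is made up for illustration.

```python
import re

# Assumed layout of a libata "cmd" line (all fields are hex bytes):
#   cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HFEAT:HNSECT:HLBAL:HLBAM:HLBAH/DEV
_FIVE = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
CMD_RE = re.compile(r"cmd ([0-9a-f]{2})/" + _FIVE + "/" + _FIVE + r"/([0-9a-f]{2})")


def decode_ncq_cmd(line, sector_size=512):
    """Decode one 'cmd' line for an NCQ (FPDMA QUEUED) command.

    sector_size is an assumption; 512 matches the byte counts in this log.
    """
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("not a libata cmd line: %r" % line)

    command = int(m.group(1), 16)
    feat, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hfeat, _hnsect, hlbal, hlbam, hlbah = (int(b, 16) for b in m.group(3).split(":"))

    # For NCQ commands the sector count travels in the feature registers
    # (a count of 0 means 65536), and the tag sits in bits 7:3 of nsect.
    sectors = ((hfeat << 8) | feat) or 65536
    lba = ((hlbah << 40) | (hlbam << 32) | (hlbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)

    return {
        "command": command,
        "tag": nsect >> 3,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


if __name__ == "__main__":
    first = "ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out"
    print(decode_ncq_cmd(first))
```

Run against the four failed commands above, the decoded tags and byte
counts can be compared with the "tag" and "ncq" values libata printed on
the same lines, which is a quick sanity check on the layout assumption.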


The test box has a relatively new mobo and such, but I know the HD was
old. So I thought that the HD was simply failing. I installed a new HD
and spent lots of time since last Thursday trying to set it up to work
with my testing scripts. Unfortunately, I installed a newer Fedora that
no longer supported the older grub1 and I wasted lots of time trying to
get grub2 to do what I wanted. I finally gave up and used
syslinux/extlinux and got it working again. Unfortunately, I still got
these ata2 errors! I started thinking that the mobo may be bad.

But then I decided to try an older kernel, and the errors never showed
up. I booted back and forth several times and the errors were very
reliable. I have multiple OSes on this box, so every time I got an
error, I would boot into one of the other OSes and run fsck on the
filesystems, because if I ran my tests long enough with this bug, it
would eventually start corrupting the ext4 filesystem.

Since it seemed very reliable, I started my bisect. It came down to this
patch:

From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@xxxxxxxxxxxxx>
Date: Tue, 24 Nov 2015 10:35:29 +0800
Subject: [PATCH] block: fix segment split


I thought this strange, because I don't see anything wrong with this
patch. But if I removed it, the problem went away, and when I added it
back, the problem would show up easily.

I checked out v4.4-rc6 and tested again, thinking something else may be
wrong and has since been fixed. Nope, the error still showed up. I then
removed this commit and tried again. Sure enough, the problem went away!


My guess is that there's another bug lurking around somewhere, and the
bug that this patch fixed was hiding it. Now that the masking bug is
fixed, the real issue is showing up.

I'll pass this along to the block experts and see what you think of
it. I attached my config, and the test was a script that stresses
trace-cmd filters.

Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
much on x86_64 as my tests start with i386. It originally had issues in
x86_64 but that may be because the i386 test corrupted the filesystem
which is shared.

There may be a 32bit vs 64bit issue somewhere?

I spent way too much time on this. I'll try testing x86_64 after the
new year, if needed.

Merry X-mas and happy holidays!

-- Steve

Attachment: config-bad
Description: Binary data