Re: [BUG] File system corruption with 4.4-rc3 and beyond

From: Ming Lei
Date: Tue Dec 22 2015 - 20:14:35 EST


On Wed, Dec 23, 2015 at 8:16 AM, Jens Axboe <axboe@xxxxxx> wrote:
> On 12/22/2015 05:09 PM, Steven Rostedt wrote:
>>
>> OK, I started with 4.4-rc4 to add some urgent ftrace patches and
>> started testing. My tests started to fail, and then I noticed they
>> failed with v4.4-rc4 as well. I got strange errors. Finally, I noticed
>> that I was constantly getting messages like this:
>>
>> ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
>> ata2.00: irq_stat 0x20000000, host bus error
>> ata2: SError: { HostInt }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
>> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus
>> error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
>> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus
>> error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
>> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus
>> error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
>> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus
>> error)
>> ata2.00: status: { DRDY }
>> ata2: hard resetting link
>> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>> ata2.00: configured for UDMA/100
>> ata2: EH complete
>>
>>
>> The test box has a relatively new mobo and such, but I know the HD was
>> old. So I thought that the HD was simply failing. I installed a new HD
>> and spent lots of time since last Thursday trying to set it up to work
>> with my testing scripts. Unfortunately, I installed a newer Fedora that
>> no longer supported the older grub1 and I wasted lots of time trying to
>> get grub2 to do what I wanted. I finally gave up and used
>> syslinux/extlinux and got it working again. Unfortunately, I still got
>> these ata2 errors! I started thinking that the mobo may be bad.
>>
>> But then I decided to try an older kernel, and the errors never showed
>> up. I booted back and forth several times and the errors were very
>> reliable. I have multiple OSes on this box so every time I got an
>> error, I would boot into one of the other OSes and do fsck on the
>> filesystems. Because the longer I ran my tests with this bug, it would
>> eventually start corrupting the ext4 filesystem.
>>
>> Since it seemed very reliable, I started my bisect. It came down to this
>> patch:
>>
>> From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
>> From: Ming Lei <ming.lei@xxxxxxxxxxxxx>
>> Date: Tue, 24 Nov 2015 10:35:29 +0800
>> Subject: [PATCH] block: fix segment split
>>
>>
>> I thought this strange, because I don't see anything wrong with this
>> patch. But if I removed it, the problem went away, and when I added it
>> back, the problem would show up easily.
>>
>> I checkout v4.4-rc6 and tested again, thinking something else may be
>> wrong and has since been fixed. Nope, the error still showed up. I then
>> removed this commit and tried again. Sure enough, the problem went away!
>
>
> Probably the other way around, I think, it uncovered an issue with the
> segment counting for certain cases.

Diethard said the same case can be fixed by the patch 'block:
ensure to split after potentially bouncing a bio', so please just test it.

Also looks it is helpful to add a warning for the splitted bio in
bio_for_each_segment_all().

>
>> My guess is that there's another bug lurking around somewhere, and the
>> bug that this patch fixed hid the problem. Now that this patch fixed a
>> bug that would hide the issue, the issue is showing up.
>>
>> I'll pass this along to the block experts and see what you can think of
>> it. I attached my config, and the test was a script that stress
>> trace-cmd filters.
>>
>> Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
>> much on x86_64 as my tests start with i386. It originally had issues in
>> x86_64 but that may be because the i386 test corrupted the filesystem
>> which is shared.
>>
>> There may be a 32bit vs 64bit issue somewhere?
>
>
> I'm guessing it's the same issue that was recently diagnosed, which would
> make sense if you hit this on 32-bit with highmem. Patch is pending, if you
> feel inclined, it'd be great if you could add this patch and retry:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f
>
> --
> Jens Axboe
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/