Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

From: Ming Lei
Date: Sun May 17 2015 - 21:27:03 EST


Hi Santosh,

Thanks for your report!

On Sun, May 17, 2015 at 4:13 AM, santosh shilimkar
<santosh.shilimkar@xxxxxxxxxx> wrote:
> Hi Ming Lei, Jens,
>
> While doing a few tests with recent kernels on Xen Server,
> we saw guest (DOMU) disk images getting corrupted while booting.
> Strangely, the issue is seen so far only with a disk image on an
> ocfs2 volume. If the same image is kept on an EXT3/4 drive, no
> corruption is observed. The issue is easily reproducible: you see
> a flurry of errors while the guest is mounting its file systems.
>
> After some debugging and bisecting, we narrowed the issue down to
> commit "b5dd2f6 block: loop: improve performance via blk-mq". With
> that commit reverted, the corruption goes away.
>
> Some more details on the test setup:
> 1. OVM (Xen) Server kernel (DOM0) upgraded to a more recent kernel
> which includes commit b5dd2f6. Boot the server.
> 2. On the DOM0 file system, create an ocfs2 volume.
> 3. Keep the guest (VM) disk image on the ocfs2 volume.
> 4. Boot the guest image (xm create vm.cfg).

I am not familiar with Xen, so is the image accessed via a
loop block device inside the guest VM? Is the loop device created
in DOM0 or in the guest VM?

> 5. Observe the VM boot console log. The VM itself uses an EXT3 fs.
> You will see errors like those below, and after this boot the file
> system/disk image is corrupted and mostly won't boot next time.

OK, that means the image gets corrupted while booting the VM.

>
> Trimmed Guest kernel boot log...
> --->
> EXT3-fs (dm-0): using internal journal
> EXT3-fs: barriers not enabled
> kjournald starting. Commit interval 5 seconds
> EXT3-fs (xvda1): using internal journal
> EXT3-fs (xvda1): mounted filesystem with ordered data mode
> Adding 1048572k swap on /dev/VolGroup00/LogVol01. Priority:-1 extents:1
> across:1048572k
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 804966: bad block
> 843250
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
> JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk
> of filesystem corruption in case of system crash.
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #777661:
> rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
>
> [...]
>
> automount[2605]: segfault at 4 ip b7756dd6 sp b6ba8ab0 error 4 in
> ld-2.5.so[b774c000+1b000]
> EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap -
> block_group = 34, block = 1114112
> EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap -
> block_group = 0, block = 221
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 709252: bad block
> 370280
> ntpd[2691]: segfault at 2563352a ip b77e5000 sp bfe27cec error 6 in
> ntpd[b777d000+74000]
> EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory
> #618360: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0,
> name_len=0
> EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #709178:
> rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 368277: bad block
> 372184
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620393
> --------------------
>
> From debugging the actual data on the disk vs. what is read by
> the guest VM, we suspect the *reads* are not actually going all
> the way to disk and are possibly returning the wrong data, because
> the actual data on the ocfs2 volume at those locations appears
> to be non-zero whereas the guest reads it as zero.
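
One way to cross-check that (a minimal userspace sketch, not something
from the report; the tool name and the block/size arguments are made up
for illustration) is to read the suspect block straight from the backing
image in DOM0 with O_DIRECT, bypassing the page cache, and diff the
bytes against what the guest sees at the same location:

/* readblk.c: dump one block of a file, read with O_DIRECT.
 * Build:  gcc -O2 -o readblk readblk.c
 * Usage:  ./readblk /path/to/disk.img <block-number> <block-size>
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long long blk;
	size_t bs;
	void *buf;
	int fd;
	ssize_t n, i;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <image> <block> <blocksize>\n", argv[0]);
		return 1;
	}
	blk = atoll(argv[2]);
	bs = (size_t)atol(argv[3]);

	/* O_DIRECT needs an aligned buffer (and aligned size/offset). */
	if (posix_memalign(&buf, 4096, bs)) {
		perror("posix_memalign");
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	n = pread(fd, buf, bs, blk * (long long)bs);
	if (n < 0) {
		perror("pread");
		return 1;
	}

	/* Hex dump so the on-disk bytes can be compared with the guest view. */
	for (i = 0; i < n; i++)
		printf("%02x%c", ((unsigned char *)buf)[i],
		       (i % 16 == 15) ? '\n' : ' ');
	putchar('\n');

	close(fd);
	free(buf);
	return 0;
}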

The two big changes in the patchset are: 1) use blk-mq request-based
I/O; 2) submit I/O concurrently (write vs. write is still serialized).
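
For reference, a simplified sketch of the dispatch path that commit
introduces (reconstructed from the quoted diff further down; error
handling and other details omitted). Each read request gets its own
work item, while writes are chained onto a per-device list drained by
a single work item, so only write vs. write stays serialized:

static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);

	blk_mq_start_request(bd->rq);

	if (cmd->rq->cmd_flags & REQ_WRITE) {
		struct loop_device *lo = cmd->rq->q->queuedata;
		bool need_sched = true;

		/* Writes are queued on a per-device list... */
		spin_lock_irq(&lo->lo_lock);
		if (lo->write_started)
			need_sched = false;
		else
			lo->write_started = true;
		list_add_tail(&cmd->list, &lo->write_cmd_head);
		spin_unlock_irq(&lo->lo_lock);

		/* ...and drained by one work item, so write vs. write
		 * is serialized. */
		if (need_sched)
			queue_work(loop_wq, &lo->write_work);
	} else {
		/* Each read gets its own work item, so reads can hit
		 * the backing file concurrently, including concurrently
		 * with in-flight writes. */
		queue_work(loop_wq, &cmd->read_work);
	}

	return BLK_MQ_RQ_QUEUE_OK;
}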

Could you apply the patch in the link below to see if it fixes the
issue? BTW, this patch only removes the concurrent submission.

http://marc.info/?t=143093223200004&r=1&w=2

>
> I tried a few experiments without much success so far. One of the
> things I suspected was that requests are now submitted to the backend
> file/device concurrently, so I tried to move them under lo->lo_lock
> so that they get serialized. I also moved the blk_mq_start_request()
> inside the actual work, as in the patch below. But it didn't help.
> Thought of reporting the issue to get more ideas on what could be
> going wrong. Thanks for the help in advance!
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 39a83c2..22713b2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1480,20 +1480,17 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> const struct blk_mq_queue_data *bd)
> {
> struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> + struct loop_device *lo = cmd->rq->q->queuedata;
>
> - blk_mq_start_request(bd->rq);
> -
> + spin_lock_irq(&lo->lo_lock);
> if (cmd->rq->cmd_flags & REQ_WRITE) {
> - struct loop_device *lo = cmd->rq->q->queuedata;
> bool need_sched = true;
>
> - spin_lock_irq(&lo->lo_lock);
> if (lo->write_started)
> need_sched = false;
> else
> lo->write_started = true;
> list_add_tail(&cmd->list, &lo->write_cmd_head);
> - spin_unlock_irq(&lo->lo_lock);
>
> if (need_sched)
> queue_work(loop_wq, &lo->write_work);
> @@ -1501,6 +1498,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> queue_work(loop_wq, &cmd->read_work);
> }
>
> + spin_unlock_irq(&lo->lo_lock);
> return BLK_MQ_RQ_QUEUE_OK;
> }
>
> @@ -1517,6 +1515,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
> if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
> goto failed;
>
> + blk_mq_start_request(cmd->rq);
> +
> ret = 0;
> __rq_for_each_bio(bio, cmd->rq)
> ret |= loop_handle_bio(lo, bio);
> --

I don't see why the above change would be necessary.

Thanks,
Ming