Re: Hyper-V stalls on device errors

From: Sitsofe Wheeler
Date: Tue Apr 30 2013 - 12:12:04 EST

In reply to: Sitsofe Wheeler: "Hyper-V stalls on device errors"
Next in thread: KY Srinivasan: "RE: Hyper-V stalls on device errors"

Apologies for the previous empty mail.

While testing a Windows 2012 host with a Fedora 18 guest running a 3.9
kernel I've found that Hyper-V will stall all access to
(para)virtualised disk devices when an underlying disk device returns an
error. Roughly every ten seconds a small amount of I/O goes through
before stalling again, and this plays havoc with asynchronous I/O to
disk devices too.

To reproduce this I created a device mapper device with a single error
sector in it by using

dd if=/dev/zero of=/tmp/fakeblock0 bs=100M count=1
losetup --find --show /tmp/fakeblock0
# Assuming losetup uses /dev/loop0
cat << EOF | dmsetup create oneerror
0 13443 linear /dev/loop0 0
13443 1 error
13444 191356 linear /dev/loop0 0
EOF
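
As a rough cross-check of the table above, the three segments should be
contiguous and add up to the size of the backing file. The sketch below
assumes 512-byte sectors (the unit dmsetup tables use) and that GNU dd
treats bs=100M as 100 MiB; check_dm_table is a hypothetical helper name,
not part of dmsetup.

```shell
# Hypothetical helper: verify that "start length target ..." lines read
# from stdin form a contiguous table starting at sector 0, and print the
# total number of sectors covered.
check_dm_table() {
    expected_start=0
    while read -r start length _rest; do
        [ -z "$start" ] && continue
        if [ "$start" -ne "$expected_start" ]; then
            echo "gap or overlap at sector $start (expected $expected_start)" >&2
            return 1
        fi
        expected_start=$((start + length))
    done
    echo "$expected_start"
}

table='0 13443 linear /dev/loop0 0
13443 1 error
13444 191356 linear /dev/loop0 0'

total=$(printf '%s\n' "$table" | check_dm_table)
backing=$((100 * 1024 * 1024 / 512))   # 100 MiB in 512-byte sectors
echo "table covers $total sectors, backing file has $backing"
```

Both numbers come out as 204800, so the table spans the whole loop
device with exactly one erroring sector at 13443.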

After installing scsi-target-utils the /dev/mapper/oneerror device was
then turned into an iSCSI target by adding

<target iqn.2013-04.com.stormagic:oneerror>
backing-store /dev/mapper/oneerror
write-cache off
</target>

to /etc/tgt/targets.conf . The iSCSI target service was started with
systemctl start tgtd.service (watch out for
https://bugzilla.redhat.com/show_bug.cgi?id=848942 and you may need to
disable the firewall by using systemctl stop firewalld.service ).

The Windows 2012 iSCSI initiator was used to add the target to the
machine running the hypervisor (the usual discovery against the Linux
box serving the iSCSI target should work). Once done, this disk was then
added to the Linux guest's Hyper-V settings via the SCSI controller. A
spare IDE controller disk was also added.

In the Linux guest a badblocks run was started on the spare IDE disk
block device so that I/O was visible. A
dd if=/dev/zero of=/dev/sdc oflag=direct
(where /dev/sdc is the erroring block device that was added earlier) was
then run to trigger access to the bad sector.

The following appeared in dmesg:

[ 160.718836] hv_storvsc vmbus_0_12: cmd 0x2a scsi status 0x2 srb status 0x4
[ 170.991312] hv_storvsc vmbus_0_12: cmd 0x2a scsi status 0x2 srb status 0x4
[ 181.039597] hv_storvsc vmbus_0_12: cmd 0x2a scsi status 0x2 srb status 0x4
[ 191.081242] hv_storvsc vmbus_0_12: cmd 0x2a scsi status 0x2 srb status 0x4
[ 201.116790] hv_storvsc vmbus_0_12: cmd 0x2a scsi status 0x2 srb status 0x4
[ 211.127741] hv_storvsc vmbus_0_12: cmd 0x2a scsi status 0x2 srb status 0x4
[ 221.140338] sd 3:0:0:2: [sdc] Unhandled error code
[ 221.140346] sd 3:0:0:2: [sdc]
[ 221.140349] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 221.140352] sd 3:0:0:2: [sdc] CDB:
[ 221.140354] Write(10): 2a 00 00 00 34 00 00 01 00 00
[ 221.140366] end_request: critical target error, dev sdc, sector 13312
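
The reported sector 13312 is the start of the failing request rather
than the error sector itself. A minimal sketch for decoding the CDB,
assuming the standard 10-byte layout (bytes 2-5 are the big-endian LBA,
bytes 7-8 the transfer length in blocks); decode_cdb10 is a hypothetical
helper name:

```shell
# Hypothetical helper: print "LBA LENGTH" for a 10-byte READ/WRITE CDB
# given as ten space-separated hex bytes.
decode_cdb10() {
    # Split the argument into positional parameters: $1 is the opcode,
    # $3..$6 are CDB bytes 2-5 (LBA), $8..$9 are CDB bytes 7-8 (length).
    set -- $1
    cdb_lba=$(( (0x$3 << 24) | (0x$4 << 16) | (0x$5 << 8) | 0x$6 ))
    cdb_len=$(( (0x$8 << 8) | 0x$9 ))
    echo "$cdb_lba $cdb_len"
}

decode_cdb10 "2a 00 00 00 34 00 00 01 00 00"   # prints "13312 256"
```

The request therefore covers sectors 13312 to 13567, which includes the
error sector at 13443 from the device mapper table.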

A Fedora 18 guest on VMware ESXi returned the error in under a second
and only had the following in dmesg:

[ 293.917383] sd 2:0:1:0: [sdb] Unhandled sense code
[ 293.917391] sd 2:0:1:0: [sdb]
[ 293.917394] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 293.917408] sd 2:0:1:0: [sdb]
[ 293.917414] Sense Key : Medium Error [current]
[ 293.917418] sd 2:0:1:0: [sdb]
[ 293.917421] Add. Sense: Unrecovered read error
[ 293.917424] sd 2:0:1:0: [sdb] CDB:
[ 293.917428] Write(10): 2a 00 00 00 34 00 00 04 00 00
[ 293.917436] end_request: critical target error, dev sdb, sector 13312

The stalls do not occur when the bad block device is created directly in
the Linux guest. From the previous log messages it looks like Hyper-V
is retrying for up to a minute before returning an error, and the I/O
stalls to separate (but virtualised) devices on different buses look
like an unintended side effect...
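
The retry cadence can be read off the hv_storvsc timestamps. A small
sketch, assuming dmesg lines carry a leading "[ seconds.micros]"
timestamp as in the Hyper-V log above; storvsc_intervals is a
hypothetical helper name:

```shell
# Hypothetical helper: print the gap in seconds between consecutive
# hv_storvsc messages read from stdin.
# Usage: dmesg | storvsc_intervals
storvsc_intervals() {
    awk '/hv_storvsc/ {
        ts = $0
        sub(/^\[ */, "", ts)     # drop the leading "[" and padding
        sub(/\].*$/, "", ts)     # drop everything from "]" onward
        if (seen) printf "%.1f\n", ts - prev
        prev = ts
        seen = 1
    }'
}
```

On the Hyper-V log above the gaps are all about ten seconds, and the
final error arrives roughly 60 seconds after the first hv_storvsc
message (160.7 to 221.1).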

--
Sitsofe | http://sucs.org/~sits/