Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

From: Michael S. Tsirkin
Date: Thu Mar 27 2014 - 04:10:55 EST


On Thu, Mar 27, 2014 at 08:36:57AM +0100, Markus Armbruster wrote:
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> writes:
>
> > On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
> >> Hi List!
> >> Hope some one can help me, we had a big issue in our cloud the other
> >> day, a couple of our openstack regions ( +2000 kvm guests with qcow2 )
> >> went read only filesystem from the guest side because the backing
> >> files directory (the openstack _base directory) was compromised and
> >> the data was lost, when we realized the data was lost, it took us 5
> >> mins to restore the backup of the backing files, but by that time all
> >> the kvm guests received some kind of IO error from the hypervisor
> >> layer, and went read only on root filesystem.
> >>
> >> My question would be, is there a way to hold the IO operations against
> >> the backing files ( i thought that would be 99% READ operations ) for
> >> a little longer ( im asking this because i dont quite understand what
> >> is the process and when it raises the error ) in a case the backing
> >> files are missing (no IO possible) but is recoverable within minutes ?
> >>
> >> Any tip on how to achieve this if possible, or information about how
> >> backing files works on kvm, will be amazing.
> >> Waiting for feedback!
> >>
> >> kindest regards.
> >> Alejandro Comisario
> >
> >
> > I'm guessing this is what happened: guests timed out meanwhile.
> > You can increase the timeout within the guest:
> > echo 600 > /sys/block/sda/device/timeout
> > to timeout after 10 minutes.
> >
> > If you have installed qemu guest agent on your system, you can do this
> > from the host. Unfortunately by default it's memory can be pushed out to swap
> > and then on disk error access there might will fail :(
> > Maybe we should consider mlock on all its memory at least as an option.
> >
> > You could pause your guests, restart them after the issue is resolved,
> > and we could I guess add functionality to pause VM on disk errors
> > automatically.
> > Stefan?
>
> Would -drive rerror=stop do?

I think it will. It's a pity it doesn't appear in --help output -
would make it easier to find.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/