Re: Major KVM issues with kernel 4.5 on the host

From: Marc Haber
Date: Thu Apr 21 2016 - 16:04:45 EST


On Thu, Apr 21, 2016 at 06:51:06PM +0200, Borislav Petkov wrote:
> On Thu, Apr 21, 2016 at 04:50:05PM +0200, Marc Haber wrote:
> > What bothers me is that since I ended up with a "suspect" commit that
> > actually results in a "good" kernel (running for 22 hours now), I must
> > have said "bad" to an actually "good" kernel, which means that I had
> > an unrelated crash or corruption. Is that reasoning correct?
>
> Hmm, did that "unrelated crash or corruption" have the same symptoms as
> the original one?

Yes, but there are two symptoms. The VM either suffers file system
issues (garbage read from files, or an aborted ext4 journal followed
by a read-only remount) or it stops dead in its tracks.


> > That one qualified as "good" six days ago. I'll retry, maybe I just
> > didn't wait long enough.
>
> So if the trigger time is varying so much, I'd try to double that to
> make sure I'm fairly certain about each commit I'm testing.

The longest trigger time I have seen was three hours. I tripled that
to nine hours, but that probably was not enough.

> Also, this is a single box we're talking about, right? And you're sure
> it hasn't had any corruption issues so far?

It is a single box, and it runs perfectly with kernel 4.4.

> I see you have amd64_edac loading, so it must have ECC DIMMs. Have you
> had any reports in the past of ECC errors in dmesg? Or other MCEs,
> lockups, etc? Can you grep your logs for stuff like "hardware error",
> "mce", "edac" etc? Do a case-insensitive search.

The box reports about one correctable error per week, so I probably
have a faulty DIMM, but since the issue only surfaces in VMs while the
host system is in perfect working order...

And yes, I am considering simply replacing the box with an Intel-based one.

I see "mce: CPU supports 6 MCE banks" once for each reboot, and about
30 "Machine check events logged" since January. How do I see which
events were logged?
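
The case-insensitive search suggested above could be wrapped up roughly
like this. It is only a sketch: the function name scan_hw_errors is made
up for illustration, and which log files exist (kern.log, syslog, the
journal) depends on the distribution.

```shell
#!/bin/sh
# Case-insensitive search for machine-check / ECC related messages.
# Usage: scan_hw_errors FILE...   (reads stdin when no file is given)
# The name scan_hw_errors is hypothetical, not part of any tool.
scan_hw_errors() {
    # -i: ignore case, -h: omit file names, -E: allow the | alternation
    grep -ihE 'hardware error|mce|edac' -- "$@"
}
```

Possible use, assuming the logs live in the usual Debian places:
"scan_hw_errors /var/log/kern.log /var/log/syslog" or
"dmesg | scan_hw_errors". This only finds the log lines; it does not
decode which events were logged.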

> > "Trying" means make oldconfig, make deb-pkg in my case right? Does it
> > matter what I answer to the numerous config questions that keep coming
> > up during the oldconfig step?
>
> What I do is:
>
> $ git bisect <good|bad>
>
> to mark the current commit after having tested it. Then I do
>
> $ yes "" | make oldconfig
>
> to set the new config options.

So you basically select the default for new options.

> Then
>
> $ make -j7
> $ make modules_install install
>
> and reboot into the new kernel. Kernel name will possibly change each
> time so I write down on paper which kernel I'm testing.

I go the way of Debian packages since it is easier to handle the
crypto file systems when the machine is booting up.

And yes, I am thinking about doing a test reinstall on an unencrypted
disk to find out whether encryption plays a role, but I currently need
the machine too urgently to take it out of service for half a month,
and, again, the host system is in perfect working order; it is just the
VMs that barf.

> You can verify when booting it by doing:
>
> $ dmesg | head
> [ 0.000000] Linux version 4.6.0-rc2+ (boris@pd) (gcc version 5.3.1 20160101 (Debian 5.3.1-5) ) #1 SMP PREEMPT Wed Apr 6 20:22:51 CEST 2016
> ...
>
> that date at the end of the line and number "#1" should be current.

I check the date of the package I am installing and the date stamp of
the kernels being installed to /boot. I'm reasonably sure I have that
under control.

> > Would it help to explicitly mark
> > 0e749e54244eec87b2a3cd0a4314e60bc6781115 as good so that the knowledge
> > gained during the last week is not completely lost?
>
> I'd do the whole thing again, just to be sure.
>
> I know, bisection is very time-consuming :-\ And it is particularly
> annoying if it is done on the box I'm normally using daily.

... and if testing a "good" kernel means a day.

> > So I need to git log | grep 46896c73c1a4 and apply the patch again
> > each time the commit is found?
>
> I think you can let git do that for ya:
>
> $ git branch --contains 46896c73c1a4
> * (HEAD detached at 46896c73c1a4)
>
> that lists that the current checked out HEAD contains that commit. If you do
>
> $ git checkout 46896c73c1a4~1
>
> then that "(HEAD detached..." line is not in the list of branches
> containing it.

And whenever 46896c73c1a4 is present, I need to apply Paolo's patch,
right?
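
If that is the case, the check could be scripted instead of done by
eye. A minimal sketch, assuming the fix is saved as a patch file
somewhere outside the tree; the function names and the patch file name
are made up for illustration. "git merge-base --is-ancestor" exits with
0 when the given commit is reachable from HEAD.

```shell
#!/bin/sh
# Succeed (exit 0) if commit $1 is an ancestor of, or equal to, HEAD.
# Function names here are hypothetical helpers, not git commands.
contains_commit() {
    git merge-base --is-ancestor "$1" HEAD
}

# Apply a patch only when the commit that needs it is in the tree
# currently checked out by the bisection.
# $1: commit id, $2: path to the patch file (hypothetical location)
apply_if_needed() {
    if contains_commit "$1"; then
        git apply "$2"
    fi
}
```

Possible use before each build:
"apply_if_needed 46896c73c1a4 ../paolo-fix.patch". The working tree
probably has to be clean again before the next git bisect step, for
example by reverting with "git apply -R" on the same file.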

Greetings
Marc

--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421