RE: [PANIC, hyperv] BUG: unable to handle kernel paging request at ffff880077800004 (hv_ringbuffer_write)

From: Dexuan Cui
Date: Tue Aug 26 2014 - 06:32:18 EST


> -----Original Message-----
> From: Sitsofe Wheeler
> Sent: Tuesday, August 26, 2014 1:42 AM
> > > [ 7.645526] hv_vmbus: registering driver hyperv_fb
> > > [ 7.657553] BUG: unable to handle kernel paging request at
> > > ffff880077800004
> > > [ 7.658224] IP: [<ffffffff8159a7ac>] hv_ringbuffer_write+0x7c/0x150
> > > [ 7.658224] PGD 2da9067 PUD 2dac067 PMD 7fa27067 PTE
> > > 8000000077800060
> > > [ 7.658224] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> > It seems
> > hv_ringbuffer_write() ->
> > hv_get_ringbuffer_availbytes():
> > reading rbi->ring_buffer->read_index causes a page fault.
> >
> > It looks like rbi->ring_buffer was unmapped somehow, according to the
> > semantics of CONFIG_DEBUG_PAGEALLOC? Or was there memory
> > corruption somewhere?
> >
> > It looks like the panic disappears if the guest isn't configured with a
> > "Network Adapter".
IMO this has nothing to do with hyperv netvsc: here hyperv_fb is the first
driver to invoke vmbus_open(), and netvsc's vmbus_open() hasn't been invoked
at that point.

> This sounds very fishy as if network setup has left things in a bad
> state.
Ditto. I doubt the network driver causes the issue.

> What baffles me is the whole UP vs SMP thing - why would UP
> make this show up consistently? Perhaps some assertions could be added
> to check that rbi->ring_buffer still has sane values in it after
> operations on it are finished?
With more tests, I found that vcpus=2 hits the same issue, though with a
low probability.
vcpus=4 seems fine in my limited tests.

> I guess you could try switching things around and using
> kmemcheck (https://www.kernel.org/doc/Documentation/kmemcheck.txt).
> If the whole area close to rbi->ring_buffer->read_index is being stomped
> on it should show up. If it's just being set to a duff value or freed
> that's going to be harder to track down, although poisoning before
> freeing should allow us to distinguish that case...
Thanks for the info.

Actually I found the direct cause of the panic:
sometimes vmbus_post_msg() returns 4 (HV_STATUS_INVALID_ALIGNMENT),
but vmbus_open() doesn't propagate this error to its caller,
synthvid_connect_vsp(); it just does "goto error1" and frees the
ring buffer! So the later access to ring_buffer->read_index is caught by
CONFIG_DEBUG_PAGEALLOC.
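
To illustrate the failure mode, here is a minimal user-space sketch. It is
not the kernel code and not the attached patch: struct channel,
post_msg_stub(), open_buggy() and open_fixed() are hypothetical names, and
the two-variable (ret/err) shape is only an assumption about one way an
error path can free the buffer while still reporting success.

```c
#include <assert.h>
#include <stdlib.h>

#define HV_STATUS_INVALID_ALIGNMENT 4	/* value quoted in this mail */

/* Hypothetical stand-in for a channel that owns a ring buffer. */
struct channel {
	unsigned int *ring_buffer;	/* NULL once released */
};

/* Hypothetical stand-in for vmbus_post_msg(): returns the given status. */
static int post_msg_stub(int status)
{
	return status;
}

/* Buggy shape: the error path frees the buffer but returns a stale 0. */
static int open_buggy(struct channel *ch, int post_status)
{
	int ret, err = 0;

	ch->ring_buffer = calloc(4, sizeof(*ch->ring_buffer));
	if (!ch->ring_buffer)
		return -1;

	ret = post_msg_stub(post_status);
	if (ret != 0)
		goto error1;		/* err was never updated */

	return 0;

error1:
	free(ch->ring_buffer);
	ch->ring_buffer = NULL;
	return err;			/* caller sees success */
}

/* Fixed shape: the failure status reaches the caller. */
static int open_fixed(struct channel *ch, int post_status)
{
	int ret;

	ch->ring_buffer = calloc(4, sizeof(*ch->ring_buffer));
	if (!ch->ring_buffer)
		return -1;

	ret = post_msg_stub(post_status);
	if (ret != 0) {
		free(ch->ring_buffer);
		ch->ring_buffer = NULL;
		return ret;		/* non-zero: caller must not touch the buffer */
	}

	return 0;
}
```

With open_buggy(), a caller that checks only the return value would go on
to use a ring buffer that has already been freed, which is the kind of
access CONFIG_DEBUG_PAGEALLOC traps. With open_fixed(), the caller gets a
non-zero status and can bail out.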

I don't see any "invalid alignment" here... and I can't explain why vcpus=4
seems OK... Debugging WIP.

BTW, please try the attached patch.
With it, the VM no longer panics on my side with vcpus=1 and boots to a
shell prompt (though boot-up looks very slow; I have to wait several minutes...).

> From your analysis this doesn't sound framebuffer related - perhaps we
> could drop the linuxfb CC's on these mails going forward?
OK. I removed linuxfb and Jean.

Thanks,
-- Dexuan

Attachment: fix_vmbus_open.patch