Re: 4.4: INFO: rcu_sched self-detected stall on CPU

From: Steven Haigh
Date: Tue Mar 29 2016 - 04:56:47 EST


On 26/03/2016 8:07 AM, Steven Haigh wrote:
> On 26/03/2016 3:20 AM, Boris Ostrovsky wrote:
>> On 03/25/2016 12:04 PM, Steven Haigh wrote:
>>> It may not actually be the full logs. Once the system gets really upset,
>>> you can't run anything - as such, grabbing anything from dmesg is not
>>> possible.
>>>
>>> The logs provided above are all that gets spat out to the syslog server.
>>>
>>> I'll try tinkering with a few things to see if I can get more output -
>>> but right now, that's all I've been able to achieve. So far, my only
>>> ideas are to remove the 'quiet' options from the kernel command line -
>>> but I'm not sure how much that would help.
>>>
>>> Suggestions gladly accepted on this front.
>>
>> You probably want to run connected to guest serial console ("
>> serial='pty' " in guest config file and something like 'loglevel=7
>> console=tty0 console=ttyS0,38400n8' on guest kernel commandline). And
>> start the guest with 'xl create -c <cfg>' or connect later with 'xl
>> console <domainID>'.
>
> Ok thanks, I've booted the DomU with:
>
> $ cat /proc/cmdline
> root=UUID=63ade949-ee67-4afb-8fe7-ecd96faa15e2 ro enforcemodulesig=1
> selinux=0 fsck.repair=yes loglevel=7 console=tty0 console=ttyS0,38400n8
>
> I've left a screen session attached to the console (via xl console) and
> I'll see if that turns anything up. As this seems to be rather
> unpredictable when it happens, it may take a day or two to get anything.
> I just hope it's more than the syslog output :)

Interestingly enough, this just happened again - but on a different
virtual machine. I'm starting to wonder whether it has something to do
with the uptime of the machine, as the guest it happens to is different
each time.

Destroying that guest and monitoring it again after a restart has so far
turned up nothing.

I've thrown the latest lot of kernel messages here:
http://paste.fedoraproject.org/346802/59241532

Interestingly, around the same time, /var/log/messages on the remote
syslog server shows:
Mar 29 17:00:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:00:01 zeus systemd: Starting user-0.slice.
Mar 29 17:00:01 zeus systemd: Started Session 1567 of user root.
Mar 29 17:00:01 zeus systemd: Starting Session 1567 of user root.
Mar 29 17:00:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:00:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:01:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:01:01 zeus systemd: Starting user-0.slice.
Mar 29 17:01:01 zeus systemd: Started Session 1568 of user root.
Mar 29 17:01:01 zeus systemd: Starting Session 1568 of user root.
Mar 29 17:08:34 zeus ntpdate[18569]: adjust time server 203.56.246.94 offset -0.002247 sec
Mar 29 17:08:34 zeus systemd: Removed slice user-0.slice.
Mar 29 17:08:34 zeus systemd: Stopping user-0.slice.
Mar 29 17:10:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:10:01 zeus systemd: Starting user-0.slice.
Mar 29 17:10:01 zeus systemd: Started Session 1569 of user root.
Mar 29 17:10:01 zeus systemd: Starting Session 1569 of user root.
Mar 29 17:10:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:10:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:20:01 zeus systemd: Created slice user-0.slice.
Mar 29 17:20:01 zeus systemd: Starting user-0.slice.
Mar 29 17:20:01 zeus systemd: Started Session 1570 of user root.
Mar 29 17:20:01 zeus systemd: Starting Session 1570 of user root.
Mar 29 17:20:01 zeus systemd: Removed slice user-0.slice.
Mar 29 17:20:01 zeus systemd: Stopping user-0.slice.
Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout (limit 1min)!
Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed out. Terminating.
Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed out. Killing.
Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after SIGKILL. Ignoring.
Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm timed out. Killing.
Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after final SIGKILL. Entering failed mode.
Mar 29 17:38:26 zeus systemd: Unit systemd-logind.service entered failed state.
Mar 29 17:38:26 zeus systemd: systemd-logind.service failed.
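
The spacing of the systemd-logind escalation steps in the log above can be
checked with a short script. This is only a minimal sketch: the helper
names (parse_timestamp, step_gaps) are made up for illustration, the year
is assumed to be 2016 because classic syslog timestamps omit it, and the
log lines are copied from the excerpt above with wrapped lines rejoined.

```python
from datetime import datetime

# Lines copied from the syslog excerpt above (wrapped lines rejoined).
LOG = """\
Mar 29 17:30:55 zeus systemd: systemd-logind.service watchdog timeout (limit 1min)!
Mar 29 17:32:25 zeus systemd: systemd-logind.service stop-sigabrt timed out. Terminating.
Mar 29 17:33:56 zeus systemd: systemd-logind.service stop-sigterm timed out. Killing.
Mar 29 17:35:26 zeus systemd: systemd-logind.service still around after SIGKILL. Ignoring.
Mar 29 17:36:56 zeus systemd: systemd-logind.service stop-final-sigterm timed out. Killing.
Mar 29 17:38:26 zeus systemd: systemd-logind.service still around after final SIGKILL. Entering failed mode.
"""


def parse_timestamp(line, year=2016):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a line.

    The year is not present in the log, so it is supplied by the caller
    (assumed to be 2016 here).
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def step_gaps(text, unit="systemd-logind.service"):
    """Return the seconds between consecutive log lines that mention unit."""
    times = [parse_timestamp(l) for l in text.splitlines() if unit in l]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = step_gaps(LOG)
print(gaps)
```

Each step follows the previous one by roughly 90 seconds. That would be
consistent with systemd's default stop timeout of 90 seconds, assuming the
unit does not override it, and suggests the process never responded to any
of the signals.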

--
Steven Haigh

Email: netwiz@xxxxxxxxx
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
