Re: 4.4: INFO: rcu_sched self-detected stall on CPU

From: Steven Haigh
Date: Tue May 03 2016 - 11:12:20 EST

On 03/05/16 06:54, gregkh@xxxxxxxxxxxxxxxxxxx wrote:
> On Wed, Mar 30, 2016 at 05:04:28AM +1100, Steven Haigh wrote:
>> Greg, please see below - this is probably more for you...
>> On 03/29/2016 04:56 AM, Steven Haigh wrote:
>>> Interestingly enough, this just happened again - but on a different
>>> virtual machine. I'm starting to wonder if this may have something to do
>>> with the uptime of the machine - as the system that this seems to happen
>>> to is always different.
>>> Destroying it and monitoring it again has so far come up blank.
>>> I've thrown the latest lot of kernel messages here:
>> So I just did a bit of digging via the almighty Google.
>> I started hunting for these lines, as they happen just before the stall:
>> BUG: Bad rss-counter state mm:ffff88007b7db480 idx:2 val:-1
>> BUG: Bad rss-counter state mm:ffff880079c638c0 idx:0 val:-1
>> BUG: Bad rss-counter state mm:ffff880079c638c0 idx:2 val:-1
>> I stumbled across this post on the lkml:
>> The patch attached seems to reference the following change in
>> unmap_mapping_range in mm/memory.c:
>>> - struct zap_details details;
>>> + struct zap_details details = { };
>> When I browse the GIT tree for 4.4.6:
>> I see at line 2411:
>> struct zap_details details;
>> Is this something that has been missed being merged into the 4.4 tree?
>> I'll admit my kernel knowledge is not enough to understand what the code
>> actually does - but the similarities here seem uncanny.
> I'm sorry, I have no idea what you are asking me about here. Did I miss
> a patch that should be backported? Did I backport something
> incorrectly?

Hi Greg + all,

I did actually find the cause of my rss-counter problems: the
experimental PVH functionality in Xen. It caused corruption both on disk
and in memory. Turning it off resolved the problem.

As for the 'fix' above: it seems there was talk that zap_details should
be initialised as { } to avoid a problem that showed up in newer kernel
code in linux-next.
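
To illustrate why the initialiser could matter, here is a minimal
userspace sketch. It is not kernel code: struct zap_details_demo is a
hypothetical stand-in for zap_details, and new_flag and both function
names are made up for the example. The assumption is that a field gets
added to the struct later and an existing caller never assigns it. Stale
stack contents are modelled with memset(), since reading a genuinely
uninitialised variable is undefined behaviour. The sketch uses { 0 }
rather than { } because the empty form is a GNU extension before C23.

```c
#include <string.h>

/* Hypothetical stand-in for the kernel's struct zap_details. */
struct zap_details_demo {
	void *check_mapping;
	unsigned long first_index;
	unsigned long last_index;
	int new_flag;	/* assumed to be added later; old callers never set it */
};

/*
 * Models "struct zap_details details;": the storage starts out as
 * whatever was on the stack (simulated here with 0xff bytes), and the
 * caller assigns only the fields it knows about.
 */
static struct zap_details_demo fill_without_initializer(unsigned long first,
							 unsigned long last)
{
	struct zap_details_demo details;

	memset(&details, 0xff, sizeof(details));	/* fake stale stack data */
	details.check_mapping = NULL;
	details.first_index = first;
	details.last_index = last;
	return details;				/* new_flag is still garbage */
}

/*
 * Models "struct zap_details details = { };": every member, including
 * any added later, starts out as zero before the caller fills in the rest.
 */
static struct zap_details_demo fill_with_initializer(unsigned long first,
						      unsigned long last)
{
	struct zap_details_demo details = { 0 };

	details.first_index = first;
	details.last_index = last;
	return details;				/* new_flag is 0 */
}
```

Calling both with the same arguments gives identical first_index and
last_index, but only the second guarantees that new_flag is zero. Whether
the 4.4 struct has any member that is left unset in this way is exactly
the open question.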

The question that I cannot answer (and I leave this open to the more
knowledgeable on the list than I) is if that fix should also be applied
to other trees.

So the question as I see it:
Is this an actual bug that simply isn't being hit in other kernel
versions, and that the newer oom reaper code from linux-next uncovered,
or is the code as-is in the 4.4 tree considered correct?

It could well be that the experimental PVH code in Xen was tickling
something that triggered the same type of issue as the original bug
report that led to the patch quoted above.

Steven Haigh

Email: netwiz@xxxxxxxxx
Phone: (03) 9001 6090 - 0412 935 897