4.9.130: CPU soft lockups and other weird memory errors

From: Christoph Anton Mitterer
Date: Tue Apr 09 2019 - 10:21:53 EST


Hey.

Perhaps someone can help with the following problem, which affects a
mass storage cluster at the physics faculty here:

The cluster consists of 40 nodes all running Debian stable with a
4.9.130 kernel serving some ~3 PiB storage via 10GbE networking.
Some of the nodes are Dell PowerEdges/PowerVaults, the others are
HP ProLiant DL380 Gen9.
All of them have basically the same configuration (except of course
obvious things like IP addresses, etc.) and all should have plenty of
memory (the HPs 64 GiB, the Dells 32 GiB).

The following two(?) problems occur only on the HP nodes (which is IMO
some indication that it's a hardware/kernel problem):



The HP nodes regularly get stuck, with either some strange memory
errors or CPU soft lockup errors being printed endlessly to the serial
console (see the attached files for some examples).

When this starts to happen, the system may come back a few times for
some seconds, but it usually ends up in an endless loop of these
errors from which only a hard reset recovers the node (everything
else, such as the serial console or ssh, no longer responds).

The problem seems to occur whenever system load goes up; "higher"
network load in particular seems to trigger it.
I say "higher" because it doesn't seem to take that much. For example,
a node that crashed today had a 1/5/15 min load of ~60 and was
receiving something between 40 and 60 MB/s (while sending basically
nothing).
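
To put numbers on "higher" before a node locks up, the load averages
and per-interface receive rate could be logged periodically from
/proc/loadavg and /proc/net/dev. The following is only a rough sketch
of that idea, not something that was run on the cluster; the function
names and the 10 second interval are made up for illustration.

```python
import time


def parse_rx_bytes(proc_net_dev_text):
    """Return {interface: received_bytes} from the text of /proc/net/dev."""
    rx = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, fields = line.split(":", 1)
        # The first counter after the interface name is received bytes.
        rx[name.strip()] = int(fields.split()[0])
    return rx


def rx_rate_mb_per_s(before, after, seconds):
    """Per-interface receive rate in MB/s (10**6 bytes) between two samples."""
    return {
        iface: (after[iface] - before[iface]) / seconds / 1e6
        for iface in after
        if iface in before
    }


def parse_loadavg(proc_loadavg_text):
    """Return the 1/5/15 min load averages from the text of /proc/loadavg."""
    one, five, fifteen = proc_loadavg_text.split()[:3]
    return float(one), float(five), float(fifteen)


def read_text(path):
    with open(path) as f:
        return f.read()


def monitor(interval=10):
    """Print a timestamp, the load averages and RX rates every `interval` s."""
    prev = parse_rx_bytes(read_text("/proc/net/dev"))
    while True:
        time.sleep(interval)
        cur = parse_rx_bytes(read_text("/proc/net/dev"))
        load = parse_loadavg(read_text("/proc/loadavg"))
        rates = rx_rate_mb_per_s(prev, cur, interval)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        # flush so the last lines before a lockup are not lost in a buffer
        print(stamp, load, {k: round(v, 1) for k, v in rates.items()},
              flush=True)
        prev = cur
```

Calling monitor() with its output sent to another machine (since the
affected node needs a hard reset) would leave a record of load and
receive rate right up to the point where the node stops responding.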


Any ideas on how to fix this, or on how to track it down further,
would be highly appreciated.


Cheers,
Chris.

Attachment: mem1.log.xz
Description: application/xz

Attachment: mem2.log.xz
Description: application/xz

Attachment: mem3.log.xz
Description: application/xz

Attachment: mem-followed-by-softlockup.log.xz
Description: application/xz

Attachment: soft-lockup1.log.xz
Description: application/xz

Attachment: soft-lockup2.log.xz
Description: application/xz

Attachment: soft-lockup3.log.xz
Description: application/xz