Re: [PATCH] core_pattern: add CPU specifier
From: Renaud Métrich
Date: Thu Sep 08 2022 - 02:45:57 EST
Hello,
I have been working closely with Oleksandr on a couple of cases where
customers saw segfaults in various processes, including basic tools
("grep", "cut", etc.) that usually don't die.
The coredumps of course showed nothing useful: from userland's
perspective nothing was wrong, there was just a bad pointer which
couldn't be explained.
Memory testing (e.g. Memtest86+) and CPU testing (usually with tools
from the hardware vendor) never showed any issue with the hardware
either, even though there was one, probably because triggering it
required special conditions, such as a specific load and/or thermal
conditions.
Troubleshooting such cases takes several weeks or even months, until we
have enough evidence that it's not the OS that is faulty, and it's
always a struggle.
Usually, when we start getting kernel crashes, we are happy, because a
kernel crash indicates the CPU the task was running on, and that seems
to always be reliable enough information to point to a faulty CPU. The
other cases, where no kernel crash could be observed, are solved only
after asking the customer to replace hardware components, which is
difficult to justify since it usually costs the customer money and
time.
I hope such a feature will be helpful for everybody doing Linux support.
Renaud.
On 9/7/22 at 17:53, Luis Chamberlain wrote:
> On Sat, Sep 03, 2022 at 08:43:30AM +0200, Oleksandr Natalenko wrote:
>> Statistically, in a large deployment regular segfaults may indicate a CPU issue.
> Can you elaborate on this? How common is this observed to be true? Are
> there any public findings or bugs where it showed this?
> Luis