RE: Need help on "Self Detected Stall on CPU"

From: Atul Kulkarni
Date: Thu Apr 30 2020 - 22:11:07 EST


Thank you sir for your guidance and quick response.

Let me introduce my colleagues Paul and Mikhail (copied in CC). They will be acting on the guidance in your email and may reach out to you with further questions.

Appreciate your support and help.

Thanks,
Atul

-----Original Message-----
From: Paul E. McKenney <paulmck@xxxxxxxxxx>
Sent: 01 May 2020 00:47
To: Atul Kulkarni <Atul.Kulkarni@xxxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: Need help on "Self Detected Stall on CPU"

On Thu, Apr 30, 2020 at 06:47:20PM +0000, Atul Kulkarni wrote:
> Dear Sir,
>
> Hope you are doing well. I have watched your various conference videos and have read technical papers.
> We are facing an issue with CPU stall on our systems and I felt like there is no one better who can guide us on how we can deal with it.
>
> I have attached logs for your reference. Towards the end I ran a couple of SysRq commands and took a crash dump using SysRq, which may help provide additional information.
> Could you please guide us on how we could fix this issue or identify what is going wrong here?

Let's focus on the first few lines of your console message:

[20526.345089] INFO: rcu_preempt self-detected stall on CPU
[20526.351110] 0-...: (1051 ticks this GP) idle=1fe/140000000000002/0 softirq=146268/146268 fqs=0
[20526.360163] (t=2101 jiffies g=96468 c=96467 q=2)
[20526.365535] rcu_preempt kthread starved for 2101 jiffies! g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0

The last line contains the hint, namely "rcu_preempt kthread starved for 2101 jiffies!" If you don't let RCU's kernel threads run, then RCU CPU stall warnings are expected behavior.

The "RCU_GP_WAIT_FQS(3)" means that this kthread's last act was to sleep for three jiffies. As you can see from earlier in that same line, that was 2101 jiffies ago. The "->state=0x402" means that the scheduler believes that this kthread is blocked, that is, not yet runnable.
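For anyone who wants to pull these fields out of a console log automatically, here is a rough Python sketch that splits the "kthread starved" line into its parts and decodes the ->state value. The function names and the regular expression are illustrative only, not anything the kernel provides. The task-state bit values are an assumption based on include/linux/sched.h in 4.x-era kernels, and only three bits are listed; check them against your own kernel source.

```python
import re

# Assumption: bit values as in include/linux/sched.h for 4.x-era kernels.
# Only three bits are listed; anything else is reported as raw hex.
TASK_STATE_BITS = {
    0x001: "TASK_INTERRUPTIBLE",
    0x002: "TASK_UNINTERRUPTIBLE",
    0x400: "TASK_NOLOAD",
}

# Illustrative pattern matching the message format quoted in this thread.
STARVED_RE = re.compile(
    r"(?P<flavor>\w+) kthread starved for (?P<jiffies>\d+) jiffies! "
    r"g(?P<gp>\d+) c(?P<completed>\d+) f(?P<flags>0x[0-9a-fA-F]+) "
    r"(?P<gp_state>\w+)\((?P<gp_state_value>\d+)\) "
    r"->state=(?P<task_state>0x[0-9a-fA-F]+) ->cpu=(?P<cpu>\d+)"
)


def parse_starved_line(line):
    """Return the fields of a 'kthread starved' line as a dict, or None."""
    m = STARVED_RE.search(line)
    if m is None:
        return None
    return {
        "flavor": m.group("flavor"),
        "jiffies": int(m.group("jiffies")),
        "gp": int(m.group("gp")),
        "completed": int(m.group("completed")),
        "flags": int(m.group("flags"), 16),
        "gp_state": m.group("gp_state"),
        # The number printed in parentheses after the state name.
        "gp_state_value": int(m.group("gp_state_value")),
        "task_state": int(m.group("task_state"), 16),
        "cpu": int(m.group("cpu")),
    }


def decode_task_state(state):
    """Turn a ->state bitmask into a list of names (unknown bits as hex)."""
    if state == 0:
        return ["TASK_RUNNING"]
    names = []
    for bit in sorted(TASK_STATE_BITS):
        if state & bit:
            names.append(TASK_STATE_BITS[bit])
            state &= ~bit
    if state:
        names.append(hex(state))
    return names


SAMPLE = ("[20526.365535] rcu_preempt kthread starved for 2101 jiffies! "
          "g96468 c96467 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0")

if __name__ == "__main__":
    info = parse_starved_line(SAMPLE)
    print(info["jiffies"], info["gp_state"], decode_task_state(info["task_state"]))
```

Under the assumed bit values, 0x402 decodes to TASK_UNINTERRUPTIBLE plus TASK_NOLOAD, which is consistent with a kthread that is sleeping and waiting to be woken rather than one that is runnable.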

The usual way this sort of thing happens is a timer problem, be it a hardware configuration problem, a timer-driver bug, an interrupt-handling problem, and so on. This sort of problem is especially common when bringing up new hardware or when modifying timer code or when modifying code on the interrupt/exception paths.

So the question to ask yourself is "Why is the timer wakeup not reaching this kthread?", with special attention to changed code and new hardware.
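One possible first check, offered only as a sketch, is to compare two snapshots of /proc/interrupts and see whether the timer-related counters are advancing on the CPU in question. The helper names and the sample data below are hypothetical, and the row format varies by architecture, so treat a "stuck" result as a hint to look closer rather than as proof: a given timer interrupt may legitimately stay flat on an idle or tickless CPU.

```python
def timer_counts(proc_interrupts_text):
    """Return {irq_label: [per-CPU counts]} for rows that mention 'timer'."""
    lines = proc_interrupts_text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep or "timer" not in rest.lower():
            continue
        per_cpu = rest.split()[:ncpus]
        # Skip rows that do not carry one numeric count per CPU.
        if len(per_cpu) == ncpus and all(f.isdigit() for f in per_cpu):
            counts[label.strip()] = [int(f) for f in per_cpu]
    return counts


def stalled_timers(before, after):
    """Return (irq_label, cpu) pairs whose count did not advance."""
    stuck = []
    for label, old in before.items():
        new = after.get(label)
        if new is None:
            continue
        for cpu, (a, b) in enumerate(zip(old, new)):
            if b == a:
                stuck.append((label, cpu))
    return stuck


# Hypothetical snapshots, made up for illustration (not from the attached logs).
BEFORE = """\
           CPU0       CPU1
 19:       5000       4000  GIC  27 Level  arch_timer
 35:         12          0  GIC  45 Level  serial
"""
AFTER = """\
           CPU0       CPU1
 19:       5000       4250  GIC  27 Level  arch_timer
 35:         14          0  GIC  45 Level  serial
"""

if __name__ == "__main__":
    print(stalled_timers(timer_counts(BEFORE), timer_counts(AFTER)))
```

On a live system the two texts would come from reading /proc/interrupts twice, a second or so apart. In the made-up data above, the timer row on CPU 0 does not move between snapshots while CPU 1's does, which is the kind of pattern that would point toward the timer or interrupt path on that CPU.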

Thanx, Paul