On 06/19/2017 10:32 AM, Corey Minyard wrote:
On 06/19/2017 09:29 AM, Tony Camuso wrote:
On 06/19/2017 09:31 AM, Corey Minyard wrote:
On 06/16/2017 07:15 AM, Corey Minyard wrote:
On 06/15/2017 10:54 AM, Corey Minyard wrote:
On 06/13/2017 09:54 AM, Tony Camuso wrote:
A vendor with a system having more than 128 CPUs occasionally encounters a
crash during shutdown. This is not an easily reproduceable event, but the
vendor was able to provide the following analysis of the crash, which
exhibits the same footprint each time.
crash> bt
PID: 0 TASK: ffff88017c70ce70 CPU: 5 COMMAND: "swapper/5"
#0 [ffff88085c143ac8] machine_kexec at ffffffff81059c8b
#1 [ffff88085c143b28] __crash_kexec at ffffffff811052e2
#2 [ffff88085c143bf8] crash_kexec at ffffffff811053d0
#3 [ffff88085c143c10] oops_end at ffffffff8168ef88
#4 [ffff88085c143c38] no_context at ffffffff8167ebb3
#5 [ffff88085c143c88] __bad_area_nosemaphore at ffffffff8167ec49
#6 [ffff88085c143cd0] bad_area_nosemaphore at ffffffff8167edb3
#7 [ffff88085c143ce0] __do_page_fault at ffffffff81691d1e
#8 [ffff88085c143d40] do_page_fault at ffffffff81691ec5
#9 [ffff88085c143d70] page_fault at ffffffff8168e188
[exception RIP: unknown or invalid address]
RIP: ffffffffa053c800 RSP: ffff88085c143e28 RFLAGS: 00010206
RAX: ffff88017c72bfd8 RBX: ffff88017a8dc000 RCX: ffff8810588b5ac8
RDX: ffff8810588b5a00 RSI: ffffffffa053c800 RDI: ffff8810588b5a00
RBP: ffff88085c143e58 R8: ffff88017c70d408 R9: ffff88017a8dc000
R10: 0000000000000002 R11: ffff88085c143da0 R12: ffff8810588b5ac8
R13: 0000000000000100 R14: ffffffffa053c800 R15: ffff8810588b5a00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <IRQ stack> ---
[exception RIP: cpuidle_enter_state+82]
RIP: ffffffff81514192 RSP: ffff88017c72be50 RFLAGS: 00000202
RAX: 0000001e4c3c6f16 RBX: 000000000000f8a0 RCX: 0000000000000018
RDX: 0000000225c17d03 RSI: ffff88017c72bfd8 RDI: 0000001e4c3c6f16
RBP: ffff88017c72be78 R8: 000000000000237e R9: 0000000000000018
R10: 0000000000002494 R11: 0000000000000001 R12: ffff88017c72be20
R13: ffff88085c14f8e0 R14: 0000000000000082 R15: 0000001e4c3bb400
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
This is the corresponding stack trace
It has crashed because the area pointed with RIP extracted from timer
element is already removed during a shutdown process.
The function is smi_timeout().
And we think ffff8810588b5a00 in RDX is a parameter struct smi_info
crash> rd ffff8810588b5a00 20
ffff8810588b5a00: ffff8810588b6000 0000000000000000 .`.X............
ffff8810588b5a10: ffff880853264400 ffffffffa05417e0 .D&S......T.....
ffff8810588b5a20: 24a024a000000000 0000000000000000 .....$.$........
ffff8810588b5a30: 0000000000000000 0000000000000000 ................
ffff8810588b5a40: ffffffffa053a040 ffffffffa053a060 @.S.....`.S.....
ffff8810588b5a50: 0000000000000000 0000000100000001 ................
ffff8810588b5a60: 0000000000000000 0000000000000e00 ................
ffff8810588b5a70: ffffffffa053a580 ffffffffa053a6e0 ..S.......S.....
ffff8810588b5a80: ffffffffa053a4a0 ffffffffa053a250 ..S.....P.S.....
ffff8810588b5a90: 0000000500000002 0000000000000000 ................
Unfortunately the top of this area is already detroyed by someone.
But because of two reasonns we think this is struct smi_info
1) The address included in between ffff8810588b5a70 and ffff8810588b5a80:
are inside of ipmi_si_intf.c see crash> module ffff88085779d2c0
2) We've found the area which point this.
It is offset 0x68 of ffff880859df4000
crash> rd ffff880859df4000 100
ffff880859df4000: 0000000000000000 0000000000000001 ................
ffff880859df4010: ffffffffa0535290 dead000000000200 .RS.............
ffff880859df4020: ffff880859df4020 ffff880859df4020 @.Y.... @.Y....
ffff880859df4030: 0000000000000002 0000000000100010 ................
ffff880859df4040: ffff880859df4040 ffff880859df4040 @@.Y....@@.Y....
ffff880859df4050: 0000000000000000 0000000000000000 ................
ffff880859df4060: 0000000000000000 ffff8810588b5a00 .........Z.X....
ffff880859df4070: 0000000000000001 ffff880859df4078 ........x@.Y....
If we regards it as struct ipmi_smi in shutdown process
it looks consistent.
The remedy for this apparent race is affixed below.
I think you are right about this problem, but in_shutdown is checked already
a bit before when newmsg is extracted from the list. Wouldn't it be better
to add the rcu_read_lock() region starting right before the previous
in_shutdown check to after the send? That would avoid a leak in this
case.
While lying awake unable to sleep, I realized that you can't call the
sender function while holding rcu_read_lock(). That will break RT,
because you can't claim a mutex while holding rcu_read_lock(),
and the sender function will claim normal spinlocks.
So I need to think about this a bit.
I was wrong about this. An rcu_read_lock() around the whole thing should
be all that is required to fix this. I can do a patch, or you can, if you like.
Thanks again for pointing this out.
-corey
Is this what you have in mind?
---
drivers/char/ipmi/ipmi_msghandler.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 9f69995..e20f8d7 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -3880,6 +3880,9 @@ static void smi_recv_tasklet(unsigned long val)
*/
if (!run_to_completion)
spin_lock_irqsave(&intf->xmit_msgs_lock, flags);
+
+ rcu_read_lock();
+
if (intf->curr_msg == NULL && !intf->in_shutdown) {
struct list_head *entry = NULL;
@@ -3894,11 +3897,15 @@ static void smi_recv_tasklet(unsigned long val)
newmsg = list_entry(entry, struct ipmi_smi_msg, link);
intf->curr_msg = newmsg;
}
+
+ if (newmsg)
+ intf->handlers->sender(intf->send_info, newmsg);
}
+
+ rcu_read_unlock();
+
if (!run_to_completion)
spin_unlock_irqrestore(&intf->xmit_msgs_lock, flags);
- if (newmsg)
- intf->handlers->sender(intf->send_info, newmsg);
handle_new_recv_msgs(intf);
}
No, you definitely cannot call the sender function while holding the lock.
I was talk about adding rcu_read_lock() before the spin_lock_irqsave()
and rcu_read_unlock() after the sender() call.
More like this, then.
diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c
index 9f69995..49a7e96 100644
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -3878,6 +3878,9 @@ static void smi_recv_tasklet(unsigned long val)
* because the lower layer is allowed to hold locks while calling
* message delivery.
*/
+
+ rcu_read_lock();
+
if (!run_to_completion)
spin_lock_irqsave(&intf->xmit_msgs_lock, flags);
if (intf->curr_msg == NULL && !intf->in_shutdown) {
@@ -3900,6 +3903,8 @@ static void smi_recv_tasklet(unsigned long val)
if (newmsg)
intf->handlers->sender(intf->send_info, newmsg);
+ rcu_read_unlock();
+
handle_new_recv_msgs(intf);
}
IIRC, this code was moved from a section that was under the rcu read
lock, but the lock was not added when moved.
-corey