Crash in crc32_le during kmemleak_scan()

From: vigneshr
Date: Wed May 13 2015 - 09:15:17 EST


We are seeing a panic in crc32_le during kmemleak_scan(). A snippet of the
crash log is pasted below. This is seen on a 3.10 kernel.

This issue was discussed earlier in this thread as well:
https://lkml.org/lkml/2014/2/6/287.


<4>[ 70.732606] kmemleak: Cannot allocate a kmemleak_object structure
<6>[ 70.739803] kmemleak: Kernel memory leak detector disabled
<6>[ 70.740831] lowmemorykiller: Killing 'id.gms.wearable' (7004), adj 647,
<6>[ 70.740831] to free 29144kB on behalf of 'kswapd0' (95) because
<6>[ 70.740831] cache 216560kB is below limit 225000kB for oom_score_adj 529
<6>[ 70.740831] Free memory is 5788kB above reserved.
<6>[ 70.740831] Free CMA is 107248kB
<6>[ 70.740831] Total reserve is 17856kB
<6>[ 70.740831] Total free pages is 113036kB
<6>[ 70.740831] Total file cache is 217632kB
<6>[ 70.740831] Slab Reclaimable is 17104kB
<6>[ 70.740831] Slab UnReclaimable is 117508kB
<6>[ 70.740831] Total Slab is 134612kB
<6>[ 70.740831] GFP mask is 0xd0
<6>[ 70.768877] lowmemorykiller: Killing 'le.android.talk' (7955), adj 529,
<6>[ 70.768877] to free 39124kB on behalf of 'kswapd0' (95) because
<6>[ 70.768877] cache 211088kB is below limit 225000kB for oom_score_adj 529
<6>[ 70.768877] Free memory is 4716kB above reserved.
<6>[ 70.768877] Free CMA is 108048kB
<6>[ 70.768877] Total reserve is 17856kB
<6>[ 70.768877] Total free pages is 112764kB
<6>[ 70.768877] Total file cache is 212304kB
<6>[ 70.768877] Slab Reclaimable is 17104kB
<6>[ 70.768877] Slab UnReclaimable is 117408kB
<6>[ 70.768877] Total Slab is 134512kB
<6>[ 70.768877] GFP mask is 0xd0
<1>[ 70.935886] Unable to handle kernel paging request at virtual address ffffff8001119000
<1>[ 70.942768] pgd = ffffffc00007d000
<1>[ 70.946150] [ffffff8001119000] *pgd=000000008c307003, *pmd=00000000b2b96003, *pte=0000000000000000
<0>[ 70.955094] Internal error: Oops: 96000007 [#1] PREEMPT SMP
<6>[ 70.960648] Modules linked in: wlan(O)
<6>[ 70.964383] CPU: 2 PID: 198 Comm: kmemleak Tainted: G O 3.10.49-gcb3457b-00006-g72e5422 #1
<6>[ 70.973497] task: ffffffc030b24b40 ti: ffffffc03019c000 task.ti: ffffffc03019c000
<6>[ 70.980966] PC is at crc32_le+0x78/0x140
<6>[ 70.984873] LR is at kmemleak_scan+0x2d0/0x3c8
<6>[ 70.989296] pc : [<ffffffc00044ea30>] lr : [<ffffffc0002f088c>] pstate: 800001c5
<6>[ 70.996671] sp : ffffffc03019fd60
<6>[ 70.999972] x29: ffffffc03019fd70 x28: 0000000000000000
<6>[ 71.005267] x27: 0000000000000000 x26: 0000000000000000
<6>[ 71.010557] x25: ffffffc00168d6e8 x24: 0000000000000000
<6>[ 71.028876] x23: 0000000000000140 x22: ffffffc0014e23d8
<6>[ 71.034168] x21: ffffffc0014e23a0 x20: ffffffc0014e2338
<6>[ 71.039464] x19: ffffffc002a02640 x18: 000000000b220dc1
<6>[ 71.044757] x17: 0000000000000400 x16: 0000000000000800
<6>[ 71.050053] x15: 0000000000000c00 x14: 0000000000001400
<6>[ 71.064462] x13: 0000000000001800 x12: 0000000000001c00
<6>[ 71.069756] x11: 0000000000001000 x10: ffffffc0014c2000
<6>[ 71.075052] x9 : 0000000000000000 x8 : 0000000000000000
<6>[ 71.080347] x7 : ffffff8001118ffc x6 : 00000000000012aa
<6>[ 71.085642] x5 : 0000000000000000 x4 : ffffffc0014e23d8
<6>[ 71.100052] x3 : 0000000000000000 x2 : 00000000000012aa
<6>[ 71.105348] x1 : ffffff8001118ffc x0 : 0000000000000000
<6>[ 71.115590] e930 97fffff053107e60 a94153f333103e80 d65f03c0a8c27bfd d280000453003c00 eb02009f90003f65 3864682354000120 91000484913ea0a6 12001c634a000063
<6>[ 71.138421] e970 4a4020607863d8c3 d65f03c017fffff7 d280000453003c00 eb02009fb0003f65 3864682354000120 910004849106a0a6 12001c634a000063 4a4020607863d8c3
<snip> <snip>
<6>[ 71.468288] fde0 ffffffc03019fe10 ffffffc00023bb1c ffffffc03302fd30 ffffffc001691738 ffffffc000f61dd0 ffffffc00023bb00 0000000000000000 ffffffc000203ba0
<6>[ 71.482001] fe20 ffffffc00023ba6c ffffffc03302fd30 0000000000000000 0000000000000000 0000000000000000 ffffffc000247048 ffffffc03019fea0 0000000000000000
<snip> <snip>
<6>[ 71.495716]
<0>[ 71.497201] Process kmemleak (pid: 198, stack limit = 0xffffffc03019c058)
<6>[ 71.503968] Call trace:
<6>[ 71.506402] [<ffffffc00044ea30>] crc32_le+0x78/0x140
<6>[ 71.511356] [<ffffffc0002f0cc0>] kmemleak_scan_thread+0x74/0xc0
<6>[ 71.517249] [<ffffffc00023bb18>] kthread+0xac/0xb8
<0>[ 71.522027] Code: d281800f d2810010 d2808011 b4000466 (b94004e3)
<4>[ 71.528180] ---[ end trace 0387f1db53e12801 ]---
<0>[ 73.779697] Rebooting in 5 seconds..
<5>[ 78.782326] Going down for restart now
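
For context on the faulting PC: during a scan, kmemleak computes a checksum
over each tracked block by reading the block itself through object->pointer.
From memory, the 3.10-era update_checksum() in mm/kmemleak.c looks roughly
like this (quoted from memory, so verify against your exact tree):

static bool update_checksum(struct kmemleak_object *object)
{
	u32 old_csum = object->checksum;

	if (!kmemcheck_is_obj_initialized(object->pointer, object->size))
		return false;

	/* crc32() reads the tracked memory block itself */
	object->checksum = crc32(0, (void *)object->pointer, object->size);
	return object->checksum != old_csum;
}

So if a tracked block has been freed and its mapping torn down while its
kmemleak_object is still on the list, the checksum pass faults exactly like
this (note that x1, which appears to be the buffer pointer, is
ffffff8001118ffc, just below the faulting address ffffff8001119000).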


Analysis so far :

1. This happens under low memory conditions, when allocation of a
kmemleak_object structure fails (at around 70.732606 seconds after boot).

2. As per the code, we then immediately call kmemleak_disable(), which is an
irreversible operation; its message appears at 70.739803 seconds.

Yet kmemleak_scan() caused the panic at around 70.935886 seconds, even
though disable had been called 0.2 seconds earlier.

So the question is why the scan thread was still running after kmemleak was
disabled. Via postmortem analysis, we started checking whether
kmemleak_disable() bailed out and, if so, at what stage.

1766 static void kmemleak_disable(void)
1767 {
1768 	/* atomically check whether it was already invoked */
1769 	if (cmpxchg(&kmemleak_error, 0, 1))
1770 		return;
1771 
1772 	/* stop any memory operation tracing */
1773 	kmemleak_enabled = 0;
1774 
1775 	/* check whether it is too early for a kernel thread */
1776 	if (kmemleak_initialized)
1777 		schedule_work(&cleanup_work);
1778 
1779 	pr_info("Kernel memory leak detector disabled\n");
1780 }
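
(Side note on the guard at line 1769: cmpxchg() returns the old value, so
only the very first caller proceeds past it. A minimal userspace sketch of
the same run-once pattern, using a GCC builtin in place of the kernel's
cmpxchg(); names are ours, purely illustrative:)

#include <stdio.h>

static int kmemleak_error;

static void disable_once(void)
{
	/* returns the old value: 0 means this caller won the race */
	if (__sync_val_compare_and_swap(&kmemleak_error, 0, 1))
		return;		/* someone already disabled it */

	printf("Kernel memory leak detector disabled\n");
}

int main(void)
{
	disable_once();		/* prints the message */
	disable_once();		/* no-op: flag already set */
	return 0;
}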


1. From line 1769: it would have bailed out only if kmemleak_error was
already 1, i.e. if kmemleak had already been disabled. So here
kmemleak_error must have been 0 and was changed to 1.

2. From the postmortem raw dumps, we see that kmemleak_enabled was indeed
set to 0. Therefore we successfully got this far.

3. After that, we see that the cleanup_work work item was scheduled; it does
the following:

1747 static void kmemleak_do_cleanup(struct work_struct *work)
1748 {
1749 	mutex_lock(&scan_mutex);
1750 	stop_scan_thread();
1751 
1752 	if (!kmemleak_found_leaks)
1753 		__kmemleak_do_cleanup();
1754 	else
1755 		pr_info("Kmemleak disabled without freeing internal data. "
1756 			"Reclaim the memory with \"echo clear > /sys/kernel/debug/kmemleak\"\n");
1757 	mutex_unlock(&scan_mutex);
1758 }

Had line 1750 reached stop_scan_thread(), the kmemleak kthread would have
been stopped and the scan_thread variable set to NULL. For reference
(reconstructed from the diff of our experiment further below),
stop_scan_thread() is:
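
static void stop_scan_thread(void)
{
	if (scan_thread) {
		kthread_stop(scan_thread);
		scan_thread = NULL;
	}
}

However, the postmortem data says otherwise: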

a. The kmemleak thread did not stop: it was still executing 0.2 seconds
after we disabled.
b. The scan_thread variable is perfectly intact and not NULL (snippet
pasted below).

scan_thread = 0xFFFFFFC030B24B40 -> (
state = 0,
stack = 0xFFFFFFC03019C000,
usage = (counter = 2),
flags = 2129984,
ptrace = 0,
wake_entry = (next = 0x0),
on_cpu = 1,
on_rq = 1,
prio = 130,
static_prio = 130,
normal_prio = 130,
rt_priority = 0,
sched_class = 0xFFFFFFC000C03280,
se = (load = (weight = 110, inv_weight = 39045157), run_node =
(__rb_parent_
rt = (run_list = (next = 0xFFFFFFC030B24D08, prev =
0xFFFFFFC030B24D08), tim
ravg = (mark_start = 71528155165, sum = 8467458, demand = 10000000,
sum_hist
sched_task_group = 0xFFFFFFC00D53F200,
fpu_counter = 0,
policy = 0,
nr_cpus_allowed = 8,


Therefore, we checked the possibility that the work function was stuck
acquiring the scan_mutex lock, which would explain why stop_scan_thread()
never executed, and it indeed turned out to be that way (recall that
schedule_work() runs kmemleak_do_cleanup() in a kworker).

Serving work : (struct work_struct *)0xFFFFFFC0014E23B0
work_func : \\vmlinux\kmemleak\kmemleak_do_cleanup
serving worker : kworker/3:1 PID : 32 on CPU : 3

Stack of kworker/3:1 :

-000|__switch_to(prev = 0xFFFFFFC032A60AC0, next = 0xFFFFFFC0014E2360)
-001|context_switch(inline)
-001|__schedule()
-002|schedule()
-003|schedule_preempt_disabled()
-004|__mutex_lock_common(inline)
-004|__mutex_lock_slowpath(lock_count = 0xFFFFFFC0014E2358)
-005|current_thread_info(inline)
-005|mutex_set_owner(inline)
-005|mutex_lock(lock = 0xFFFFFFC0014E2358)
-006|kmemleak_do_cleanup(?)
-007|static_key_false(inline)
-007|trace_workqueue_execute_end(inline)
-007|process_one_work(worker = 0xFFFFFFC032A0D000, work = 0xFFFFFFC0014E23B0)
-008|worker_thread(__worker = 0xFFFFFFC032A0D000)
-009|kthread(_create = 0xFFFFFFC032E03C80)
-010|ret_from_fork(asm)
-->|exception
-011|create_kthread(inline)
-011|kthreadd(?)
---|end of frame


We can see that the kworker went to sleep trying to acquire the mutex and
never came back; its last arrival time backs this claim:

sched_info = (
pcount = 1179,
run_delay = 209410806,
    last_arrival = 70739794902, ---> in nanoseconds: the last time it got a
                                     chance to execute was at ~70.739 s, i.e.
                                     when we called disable; it has been
                                     stuck there since.


Now, checking who owns this mutex, we see that the kmemleak scan thread
holds it:

|__mutex_lock_slowpath(
| lock_count = 0xFFFFFFC0014E2358)
| lock = 0xFFFFFFC0014E2358 -> (
| count = (counter = -1),
| wait_lock = (rlock = (raw_lock = (owner = 4, next = 4), magic =
3735899
|    wait_list = (next = 0xFFFFFFC032A6BCD0, prev = 0xFFFFFFC032A6BCD0),
|    owner = 0xFFFFFFC030B24B40, --> corresponds to the task_struct of the
|        kmemleak thread (same pointer as scan_thread above and as "task:"
|        in the oops)
| name = 0x0,
| magic = 0xFFFFFFC0014E2358)


We also tried the following, to stop the scan thread first and then schedule
the work, but it did not help:


diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 0bd522f..6105dfe 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1460,7 +1460,6 @@ static void start_scan_thread(void)
 static void stop_scan_thread(void)
 {
 	if (scan_thread) {
-		kthread_stop(scan_thread);
 		scan_thread = NULL;
 	}
 }
@@ -1747,6 +1746,9 @@ static void kmemleak_disable(void)
 	/* stop any memory operation tracing */
 	atomic_set(&kmemleak_enabled, 0);
 
+	if (scan_thread)
+		kthread_stop(scan_thread);
+
 	/* check whether it is too early for a kernel thread */



Therefore, in summary: the cleanup_work scheduled by kmemleak_disable() is
stuck acquiring scan_mutex, and that mutex is held by the kmemleak scan
thread. This presumably also explains why the patch above did not help:
kthread_stop() only completes once the scan thread checks
kthread_should_stop(), which happens between scans, so a scan already in
progress keeps running and faulting.
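
As a toy illustration of the lock-up (userspace pthreads standing in for the
scan kthread and the kworker; names are ours, purely illustrative):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t scan_mutex = PTHREAD_MUTEX_INITIALIZER;

/* models kmemleak_scan_thread(): holds scan_mutex for a long scan */
static void *scan_thread_fn(void *arg)
{
	pthread_mutex_lock(&scan_mutex);
	printf("scan: holding scan_mutex, scanning...\n");
	sleep(5);			/* the crash happens in this window */
	pthread_mutex_unlock(&scan_mutex);
	return NULL;
}

/* models kmemleak_do_cleanup(): blocks before it can stop the scanner */
static void *cleanup_work_fn(void *arg)
{
	printf("cleanup: waiting for scan_mutex...\n");
	pthread_mutex_lock(&scan_mutex);
	printf("cleanup: got it, would stop_scan_thread() here\n");
	pthread_mutex_unlock(&scan_mutex);
	return NULL;
}

int main(void)
{
	pthread_t scan, cleanup;

	pthread_create(&scan, NULL, scan_thread_fn, NULL);
	sleep(1);			/* scan wins the mutex first */
	pthread_create(&cleanup, NULL, cleanup_work_fn, NULL);
	pthread_join(scan, NULL);
	pthread_join(cleanup, NULL);
	return 0;
}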

In the case above, the time difference between the disable call and the
crash is close to 0.2 seconds, but we have other instances where it took
longer, closer to a second:

===
81.797655: <6> kmemleak: Cannot allocate a kmemleak_object structure
81.830707: <6> kmemleak: Kernel memory leak detector disabled
<snip> <snip>
83.015406: <6> Unable to handle kernel paging request at virtual address ffffff80013bd000
83.022284: <6> pgd = ffffffc00007d000
83.025666: <2> [ffffff80013bd000] *pgd=000000008c307003, *pmd=00000000b24af003, *pte=0000000000000000
83.034610: <6> Internal error: Oops: 96000007 [#1] PREEMPT SMP
83.040163: <2> Modules linked in: wlan(O)
83.043899: <6> CPU: 2 PID: 204 Comm: kmemleak Tainted: G O 3.10.49-g7ed4272-00237-g8f9114d #1
83.053011: <6> task: ffffffc030b54080 ti: ffffffc030214000 task.ti: ffffffc030214000
83.060481: <2> PC is at crc32_le+0x78/0x140
83.064384: <2> LR is at kmemleak_scan+0x2d0/0x3d4


====


97.092890: <6> kmemleak: Cannot allocate a kmemleak_object structure
97.098695: <6> kmemleak: Kernel memory leak detector disabled
<snip> <snip>
98.149826: <6> Unable to handle kernel paging request at virtual
address ffffff800038a000
98.156704: <6> pgd = ffffffc00007d000
84.347644: <2> msm_pm_qos_update_request: update request -1
98.160088: <4> [ffffff800038a000] *pgd=000000008c307003
98.164862: <2> , *pmd=00000000b2a36003, *pte=0000000000000000
98.164874: <6> Internal error: Oops: 96000007 [#1] PREEMPT SMP
98.170418: <2> Modules linked in:
98.173462: <6> CPU: 2 PID: 203 Comm: kmemleak Not tainted
3.10.49-g18d73a2 #1
98.180317: <6> task: ffffffc0304bd600 ti: ffffffc030204000 task.ti:
ffffffc030204000
98.187784: <2> PC is at crc32_le+0x78/0x140
98.191688: <2> LR is at kmemleak_scan+0x2d0/0x3d4

====

Any debug pointers or suggestions would be really helpful.

Thanks and regards,
Vignesh Radhakrishnan

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project



