Re: [PATCH RESEND v2 1/1] fix a dead loop when in heavy low memory

From: David Rientjes
Date: Sun Dec 27 2015 - 20:27:12 EST


On Sun, 27 Dec 2015, Figo Zhang wrote:

> Android System UI hang when run heavy monkey stress test.
>
> V2: add more detail about how to re-produce this issue, the
> important is install more than 100 apps/games.
>
> Re-produce step:
> Run this monkey stress test script with more than 100
> apps/games installed:
>
> adb shell "monkey --ignore-crashes --ignore-timeouts
> --kill-process-after-error --ignore-security-exceptions
> --throttle 200 -v 20000000"
>
> kernel log:
> [ 1526.272125] lowmem_scan start: 128, 213da, ofree -9849 34419, ma 529
> [ 1526.272260] lowmemorykiller: select 'dTi-lm' (27289), adj 647, size 10630, to kill
> [ 1526.272299] lowmem_d_timeout=4296194081
> [ 1526.272303] Killing 'dTi-lm' (27289), adj 647,
> [ 1526.272303] to free 42520kB on behalf of 'servicemanager' (2365) because
> [ 1526.272303] cache 137676kB is below limit 221184kB for oom_score_adj 529
> [ 1526.272303] Free memory is -39396kB above reserved
> [ 1526.272304] lowmem_scan end: 128, 213da, return 10630
> [ 1526.272710] lowmem_scan start: 128, 213da, ofree -9849 34373, ma 529
> [ 1526.272832] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296193081, 4296194081
> [ 1526.274450] lowmem_scan start: 128, 280da, ofree -9601 34327, ma 529
> [ 1526.274695] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296193083, 4296194081
> [ 1526.282292] lowmem_scan start: 128, 213da, ofree -9703 34327, ma 529
> [ 1526.282727] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296193090, 4296194081
> [ 1526.316888] lowmem_scan start: 128, 213da, ofree -9766 34465, ma 529
> [ 1526.317019] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296193125, 4296194081
> [ 1526.319311] lowmem_scan start: 128, 213da, ofree -9856 34419, ma 529
> [ 1526.319442] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296193125, 4296194081
> [ 1526.322026] lowmem_scan start: 128, 280da, ofree -9841 34327, ma 529
> [ 1526.360831] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296193166, 4296194081
> [ 1526.532233] lowmem_scan start: 128, 213da, ofree -9846 34511, ma 529
> [ 1526.644046] lowmem_scan start: 128, 213da, ofree -9785 34235, ma 529
> [ 1527.437578] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296194246, 4296195109
> [ 1527.442559] lowmem_scan start: 128, 213da, ofree -9850 41884, ma 529
> [ 1527.459540] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296194268, 4296195109
> [ 1527.500352] lowmem: TIF_MEMDIE, adj=647, dTi-lm, jiffies=4296194309, 4296195109
>
> when this happened, the android system UI will hang, no process can be
> select to kill.
>
> i found the the value of "lowmem_deathpending_timeout" will be modified
> strangely, like in last killing, the value is 4296194081, but why not it
> had changed to 4296195109? so it will cause the deadloop in low memory
> state which will cause the android system UI hang, because no process will
> be kill.
>

I'm assuming that you are loading the lowmem killer as a module since
that's how you would modify lowmem_debug_level. It appears that
lowmem_debug_level is 2 from your kernel log, otherwise part of the log is
missing.

I can tell this since you have a

[ 1526.272260] lowmemorykiller: select 'dTi-lm' (27289), adj 647, size 10630, to kill

line but not a line matching "send sigkill to %d (%s), adj %hd, size %d\n"
with loglevel 1.

I think changing lowmem_debug_level to 1 would help to understand this
issue better.

I think lowmem_deathpending_timeout is getting changed to 4296195109 at

[ 1526.532233] lowmem_scan start: 128, 213da, ofree -9846 34511, ma 529
> HERE <
[ 1526.644046] lowmem_scan start: 128, 213da, ofree -9785 34235, ma 529

However, it appears that the same process, dTi-lm, is still chosen for oom
kill because lowmem_deathpending_timeout has expired.

So this looks like a problem if the constantly chosen process cannot exit.
It would have been helpful to have the stack of pid 27289 in the log to
see where it was stuck. But I think it may be unrelated to
lowmem_deathpending_timeout itself. We'd be better off selecting a
different process to kill with something like this:

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -128,11 +128,15 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
if (!p)
continue;

- if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
- time_before_eq(jiffies, lowmem_deathpending_timeout)) {
- task_unlock(p);
- rcu_read_unlock();
- return 0;
+ if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+ if (time_before_eq(jiffies,
+ lowmem_deathpending_timeout)) {
+ task_unlock(p);
+ rcu_read_unlock();
+ return 0;
+ }
+ /* Need to select a different process to kill */
+ continue;
}
oom_score_adj = p->signal->oom_score_adj;
if (oom_score_adj < min_score_adj) {

But we need more information. Please make sure that lowmem_debug_level is
1, try to get a complete kernel log, and if possible please try to capture
the stack of the process that can't exit (use /proc/<pid>/stack) before
trying the above patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/