Re: [RFC] memory reserve for userspace oom-killer
From: Peter.Enderborg
Date: Thu Apr 22 2021 - 01:39:43 EST
On 4/21/21 9:18 PM, Shakeel Butt wrote:
> On Wed, Apr 21, 2021 at 11:46 AM <Peter.Enderborg@xxxxxxxx> wrote:
>> On 4/21/21 8:28 PM, Shakeel Butt wrote:
>>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
>>> <peter.enderborg@xxxxxxxx> wrote:
>>>> On 4/20/21 3:44 AM, Shakeel Butt wrote:
>>> [...]
>>>> I think this is the wrong way to go.
>>> Which one? Are you talking about the kernel one? We already talked out
>>> of that. To decide to OOM, we need to look at a very diverse set of
>>> metrics and it seems like that would be very hard to do flexibly
>>> inside the kernel.
>> You dont need to decide to oom, but when oom occurs you
>> can take a proper action.
> No, we want the flexibility to decide when to oom-kill. Kernel is very
> conservative in triggering the oom-kill.
It wont do it for you. We use this code to solve that:
/*
* lowmemorykiller_oom
*
* Author: Peter Enderborg <peter.enderborg@xxxxxxxxxxxxxx>
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*/
/* add fake print format with original module name */
#define pr_fmt(fmt) "lowmemorykiller: " fmt
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/oom.h>
#include <trace/events/lmk.h>
#include "lowmemorykiller.h"
#include "lowmemorykiller_tng.h"
#include "lowmemorykiller_stats.h"
#include "lowmemorykiller_tasks.h"
/**
* lowmemorykiller_oom_notify - OOM notifier
* @self: notifier block struct
* @notused: not used
* @parm: returned - number of pages freed
*
* Return value:
* NOTIFY_OK
**/
static int lowmemorykiller_oom_notify(struct notifier_block *self,
unsigned long notused, void *param)
{
struct lmk_rb_watch *lrw;
unsigned long *nfreed = param;
lowmem_print(2, "oom notify event\n");
*nfreed = 0;
lmk_inc_stats(LMK_OOM_COUNT);
spin_lock_bh(&lmk_task_lock);
lrw = __lmk_task_first();
if (lrw) {
struct task_struct *selected = lrw->tsk;
struct lmk_death_pending_entry *ldpt;
if (!task_trylock_lmk(selected)) {
lmk_inc_stats(LMK_ERROR);
lowmem_print(1, "Failed to lock task.\n");
lmk_inc_stats(LMK_BUSY);
goto unlock_out;
}
get_task_struct(selected);
/* move to kill pending set */
ldpt = kmem_cache_alloc(lmk_dp_cache, GFP_ATOMIC);
/* if we fail to alloc we ignore the death pending list */
if (ldpt) {
ldpt->tsk = selected;
__lmk_death_pending_add(ldpt);
} else {
WARN_ON(1);
lmk_inc_stats(LMK_MEM_ERROR);
trace_lmk_sigkill(selected->pid, selected->comm,
LMK_TRACE_MEMERROR, *nfreed, 0);
}
if (!__lmk_task_remove(selected, lrw->key))
WARN_ON(1);
spin_unlock_bh(&lmk_task_lock);
*nfreed = get_task_rss(selected);
send_sig(SIGKILL, selected, 0);
LMK_TAG_TASK_DIE(selected);
trace_lmk_sigkill(selected->pid, selected->comm,
LMK_TRACE_OOMKILL, *nfreed,
0);
task_unlock(selected);
put_task_struct(selected);
lmk_inc_stats(LMK_OOM_KILL_COUNT);
goto out;
}
unlock_out:
spin_unlock_bh(&lmk_task_lock);
out:
return NOTIFY_OK;
}
static struct notifier_block lowmemorykiller_oom_nb = {
.notifier_call = lowmemorykiller_oom_notify
};
int __init lowmemorykiller_register_oom_notifier(void)
{
register_oom_notifier(&lowmemorykiller_oom_nb);
return 0;
}
So what needed is a function that knows the
priority. Here __lmk_task_first() that is from a rb-tree.
You can pick what ever priority you like. In our case it is a
android so it is a strictly oom_adj order in the tree.
I think you can do the same with a old lowmemmorykiller style with
a full task scan too.
> [...]
>>> Actually no. It is missing the flexibility to monitor metrics which a
>>> user care and based on which they decide to trigger oom-kill. Not sure
>>> how will watchdog replace psi/vmpressure? Userspace keeps petting the
>>> watchdog does not mean that system is not suffering.
>> The userspace should very much do what it do. But when it
>> does not do what it should do, including kick the WD. Then
>> the kernel kicks in and kill a pre defined process or as many
>> as needed until the monitoring can start to kick and have the
>> control.
>>
> Roman already suggested something similar (i.e. oom-killer core and
> extended and core watching extended) but completely in userspace. I
> don't see why we would want to do that in the kernel instead.
A watchdog in kernel will work if userspace is completely broken
or staved with low on memory.
>
>>> In addition oom priorities change dynamically and changing it in your
>>> system seems very hard. Cgroup awareness is missing too.
>> Why is that hard? Moving a object in a rb-tree is as good it get.
>>
> It is a group of objects. Anyways that is implementation detail.
>
> The message I got from this exchange is that we can have a watchdog
> (userspace or kernel) to further improve the reliability of userspace
> oom-killers.