Re: [PATCH 1/2] mm, oom: introduce oom reaper

From: David Rientjes
Date: Mon Feb 01 2016 - 22:02:15 EST


On Thu, 28 Jan 2016, Michal Hocko wrote:

> [...]
> > > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > > +{
> > > + struct mmu_gather tlb;
> > > + struct vm_area_struct *vma;
> > > + struct zap_details details = {.check_swap_entries = true,
> > > + .ignore_dirty = true};
> > > + bool ret = true;
> > > +
> > > + /* We might have raced with exit path */
> > > + if (!atomic_inc_not_zero(&mm->mm_users))
> > > + return true;
> > > +
> > > + if (!down_read_trylock(&mm->mmap_sem)) {
> > > + ret = false;
> > > + goto out;
> > > + }
> > > +
> > > + tlb_gather_mmu(&tlb, mm, 0, -1);
> > > + for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > + if (is_vm_hugetlb_page(vma))
> > > + continue;
> > > +
> > > + /*
> > > + * mlocked VMAs require explicit munlocking before unmap.
> > > + * Let's keep it simple here and skip such VMAs.
> > > + */
> > > + if (vma->vm_flags & VM_LOCKED)
> > > + continue;
> >
> > Shouldn't there be VM_PFNMAP handling here?
>
> What would be the reason to exclude them?
>

Not exclude them, but I would have expected untrack_pfn().
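
I.e. something along these lines in the iteration, mirroring what
unmap_single_vma() does for such mappings (sketch only, exact placement
is up to you):

	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
		/*
		 * Drop the pfn tracking state for remapped pfn ranges
		 * before tearing the range down, like the regular unmap
		 * path does.
		 */
		untrack_pfn(vma, 0, vma->vm_end - vma->vm_start);
	}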

> > I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't
> > used here for simplicity?
>
> I didn't use zap_page_range because I wanted to have full control over
> what gets torn down and how. E.g. it is much easier to skip over
> hugetlb pages than to rely on i_mmap_lock_write, which might be blocked
> and leave the whole oom_reaper stuck.
>
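
For reference, the zap_page_range() variant I had in mind was just the
straightforward loop below (untested sketch, reusing the zap_details and
the mm_users/mmap_sem handling from your patch):

	struct zap_details details = { .check_swap_entries = true,
				       .ignore_dirty = true };
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_LOCKED))
			continue;
		zap_page_range(vma, vma->vm_start,
			       vma->vm_end - vma->vm_start, &details);
	}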

Let me be clear that I think the implementation is fine, minus the missing
handling for VM_PFNMAP. However, I think this would be better placed in
mm/memory.c: do the iteration and the selection criteria there, and then
call unmap_page_range(). I don't think we should be exposing
unmap_page_range() globally; rather, add a new function to do the
iteration in mm/memory.c alongside the others.
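
Roughly what I have in mind (the function name and exact signature are
made up purely to illustrate the shape; the mm_users pinning and the
mmap_sem trylock would stay with the caller in mm/oom_kill.c):

	/* mm/memory.c */
	void unmap_vmas_for_oom(struct mm_struct *mm)
	{
		struct mmu_gather tlb;
		struct vm_area_struct *vma;
		struct zap_details details = { .check_swap_entries = true,
					       .ignore_dirty = true };

		tlb_gather_mmu(&tlb, mm, 0, -1);
		for (vma = mm->mmap; vma; vma = vma->vm_next) {
			/*
			 * Selection criteria live here, next to
			 * unmap_page_range() (plus the VM_PFNMAP handling
			 * mentioned above).
			 */
			if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_LOCKED))
				continue;
			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
					 &details);
		}
		tlb_finish_mmu(&tlb, 0, -1);
	}

That keeps unmap_page_range() static to mm/memory.c and keeps the vma walk
next to the rest of the unmap code.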

> [...]
> > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > +{
> > > + struct mm_struct *old_mm;
> > > +
> > > + if (!oom_reaper_th)
> > > + return;
> > > +
> > > + /*
> > > + * Pin the given mm. Use mm_count instead of mm_users because
> > > + * we do not want to delay the address space tear down.
> > > + */
> > > + atomic_inc(&mm->mm_count);
> > > +
> > > + /*
> > > + * Make sure that only a single mm is ever queued for the reaper
> > > + * because multiple are not necessary and the operation might be
> > > + * disruptive so better reduce it to the bare minimum.
> > > + */
> > > + old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > > + if (!old_mm)
> > > + wake_up(&oom_reaper_wait);
> > > + else
> > > + mmdrop(mm);
> >
> > This behavior is probably the only really significant concern I have about
> > the patch: we just drop the mm and don't try any reaping if there is
> > already reaping in progress.
>
> This is based on the assumption that OOM killer will not select another
> task to kill until the previous one drops its TIF_MEMDIE. Should this
> change in the future we will have to come up with a queuing mechanism. I
> didn't want to do it right away to make the change as simple as
> possible.
>

The problem is that this is racy and quite easy to trigger: imagine
__oom_reap_vmas() finding mm->mm_users == 0 because the victim's memory
has already been freed, and then another system-wide oom condition
occurring before the oom reaper has reset mm_to_reap to NULL. No
synchronization prevents that from happening (and I'm not sure what the
reference to TIF_MEMDIE is about).

In that case, the oom reaper simply ignores the next victim and does
nothing; this easy-to-hit race prevents it from zapping memory and so
does nothing to reduce the livelock probability.

This can be solved either by queueing mm's to reap or by involving the oom
reaper in the oom killer synchronization itself.
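
Even something as simple as a list protected by a spinlock, in place of
the single mm_to_reap pointer, would avoid dropping victims on the floor.
A rough sketch (mm_reap_link is a made-up list_head that would have to be
added to struct mm_struct; the reaper would pop entries under the same
lock):

	static LIST_HEAD(oom_reap_list);
	static DEFINE_SPINLOCK(oom_reap_lock);

	static void wake_oom_reaper(struct mm_struct *mm)
	{
		if (!oom_reaper_th)
			return;

		/* Pin mm_count as before so the mm cannot go away. */
		atomic_inc(&mm->mm_count);

		spin_lock(&oom_reap_lock);
		list_add_tail(&mm->mm_reap_link, &oom_reap_list);
		spin_unlock(&oom_reap_lock);

		wake_up(&oom_reaper_wait);
	}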

> > > +static int __init oom_init(void)
> > > +{
> > > + oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > + if (IS_ERR(oom_reaper_th)) {
> > > + pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > + PTR_ERR(oom_reaper_th));
> > > + oom_reaper_th = NULL;
> > > + } else {
> > > + struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > > +
> > > + /*
> > > + * Make sure our oom reaper thread will get scheduled
> > > + * ASAP and that it won't get preempted by malicious userspace.
> > > + */
> > > + sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> >
> > Eeek, have you really shown this is necessary? I would imagine that we
> > would want to limit high priority processes system-wide and that we
> > wouldn't want to be interfered with by memcg oom conditions that trigger
> > the oom reaper, for example.
>
> The idea was that we do not want to allow a high priority userspace to
> preempt this important operation. I do understand your concern about the
> memcg oom interference but I find it more important that oom_reaper is
> runnable when needed. I guess that memcg oom heavy loads can change the
> priority from userspace if necessary?
>

I'm baffled by any reference to "memcg oom heavy loads"; I don't
understand this paragraph, sorry. If a memcg is oom, we shouldn't be
disrupting the global runqueue by running oom_reaper at a high priority.
The disruption is not only the first wakeup but also how long the reaper
can run and when it gets rescheduled: for a large amount of memory this is
potentially long. The reaper is best-effort, as the changelog indicates,
and we shouldn't rely on this high priority: exit after an oom kill can't
possibly be expected to be immediate. The high priority should be removed
so memcg oom conditions are isolated and don't affect other loads.

"Memcg oom heavy loads" cannot always be determined and the suggested fix
cannot possibly be to adjust the priority of a global resource. ??
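
Concretely, I'd simply drop the sched_setscheduler() call and leave the
kthread at normal priority, i.e. something like this (sketch):

	static int __init oom_init(void)
	{
		oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
		if (IS_ERR(oom_reaper_th)) {
			pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
					PTR_ERR(oom_reaper_th));
			oom_reaper_th = NULL;
		}
		return 0;
	}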