Re: next-20200515: Xorg killed due to "OOM"

From: Michal Hocko
Date: Mon Jun 01 2020 - 05:31:55 EST


On Sun 31-05-20 14:16:01, Pavel Machek wrote:
> On Thu 2020-05-28 14:07:50, Michal Hocko wrote:
> > On Thu 28-05-20 14:03:54, Pavel Machek wrote:
> > > On Thu 2020-05-28 11:05:17, Michal Hocko wrote:
> > > > On Tue 26-05-20 11:10:54, Pavel Machek wrote:
> > > > [...]
> > > > > [38617.276517] oom_reaper: reaped process 31769 (chromium), now anon-rss:0kB, file-rss:0kB, shmem-rss:7968kB
> > > > > [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
> > > > > [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 5.7.0-rc5-next-20200515+ #117
> > > > > [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 03/31/2011
> > > > > [38617.277266] Call Trace:
> > > > > [38617.277286] dump_stack+0x54/0x6e
> > > > > [38617.277300] dump_header+0x45/0x321
> > > > > [38617.277313] oom_kill_process.cold+0x9/0xe
> > > > > [38617.277324] ? out_of_memory+0x167/0x420
> > > > > [38617.277336] out_of_memory+0x1f2/0x420
> > > > > [38617.277348] pagefault_out_of_memory+0x34/0x56
> > > > > [38617.277361] mm_fault_error+0x4a/0x130
> > > > > [38617.277372] do_page_fault+0x3ce/0x416
> > > >
> > > > The reason the OOM killer has been invoked is that the page fault
> > > > handler has returned VM_FAULT_OOM. So this is not a result of the page
> > > > allocator struggling to allocate memory. It would be interesting to
> > > > check which code path has returned this.
> > >
> > > Should the core WARN_ON if that happens and there's enough memory, or
> > > something like that?
> >
> > I wish it would simply go away. There shouldn't be really any reason for
> > VM_FAULT_OOM to exist. The real low on memory situation is already
> > handled in the page allocator.
>
> Umm. Maybe the WARN_ON is first step in that direction? So we can see
> what driver actually did that, and complain to its authors?

This is much harder to do than it seems. But maybe this doesn't really
need full coverage. Some of the code paths which return VM_FAULT_OOM
will simply not fail. Checking for vma->vm_ops->fault() failures, though,
might be interesting. Does the following tell you more about the failure
you are seeing?

diff --git a/mm/memory.c b/mm/memory.c
index 9ab00dcb95d4..5ff023ab7b49 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3442,8 +3442,11 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)

 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
-			    VM_FAULT_DONE_COW)))
+			    VM_FAULT_DONE_COW))) {
+		if (unlikely(ret & VM_FAULT_OOM))
+			pr_warn("VM_FAULT_OOM returned from %ps\n", vma->vm_ops->fault);
 		return ret;
+	}
 
 	if (unlikely(PageHWPoison(vmf->page))) {
 		if (ret & VM_FAULT_LOCKED)

--
Michal Hocko
SUSE Labs