Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback
From: Michal Hocko
Date: Tue Dec 11 2018 - 11:21:57 EST
On Tue 11-12-18 18:15:42, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
[...]
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	vm_fault_t ret;
> >
> > +	/*
> > +	 * Preallocate pte before we take page_lock because this might lead to
> > +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> > +	 */
> > +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> > +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> > +		if (!vmf->prealloc_pte)
> > +			return VM_FAULT_OOM;
> > +		smp_wmb(); /* See comment in __pte_alloc() */
> > +	}
> > +
> >  	ret = vma->vm_ops->fault(vmf);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> >  			    VM_FAULT_DONE_COW)))
>
> Sorry, but I don't think it fixes anything. Just hides it a level deeper.
>
> The trick with ->prealloc_pte works for faultaround because we can rely on
> ->map_pages() to not sleep and we know how it will setup page table entry.
> Basically, core controls most of the path.
>
> It's not the case with ->fault(). It is free to sleep and allocate
> whatever it wants.
Yeah, but if the fault callback wants to allocate then it has to
consider the usual allocation restrictions, e.g. NOFS if the allocation
itself can trip over fs locks.
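
To illustrate the restriction mentioned above, here is a minimal userspace
sketch of how a NOFS mask differs from GFP_KERNEL. The TOY_* bit values and
the toy_* helpers are made up for illustration and are not kernel code; only
the composition of the two masks (reclaim + IO, with or without FS) mirrors
the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag bits for illustration only; NOT the kernel's real values. */
#define TOY_GFP_IO       0x1u
#define TOY_GFP_FS       0x2u
#define TOY_GFP_RECLAIM  0x4u
#define TOY_GFP_ACCOUNT  0x8u

/* Same composition as the kernel's GFP_KERNEL and GFP_NOFS. */
#define TOY_GFP_KERNEL  (TOY_GFP_RECLAIM | TOY_GFP_IO | TOY_GFP_FS)
#define TOY_GFP_NOFS    (TOY_GFP_RECLAIM | TOY_GFP_IO)

/* Hypothetical helper: may reclaim for this allocation call into the fs? */
static bool toy_reclaim_may_enter_fs(unsigned int gfp)
{
	return (gfp & TOY_GFP_FS) != 0;
}

/* Hypothetical helper: choose a mask depending on whether fs locks are held. */
static unsigned int toy_pick_gfp(bool holds_fs_lock)
{
	return holds_fs_lock ? TOY_GFP_NOFS : TOY_GFP_KERNEL;
}

/* Small self-check of the model. */
void toy_gfp_self_check(void)
{
	/* A plain or accounted GFP_KERNEL allocation may recurse into the fs. */
	assert(toy_reclaim_may_enter_fs(TOY_GFP_KERNEL));
	assert(toy_reclaim_may_enter_fs(TOY_GFP_KERNEL | TOY_GFP_ACCOUNT));
	/* A caller holding fs locks gets a mask that cannot. */
	assert(!toy_reclaim_may_enter_fs(toy_pick_gfp(true)));
}
```

The model only covers the FS bit; it says nothing about what memcg reclaim
does for an accounted allocation.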
> For instance, DAX page fault will setup page table entry on its own and
> return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
> and ignores your pre-allocated page table.
Does this happen with a page locked and with a __GFP_ACCOUNT allocation? I
am not familiar with that code but I do not see it from a quick look.
> But it's just an example. The problem is that ->fault() is not bounded on
> what it can do, unlike ->map_pages().
That is a fair point but the primary issue here is that the generic #PF
code breaks the underlying assumption and performs
__GFP_ACCOUNT|GFP_KERNEL allocation from within a fs owned locked page.
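
The ordering problem can be modelled in userspace roughly as follows. This
is only a toy sketch: struct toy_fault and the toy_* functions are
hypothetical names, and the model merely counts page table allocations made
while the page is locked. It does not reproduce the real fault path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state relevant to the ordering problem. */
struct toy_fault {
	bool page_locked;    /* true while the fs-owned page is locked */
	bool have_prealloc;  /* models vmf->prealloc_pte != NULL */
	int unsafe_allocs;   /* allocations performed under the page lock */
};

/* Models the accounted GFP_KERNEL page table allocation. */
static void toy_alloc_pte(struct toy_fault *f)
{
	if (f->page_locked)
		f->unsafe_allocs++;
	f->have_prealloc = true;
}

/* Models the generic fault path with or without the preallocation. */
static void toy_do_fault(struct toy_fault *f, bool prealloc_first)
{
	if (prealloc_first && !f->have_prealloc)
		toy_alloc_pte(f);       /* patched ordering: before the lock */

	f->page_locked = true;          /* ->fault() returns a locked page */
	if (!f->have_prealloc)
		toy_alloc_pte(f);       /* old ordering: under the lock */
	f->page_locked = false;
}

/* Small self-check of the model. */
void toy_fault_self_check(void)
{
	struct toy_fault old_order = { false, false, 0 };
	struct toy_fault new_order = { false, false, 0 };

	toy_do_fault(&old_order, false);
	toy_do_fault(&new_order, true);

	assert(old_order.unsafe_allocs == 1);
	assert(new_order.unsafe_allocs == 0);
}
```

This covers only the allocation done by the generic code; whatever the
->fault() callback allocates on its own is outside the model.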
--
Michal Hocko
SUSE Labs