Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

From: Jason Gunthorpe
Date: Fri Aug 16 2019 - 10:31:50 EST

On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
> On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > > >
> > > > > > The last detail is I'm still unclear what a GFP flags a blockable
> > > > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > > >
> > > > > I hope I will not make this muddy again ;)
> > > > > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > > > > allocation allowed in the context it is called from.
> > > >
> > > > 'in the context is is called from' is the magic phrase, as
> > > > invalidate_range_start is called while holding several different mm
> > > > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > > > (write?)
> > > >
> > > > Can GFP_KERNEL be called while holding those locks?
> > >
> > > i_mmap_rwsem would be problematic because it is taken during the
> > > reclaim.
> >
> > Okay.. So the fs_reclaim debugging does catch errors.
> I do not think fs_reclaim is the udnerlying mechanism to catch this
> deadlock.

Indeed lockdep would catch it directly as an AA deadlock, but only if
we happen to take the reclaim path under the kmalloc in the notifier
and lucked into it also choosing to lock the same lock we are holding.

The trouble is this is very difficult to hit.

lockdep allows making this less difficult. For instance with a
'might_reclaim()' annotation in the allocator we could check that the
various reclaim related locks are obtainable. Testing doesn't need to
get lucky and go down the very unlikely path.

It turns out this is already happening, it is actually a side effect
of the way fs_reclaim works now.

> > Do you have any
> > reference for what a false positive looks like?
> I believe I have given some examples when introducing __GFP_NOLOCKDEP.

Okay, I think that is 7e7844226f10 ("lockdep: allow to disable reclaim
lockup detection") Hmm, sadly the lkml link in the commit is broken.

Hum. There are no users of __GFP_NOLOCKDEP today?? Could all the false
positives have been fixed??

Keep in mind that this fs_reclaim was reworked away from the hacky
interrupt context flag to a fairly elegant real lock map.

Based on my read of the commit message, my first reaction would be to
suggest lockdep_set_class() to solve the problem described, ie
different classes for 'inside transaction' and 'outside transaction'
on xfs_refcountbt_init_cursor()

I understood that generally that is the way to handle lockdep false

Anyhow, if you are willing to consider that lockdep isn't broken, I
have some ideas on how to make this clearer and increase
coverage. Would you be willing to look at patches on this topic? (not
soon, I have to finish my mmu notifier stuff)

> > I would like to inject it into the notifier path as this is very
> > difficult for driver authors to discover and know about, but I'm
> > worried about your false positive remark.
> >
> > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > we need a strategy to handle OOM to guarentee forward progress.
> Your example is from the notifier registration IIUC.

Yes, that is where this commit hit it.. Triggering this under an
actual notifier to get a lockdep report is hard.

> Can you pre-allocate before taking locks? Could you point me to some
> examples when the allocation is necessary in the range notifier
> callback?

Hmm. I took a careful look, I only found mlx5 as obviously allocating

__get_free_pages(gfp, get_order(size));

However, I think this could be changed to fall back to some small
buffer if allocation fails. The existing scheme looks sketchy

nouveau does:


But I think it reliably uses a stack buffer here

i915 I think Daniel said he audited.

amd_mn.. The actual invalidate_range_start does not allocate memory,
but it is entangled with so many locks it would need careful analysis
to be sure.

The others look generally OK, which is good, better than I hoped :)