Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

From: Jason Gunthorpe
Date: Thu Aug 15 2019 - 16:13:27 EST


On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:

> > The last detail is I'm still unclear what a GFP flags a blockable
> > invalidate_range_start() should use. Is GFP_KERNEL OK?
>
> I hope I will not make this muddy again ;)
> invalidate_range_start in the blockable mode can use/depend on any sleepable
> allocation allowed in the context it is called from.

'in the context is is called from' is the magic phrase, as
invalidate_range_start is called while holding several different mm
related locks. I know at least write mmap_sem and i_mmap_rwsem
(write?)

Can GFP_KERNEL be called while holding those locks?

This is the question of indirect dependency on reclaim via locks you
raised earlier.

> So in other words it is no different from any other function in the
> kernel that calls into allocator. As the API is missing gfp context
> then I hope it is not called from any restricted contexts (except
> from the oom which we have !blockable for).

Yes, the callers are exactly my concern.

> > Lockdep has
> > complained on that in past due to fs_reclaim - how do you know if it
> > is a false positive?
>
> I would have to see the specific lockdep splat.

See below. I found it when trying to understand why the registration
of the mmu notififer was so oddly coded.

The situation was:

down_write(&mm->mmap_sem);
mm_take_all_locks(mm);
kmalloc(GFP_KERNEL); <--- lockdep warning

I understood Daniel said he saw this directly on a recent kernel when
working with his lockdep patch?

Checking myself, on todays kernel I see a call chain:

shrink_all_memory
fs_reclaim_acquire(sc.gfp_mask);
[..]
do_try_to_free_pages
shrink_zones
shrink_node
shrink_node_memcg
shrink_list
shrink_active_list
page_referenced
rmap_walk
rmap_walk_file
i_mmap_lock_read
down_read(i_mmap_rwsem)

So it is possible that the down_read() above will block on
i_mmap_rwsem being held in the caller of invalidate_range_start which
is doing kmalloc(GPF_KERNEL).

Is this OK? The lockdep annotation says no..

Jason

commit 35cfa2b0b491c37e23527822bf365610dbb188e5
Author: Gavin Shan <shangw@xxxxxxxxxxxxxxxxxx>
Date: Thu Oct 25 13:38:01 2012 -0700

mm/mmu_notifier: allocate mmu_notifier in advance

While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
to work in case of tight available memory. Eventually, that would lead to
a deadlock while the swap deamon swaps anonymous pages. It was caused by
commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").

=================================
[ INFO: inconsistent lock state ]
3.7.0-rc1+ #518 Not tainted
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
{RECLAIM_FS-ON-W} state was registered at:
mark_held_locks+0x86/0x150
lockdep_trace_alloc+0x67/0xc0
kmem_cache_alloc_trace+0x33/0x230
do_mmu_notifier_register+0x87/0x180
mmu_notifier_register+0x13/0x20
kvm_dev_ioctl+0x428/0x510
do_vfs_ioctl+0x98/0x570
sys_ioctl+0x91/0xb0
system_call_fastpath+0x16/0x1b