Re: [PATCH v2 1/1] rcu/tree: Drop the lock before entering to page allocator

From: Paul E. McKenney
Date: Wed Oct 28 2020 - 17:47:36 EST


On Wed, Oct 28, 2020 at 12:10:06PM +0000, Chris Wilson wrote:
> Quoting Uladzislau Rezki (Sony) (2020-07-27 22:10:12)
> > If the kernel is built with CONFIG_PROVE_RAW_LOCK_NESTING
> > option, the lockedp will complain about violation of the
> > nesting rules:
> >
> > <snip>
> > [ 28.060389] =============================
> > [ 28.060389] [ BUG: Invalid wait context ]
> > [ 28.060389] 5.8.0-rc3-rcu #211 Tainted: G E
> > [ 28.060389] -----------------------------
> > [ 28.060390] vmalloc_test/0/523 is trying to lock:
> > [ 28.060390] ffff96df7ffe0228 (&zone->lock){-.-.}-{3:3}, at: get_page_from_freelist+0xcf0/0x16d0
> > [ 28.060391] other info that might help us debug this:
> > [ 28.060391] context-{5:5}
> > [ 28.060392] 2 locks held by vmalloc_test/0/523:
> > [ 28.060392] #0: ffffffffc06750d0 (prepare_for_test_rwsem){++++}-{4:4}, at: test_func+0x76/0x240 [test_vmalloc]
> > [ 28.060393] #1: ffff96df5fa1d390 (krc.lock){..-.}-{2:2}, at: kvfree_call_rcu+0x5c/0x230
> > [ 28.060395] stack backtrace:
> > [ 28.060395] CPU: 0 PID: 523 Comm: vmalloc_test/0 Tainted: G E 5.8.0-rc3-rcu #211
> > [ 28.060395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > [ 28.060396] Call Trace:
> > [ 28.060397] dump_stack+0x96/0xd0
> > [ 28.060397] __lock_acquire.cold.65+0x166/0x2d7
> > [ 28.060398] ? find_held_lock+0x2d/0x90
> > [ 28.060399] lock_acquire+0xad/0x370
> > [ 28.060400] ? get_page_from_freelist+0xcf0/0x16d0
> > [ 28.060401] ? mark_held_locks+0x48/0x70
> > [ 28.060402] _raw_spin_lock+0x25/0x30
> > [ 28.060403] ? get_page_from_freelist+0xcf0/0x16d0
> > [ 28.060404] get_page_from_freelist+0xcf0/0x16d0
> > [ 28.060405] ? __lock_acquire+0x3ee/0x1b90
> > [ 28.060407] __alloc_pages_nodemask+0x16a/0x3a0
> > [ 28.060408] __get_free_pages+0xd/0x30
> > [ 28.060409] kvfree_call_rcu+0x18a/0x230
> > <snip>
>
> We've encountered the same warning and applying this patches resolves
> the issue.
>
> > Internally the kfree_rcu() uses raw_spinlock_t whereas the
> > page allocator internally deals with spinlock_t to access
> > to its zones.
> >
> > In order to prevent such vialation that is in question we
> > can drop the internal raw_spinlock_t before entering to
> > the page allocaor.
> >
> > Short changelog (v1 -> v2):
> > - rework the commit message;
> > - rework the patch making it smaller;
> > - add more comments.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> > ---
> > kernel/rcu/tree.c | 27 +++++++++++++++++++++++++++
> > 1 file changed, 27 insertions(+)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 21c2fa5bd8c3..2de112404121 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3287,6 +3287,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > return false;
> >
> > lockdep_assert_held(&krcp->lock);
> > + lockdep_assert_irqs_disabled();
> > +
> > idx = !!is_vmalloc_addr(ptr);
> >
> > /* Check if a new block is required. */
> > @@ -3306,6 +3308,29 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > return false;
> >
> > + /*
> > + * If built with CONFIG_PROVE_RAW_LOCK_NESTING option,
> > + * the lockedp will complain about violation of the
> > + * nesting rules. It does the raw_spinlock vs. spinlock
> > + * nesting checks.
> > + *
> > + * That is why we drop the raw lock. Please note IRQs are
> > + * still disabled it guarantees that the "current" stays
> > + * on the same CPU later on when the raw lock is taken
> > + * back.
> > + *
> > + * It is important because if the page allocator is invoked
> > + * in fully preemptible context, it can be that we get a page
> > + * but end up on another CPU. That another CPU might not need
> > + * a page because of having some extra spots in its internal
> > + * array for pointer collecting. Staying on same CPU eliminates
> > + * described issue.
> > + *
> > + * Dropping the lock also reduces the critical section by
> > + * the time taken by the page allocator to obtain a page.
> > + */
> > + raw_spin_unlock(&krcp->lock);
>
> Aiui, this makes the CONFIG_PREEMPT_RT check and subsequent comment redundant.
>
> > /*
> > * NOTE: For one argument of kvfree_rcu() we can
> > * drop the lock and get the page in sleepable
> > @@ -3315,6 +3340,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > */
> > bnode = (struct kvfree_rcu_bulk_data *)
> > __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > +
> > + raw_spin_lock(&krcp->lock);
> > }
>
> The consequence of dropping the lock here is that we may create two
> bnodes and add them both to krcp->bkvhead[idx]. The list remains intact,
> but the second entry may be underutilised. That should be transient.
>
> It doesn't seem like the patch will break things, and removes the
> lockdep splat,
> Tested-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

Thank you for testing this, Chris! We had objections to all patches
thus far, so we are going back to the drawing board, hopefully having
something for the v5.11 merge window.

Thanx, Paul