Re: [PATCH] locking/rtmutex: Always use trylock in rt_mutex_trylock()

From: Waiman Long
Date: Mon Oct 07 2024 - 11:23:52 EST



On 10/7/24 10:50 AM, Peter Zijlstra wrote:
On Wed, Oct 02, 2024 at 01:54:16PM -0400, Waiman Long wrote:
On 10/2/24 05:37, Peter Zijlstra wrote:
On Thu, Sep 26, 2024 at 11:13:15AM -0400, Waiman Long wrote:
One reason to use a trylock is to avoid an ABBA deadlock when we need
to use a locking sequence that does not follow the expected locking order. That
requires using trylock all the way through the abnormal locking
sequence. Unfortunately, this is not the case for rt_mutex_trylock(), as
it uses raw_spin_lock_irqsave() to acquire lock->wait_lock. That
can cause a lockdep splat like the following in a PREEMPT_RT kernel:

[ 63.695668] -> #0 (&lock->wait_lock){-...}-{2:2}:
[ 63.695674] check_prev_add+0x1bd/0x1310
[ 63.695678] validate_chain+0x6cf/0x7c0
[ 63.695682] __lock_acquire+0x879/0xf40
[ 63.695686] lock_acquire.part.0+0xfa/0x2d0
[ 63.695690] _raw_spin_lock_irqsave+0x46/0x90
[ 63.695695] rt_mutex_slowtrylock+0x3f/0xb0
[ 63.695699] rt_spin_trylock+0x13/0xc0
[ 63.695702] rmqueue_pcplist+0x5b/0x180
[ 63.695705] rmqueue+0xb01/0x1040
[ 63.695708] get_page_from_freelist+0x1d0/0xa00
[ 63.695712] __alloc_pages_noprof+0x19a/0x450
[ 63.695716] alloc_pages_mpol_noprof+0xaf/0x1e0
[ 63.695721] stack_depot_save_flags+0x4db/0x520
[ 63.695727] kasan_save_stack+0x3f/0x50
[ 63.695731] __kasan_record_aux_stack+0x8e/0xa0
[ 63.695736] task_work_add+0x1ad/0x240
[ 63.695741] sched_tick+0x1c7/0x500
[ 63.695744] update_process_times+0xf1/0x130
[ 63.695750] tick_nohz_handler+0xf7/0x230
[ 63.695754] __hrtimer_run_queues+0x13b/0x7b0
[ 63.695758] hrtimer_interrupt+0x1c2/0x350
[ 63.695763] __sysvec_apic_timer_interrupt+0xdb/0x340
[ 63.695770] sysvec_apic_timer_interrupt+0x9c/0xd0
[ 63.695774] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 63.695780] __asan_load8+0x8/0xa0
[ 63.695784] mas_wr_end_piv+0x28/0x2c0
[ 63.695789] mas_preallocate+0x3a8/0x680
[ 63.695793] vma_shrink+0x180/0x3f0
[ 63.695799] shift_arg_pages+0x219/0x2c0
[ 63.695804] setup_arg_pages+0x343/0x5e0
[ 63.695807] load_elf_binary+0x5ac/0x15d0
[ 63.695813] search_binary_handler+0x125/0x370
[ 63.695818] exec_binprm+0xc9/0x3d0
[ 63.695821] bprm_execve+0x103/0x230
[ 63.695824] kernel_execve+0x187/0x1c0
[ 63.695828] call_usermodehelper_exec_async+0x145/0x1e0
[ 63.695832] ret_from_fork+0x31/0x60
[ 63.695836] ret_from_fork_asm+0x1a/0x30
[ 63.695840]
[ 63.695840] other info that might help us debug this:
[ 63.695840]
[ 63.695842] Chain exists of:
[ 63.695842] &lock->wait_lock --> &p->pi_lock --> &rq->__lock
[ 63.695842]
[ 63.695850] Possible unsafe locking scenario:
[ 63.695850]
[ 63.695851]        CPU0                    CPU1
[ 63.695852]        ----                    ----
[ 63.695854]   lock(&rq->__lock);
[ 63.695857]                               lock(&p->pi_lock);
[ 63.695861]                               lock(&rq->__lock);
[ 63.695864]   lock(&lock->wait_lock);
[ 63.695868]
[ 63.695868] *** DEADLOCK ***

Fix it by using raw_spin_trylock_irqsave() instead.
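Roughly, the idea is something like the following. This is only a sketch of
the change (it reuses the existing rt_mutex_owner() and __rt_mutex_slowtrylock()
helpers and simplifies the surrounding code), not the final patch:

static int __sched rt_mutex_slowtrylock(struct rt_mutex_base *lock)
{
        unsigned long flags;
        int ret;

        /* If the lock already has an owner, the trylock simply fails. */
        if (rt_mutex_owner(lock))
                return 0;

        /*
         * Trylock the wait_lock as well so that a trylock caller never
         * spins on it; failing to get it just makes the trylock fail.
         */
        if (!raw_spin_trylock_irqsave(&lock->wait_lock, flags))
                return 0;

        ret = __rt_mutex_slowtrylock(lock);

        raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

        return ret;
}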
That truncated lockdep report doesn't really explain anything. Please
just transcribe the full lockdep report into something small and
coherent.
I was trying to show where the offending call is coming from. I will send a
v2 with a condensed lockdep splat.
No no no... explain the actual problem.

Is the problem that:

  sched_tick()
    task_tick_mm_cid()
      task_work_add()
        kasan_save_stack()

idiotic crap while holding rq->__lock ?

Because afaict that is completely insane. And has nothing to do with
rtmutex.

We are not going to change rtmutex because instrumentation shit is shit.

Yes, it is KASAN that causes the page allocation while holding rq->__lock, so maybe we can blame KASAN for this. It is not actually a problem for a non-PREEMPT_RT kernel because only trylocks are used there. However, when rt_spin_trylock() is used on a PREEMPT_RT kernel, we don't use trylock all the way down. This is certainly a problem that we need to fix, as there may be other similar cases, not involving rq->__lock, lurking somewhere.
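
To illustrate the kind of pattern I am worried about, here is a hypothetical
example (the struct and function are made up; only the locking shape matters):

struct foo {
        raw_spinlock_t a;
        spinlock_t b;           /* an rt_mutex based spinlock on PREEMPT_RT */
        int counter;
};

static void opportunistic_update(struct foo *f)
{
        raw_spin_lock(&f->a);           /* abnormal order: a taken first */
        if (spin_trylock(&f->b)) {      /* becomes rt_spin_trylock() on PREEMPT_RT */
                f->counter++;
                spin_unlock(&f->b);
        }
        raw_spin_unlock(&f->a);
}

On a non-PREEMPT_RT kernel the trylock can only fail, so the reversed order is
harmless. On a PREEMPT_RT kernel, rt_spin_trylock() still spins on b's
wait_lock, so if a wait_lock holder directly or indirectly waits for a, we get
the same kind of cycle that lockdep reported above.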

Cheers,
Longman