Re: [PATCH] locking/rtmutex: Always use trylock in rt_mutex_trylock()

From: Waiman Long
Date: Tue Oct 08 2024 - 09:22:43 EST


On 10/8/24 3:38 AM, Peter Zijlstra wrote:
> On Mon, Oct 07, 2024 at 11:54:54AM -0400, Waiman Long wrote:
>> On 10/7/24 11:33 AM, Peter Zijlstra wrote:
>>> On Mon, Oct 07, 2024 at 11:23:32AM -0400, Waiman Long wrote:
>>>
>>> Is the problem that:
>>>
>>>   sched_tick()
>>>     raw_spin_lock(&rq->__lock);
>>>     task_tick_mm_cid()
>>>       task_work_add()
>>>         kasan_save_stack()
>>>           idiotic crap while holding rq->__lock ?
>>>
>>> Because afaict that is completely insane. And has nothing to do with
>>> rtmutex.
>>>
>>> We are not going to change rtmutex because instrumentation shit is shit.
>> Yes, it is KASAN that causes page allocation while holding the
>> rq->__lock. Maybe we can blame KASAN for this. It is actually not a
>> problem for a non-PREEMPT_RT kernel because only trylock is being used.
>> However, we don't use trylock all the way down when rt_spin_trylock()
>> is used with a PREEMPT_RT kernel.
> It has nothing to do with trylock, and everything to do with scheduler
> locks being special.
>
> But even so, trying to squirrel a spinlock inside a raw_spinlock is
> dodgy at the best of times; yes, it mostly works, but it should be
> avoided whenever possible.
>
> And instrumentation just doesn't count.
>
>> This is certainly a problem that we need to fix as there
>> may be other similar cases not involving rq->__lock lurking somewhere.
> There cannot be, lock order is:
>
>   rtmutex->wait_lock
>     task->pi_lock
>       rq->__lock
>
> Trying to subvert that order gets you a splat, any other:
>
>   raw_spin_lock(&foo);
>   spin_trylock(&bar);
>
> will 'work', despite probably not being a very good idea.
>
> Any case involving the scheduler locks needs to be eradicated, not
> worked around.
>> OK, I will see what I can do to work around this issue.
>
> Something like the completely untested below might just work.

The real problem is the occasional need to allocate new pages to expand
the stack buffer in the stack depot, which takes an additional lock.
Fortunately, there is a kasan_record_aux_stack_noalloc() variant that
avoids that. Below is my proposed solution, which is less restrictive.
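
For context, a rough sketch of what the two KASAN helpers boil down to
(simplified for illustration only; record_aux_stack() below is a made-up
stand-in, not the actual mm/kasan code, but both helpers end up calling
into stack_depot_save_flags() in roughly this way):

#include <linux/kernel.h>
#include <linux/stacktrace.h>
#include <linux/stackdepot.h>

/* Simplified sketch of kasan_record_aux_stack{,_noalloc}() */
static depot_stack_handle_t record_aux_stack(bool can_alloc)
{
	unsigned long entries[64];
	unsigned int nr_entries;

	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);

	/*
	 * Without STACK_DEPOT_FLAG_CAN_ALLOC, the stack depot never calls
	 * into the page allocator; it just fails to record the stack if
	 * the current pool is full. That is what makes the _noalloc()
	 * variant safe under rq->__lock (a raw_spinlock_t) on PREEMPT_RT,
	 * where the allocator's spinlock_t is a sleeping lock.
	 */
	return stack_depot_save_flags(entries, nr_entries, 0,
				      can_alloc ? STACK_DEPOT_FLAG_CAN_ALLOC : 0);
}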

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index cf5e7e891a77..2964171856e0 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -14,11 +14,14 @@ init_task_work(struct callback_head *twork, task_work_func_t func)
 }

 enum task_work_notify_mode {
-    TWA_NONE,
+    TWA_NONE = 0,
     TWA_RESUME,
     TWA_SIGNAL,
     TWA_SIGNAL_NO_IPI,
     TWA_NMI_CURRENT,
+
+    TWA_FLAGS = 0xff00,
+    TWAF_NO_ALLOC = 0x0100,
 };

 static inline bool task_work_pending(struct task_struct *task)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..0259301e572e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10458,7 +10458,9 @@ void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
         return;
     if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
         return;
-    task_work_add(curr, work, TWA_RESUME);
+
+    /* No page allocation under rq lock */
+    task_work_add(curr, work, TWA_RESUME | TWAF_NO_ALLOC);
 }

 void sched_mm_cid_exit_signals(struct task_struct *t)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 5d14d639ac71..c969f1f26be5 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -55,15 +55,26 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
           enum task_work_notify_mode notify)
 {
     struct callback_head *head;
+    int flags = notify & TWA_FLAGS;

+    notify &= ~TWA_FLAGS;
     if (notify == TWA_NMI_CURRENT) {
         if (WARN_ON_ONCE(task != current))
             return -EINVAL;
         if (!IS_ENABLED(CONFIG_IRQ_WORK))
             return -EINVAL;
     } else {
-        /* record the work call stack in order to print it in KASAN reports */
-        kasan_record_aux_stack(work);
+        /*
+         * Record the work call stack in order to print it in KASAN
+         * reports.
+         *
+         * Note that stack allocation can fail if the TWAF_NO_ALLOC flag
+         * is set and a new page is needed to expand the stack buffer.
+         */
+        if (flags & TWAF_NO_ALLOC)
+            kasan_record_aux_stack_noalloc(work);
+        else
+            kasan_record_aux_stack(work);
     }

     head = READ_ONCE(task->task_works);
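
Note that the flag bits occupy the high byte (TWA_FLAGS == 0xff00), above
all of the existing notify modes, so a caller can OR TWAF_NO_ALLOC into
the notify argument (as in the task_tick_mm_cid() hunk above) without
changing the task_work_add() prototype, and callers that pass a bare
notify mode are unaffected because the flag bits are masked off before
the mode is checked.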