Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock

From: Peter Zijlstra
Date: Tue Mar 22 2016 - 09:56:28 EST


On Tue, Mar 22, 2016 at 02:26:00PM +0100, Heiko Carstens wrote:
> > Clearly something magical is going on and its not clear.
>
> The mechanism of our pfault code: if Linux is running as guest, runs a user
> space process and the user space process accesses a page that the host has
> paged out we get a pfault interrupt.
>
> This allows us, within the guest, to schedule a different process. Without
> this mechanism the host would have to suspend the whole virtual CPU until
> the page has been paged in.
>
> So when we get such an interrupt then we set the state of the current task
> to uninterruptible and also set the need_resched flag. Both happens within
> interrupt context(!). If we later on want to return to user space we
> recognize the need_resched flag and then call schedule().
> It's not very obvious how this works...

A few lines like the above near that function would go a long while I
think.

And, ah!, you rely on the return to user resched to not be a
preempt_schedule, how very icky :-)

Now, what happens if that task gets a spurious wakeup? Will it take the
fault again, raise the PF int again etc.. ?

> Of course we have a lot of additional fun with the completion interrupt (->
> host signals that a page of a process has been paged in and the process can
> continue to run). This interrupt can arrive on any cpu and, since we have
> virtual cpus, actually appear before the interrupt that signals that a page
> is missing.

Of course :-)

Something like the below perhaps?

---
arch/s390/mm/fault.c | 44 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 791a4146052c..52cc8c99e62c 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -629,6 +629,29 @@ void pfault_fini(void)
static DEFINE_SPINLOCK(pfault_lock);
static LIST_HEAD(pfault_list);

+#define PF_COMPLETE 0x0080
+
+/*
+ * The mechanism of our pfault code: if Linux is running as guest, runs a user
+ * space process and the user space process accesses a page that the host has
+ * paged out we get a pfault interrupt.
+ *
+ * This allows us, within the guest, to schedule a different process. Without
+ * this mechanism the host would have to suspend the whole virtual CPU until
+ * the page has been paged in.
+ *
+ * So when we get such an interrupt then we set the state of the current task
+ * to uninterruptible and also set the need_resched flag. Both happens within
+ * interrupt context(!). If we later on want to return to user space we
+ * recognize the need_resched flag and then call schedule(). It's not very
+ * obvious how this works...
+ *
+ * Of course we have a lot of additional fun with the completion interrupt (->
+ * host signals that a page of a process has been paged in and the process can
+ * continue to run). This interrupt can arrive on any cpu and, since we have
+ * virtual cpus, actually appear before the interrupt that signals that a page
+ * is missing.
+ */
static void pfault_interrupt(struct ext_code ext_code,
unsigned int param32, unsigned long param64)
{
@@ -637,14 +660,14 @@ static void pfault_interrupt(struct ext_code ext_code,
pid_t pid;

/*
- * Get the external interruption subcode & pfault
- * initial/completion signal bit. VM stores this
- * in the 'cpu address' field associated with the
- * external interrupt.
+ * Get the external interruption subcode & pfault initial/completion
+ * signal bit. VM stores this in the 'cpu address' field associated
+ * with the external interrupt.
*/
subcode = ext_code.subcode;
if ((subcode & 0xff00) != __SUBCODE_MASK)
return;
+
inc_irq_stat(IRQEXT_PFL);
/* Get the token (= pid of the affected task). */
pid = param64 & LPP_PFAULT_PID_MASK;
@@ -655,8 +678,9 @@ static void pfault_interrupt(struct ext_code ext_code,
rcu_read_unlock();
if (!tsk)
return;
+
spin_lock(&pfault_lock);
- if (subcode & 0x0080) {
+ if (subcode & PF_COMPLETE) {
/* signal bit is set -> a page has been swapped in by VM */
if (tsk->thread.pfault_wait == 1) {
/* Initial interrupt was faster than the completion
@@ -683,10 +707,10 @@ static void pfault_interrupt(struct ext_code ext_code,
/* signal bit not set -> a real page is missing. */
if (WARN_ON_ONCE(tsk != current))
goto out;
+
if (tsk->thread.pfault_wait == 1) {
/* Already on the list with a reference: put to sleep */
- __set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- set_tsk_need_resched(tsk);
+ goto block;
} else if (tsk->thread.pfault_wait == -1) {
/* Completion interrupt was faster than the initial
* interrupt (pfault_wait == -1). Set pfault_wait
@@ -701,7 +725,11 @@ static void pfault_interrupt(struct ext_code ext_code,
get_task_struct(tsk);
tsk->thread.pfault_wait = 1;
list_add(&tsk->thread.list, &pfault_list);
- __set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+block:
+ /* Since this must be a userspace fault, there
+ * is no kernel task state to trample. Rely on the
+ * return to userspace schedule() to block */
+ __set_current_state(TASK_UNINTERRUPTIBLE);
set_tsk_need_resched(tsk);
}
}