Re: frequent lockups in 3.18rc4

From: Linus Torvalds
Date: Mon Dec 15 2014 - 13:22:02 EST


On Sun, Dec 14, 2014 at 9:57 PM, Dave Jones <davej@xxxxxxxxxx> wrote:
>
> We had a flashback to that old bug last month too.
> See this mail & your followup: https://lkml.org/lkml/2014/11/25/1171
> That was during a bisect though, so it may have been something
> entirely different, but it is a spooky coincidence.

Yeah, there's something funny going on there.

Anyway, I've looked at the page fault patch, and as I mentioned the
last time this came up, there's a nasty possible kernel loop in the
"retry" case if there's also a fatal signal pending and we're
returning to kernel mode rather than to user mode.

If we return to user mode, the exit path will handle signals: the
pending fatal signal kills the process, and everything is fine.

But if we're returning to kernel mode, we'll just take the page fault
again. And again. And again. Until the condition that caused the retry
is finally cleared.

Now, normally finishing the IO on the page (or whatever the retry is
waiting for) should clear the condition, but busy-looping on it in
kernel space might end up delaying that too, possibly forever.
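
To make the failure mode concrete, here is the check as it stands
today (it is exactly what the patch below deletes), sketched from the
current __do_page_fault():

	fault = handle_mm_fault(mm, vma, address, flags);

	/*
	 * Bail out early if we need to retry and a fatal signal is
	 * pending. __lock_page_or_retry() has already dropped the
	 * mmap_sem for us.
	 */
	if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
		return;

Nothing in that early return actually delivers the pending SIGKILL
when the fault came from kernel mode, so the faulting access just runs
again and lands right back here.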

So let's just fix it. Here's a mostly untested patch. It looks bigger
than it really is: it moves the "up_read()" earlier in
__do_page_fault() and drops the copies in the error paths, so that all
the locking logic is in one sane place. It is "tested" only in the
sense that I am running a kernel with this patch right now, but I
could easily have screwed up some fault-handling case.
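
Condensed out of the diff below (ignoring the error and accounting
details), the new retry handling ends up looking roughly like this:

	fault = handle_mm_fault(mm, vma, address, flags);
	major |= fault & VM_FAULT_MAJOR;

	if (unlikely(fault & VM_FAULT_RETRY)) {
		/* mmap_sem is already gone; retry at most once */
		if ((flags & FAULT_FLAG_ALLOW_RETRY) && !fatal_signal_pending(current)) {
			flags &= ~FAULT_FLAG_ALLOW_RETRY;
			flags |= FAULT_FLAG_TRIED;
			goto retry;
		}

		/* Not returning to user mode? Handle exceptions or die: */
		if (!(flags & FAULT_FLAG_USER))
			no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);

		/* User mode? Just return to handle the fatal exception */
		return;
	}

	up_read(&mm->mmap_sem);

The major/minor accounting then happens after the up_read(), keyed off
the accumulated "major" bit rather than off FAULT_FLAG_ALLOW_RETRY.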

Anyway, at least CPU1 in your traces was actually going through that
__lock_page_or_retry() code that could trigger this, so...

Linus

 arch/x86/mm/fault.c | 65 +++++++++++++++++++++++++----------------------------
 1 file changed, 30 insertions(+), 35 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d973e61e450d..b38adc1cd39f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -844,11 +844,8 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	  unsigned int fault)
 {
 	struct task_struct *tsk = current;
-	struct mm_struct *mm = tsk->mm;
 	int code = BUS_ADRERR;
 
-	up_read(&mm->mmap_sem);
-
 	/* Kernel mode? Handle exceptions or die: */
 	if (!(error_code & PF_USER)) {
 		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
@@ -879,7 +876,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, unsigned int fault)
 {
 	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
-		up_read(&current->mm->mmap_sem);
 		no_context(regs, error_code, address, 0, 0);
 		return;
 	}
@@ -887,14 +883,11 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!(error_code & PF_USER)) {
-			up_read(&current->mm->mmap_sem);
 			no_context(regs, error_code, address,
 				   SIGSEGV, SEGV_MAPERR);
 			return;
 		}
 
-		up_read(&current->mm->mmap_sem);
-
 		/*
 		 * We ran out of memory, call the OOM killer, and return the
 		 * userspace (which will retry the fault, or kill us if we got
@@ -1062,7 +1055,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
-	int fault;
+	int fault, major = 0;
 	unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
 
 	tsk = current;
@@ -1237,14 +1230,31 @@ good_area:
 	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
 	 */
 	fault = handle_mm_fault(mm, vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	/*
-	 * If we need to retry but a fatal signal is pending, handle the
-	 * signal first. We do not need to release the mmap_sem because it
-	 * would already be released in __lock_page_or_retry in mm/filemap.c.
+	 * If we need to retry the mmap_sem has already been released,
+	 * and if there is a fatal signal pending there is no guarantee
+	 * that we made any progress. Handle this case first.
 	 */
-	if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+	if (unlikely(fault & VM_FAULT_RETRY)) {
+		if ((flags & FAULT_FLAG_ALLOW_RETRY) && !fatal_signal_pending(current)) {
+			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
+			   of starvation. */
+			flags &= ~FAULT_FLAG_ALLOW_RETRY;
+			flags |= FAULT_FLAG_TRIED;
+			goto retry;
+		}
+
+		/* Not returning to user mode? Handle exceptions or die: */
+		if (!(flags & FAULT_FLAG_USER))
+			no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+
+		/* User mode? Just return to handle the fatal exception */
 		return;
+	}
+
+	up_read(&mm->mmap_sem);
 
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, error_code, address, fault);
@@ -1252,32 +1262,17 @@ good_area:
 		return;
 	}
 
 	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
+	 * Major/minor page fault accounting. If any of the events
+	 * returned VM_FAULT_MAJOR, we account it as a major fault.
 	 */
-	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-				      regs, address);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-				      regs, address);
-		}
-		if (fault & VM_FAULT_RETRY) {
-			/* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
-			   of starvation. */
-			flags &= ~FAULT_FLAG_ALLOW_RETRY;
-			flags |= FAULT_FLAG_TRIED;
-			goto retry;
-		}
+	if (major) {
+		tsk->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+	} else {
+		tsk->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
 	}
-
 	check_v8086_mode(regs, address, tsk);
-
-	up_read(&mm->mmap_sem);
 }
 NOKPROBE_SYMBOL(__do_page_fault);