Re: [kerneloops] regression in 2.6.27 wrt "lock_page" and the "hwclock" program

From: Linus Torvalds
Date: Sun Oct 12 2008 - 16:16:56 EST




On Sun, 12 Oct 2008, Karel Zak wrote:
>
> Any suggestion how to nicely implement "don't schedule me out"?

There's nothing you can do. If you take a page fault, you're done. Forget
about any "can't schedule" or "don't enable interrupts". The kernel _has_
to handle the page fault, and that may involve IO and thus random pauses.
No ifs, buts or maybes about it.

This patch may or may not get rid of the warning, at least. It won't fix
hwclock, but that's apparently unfixable from the kernel - the thing is
just plain buggy.

[ Ingo added to Cc just because this is obviously an x86 tree thing, and
tries to unify some trivial parts of the VM paths at the same time. ]

For hwclock, you may try to:

 - do

        mlockall(MCL_CURRENT)

   before you do the critical region

 - set yourself to some realtime scheduling thing

        struct sched_param param = {
                .sched_priority = 50,
        };

        sched_setscheduler(0, SCHED_FIFO, &param);

   or similar.

and that should mean that you stay on your CPU (by virtue of not being
scheduled away because you're more important than others) and don't take
page faults.

But making yourself real-time also means that any bugs can essentially
kill the system (endless loop).
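
A minimal sketch of the two hwclock suggestions above, combined into a
pair of helpers. The names critical_enter() and critical_leave() are
hypothetical and not anything hwclock actually has; the priority of 50 is
the value from the snippet above. Both calls typically need privileges
(a sufficient RLIMIT_MEMLOCK or CAP_IPC_LOCK for mlockall(), CAP_SYS_NICE
or a nonzero RLIMIT_RTPRIO for SCHED_FIFO), so the sketch reports failure
rather than assuming success.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: lock the currently mapped pages and switch the
 * calling process to SCHED_FIFO. Returns 0 on success, -1 on failure.
 * On failure nothing is left half-applied.
 */
static int critical_enter(void)
{
	struct sched_param param = {
		.sched_priority = 50,	/* value taken from the mail above */
	};

	/* Pin everything that is mapped right now into RAM. */
	if (mlockall(MCL_CURRENT) != 0) {
		fprintf(stderr, "mlockall: %s\n", strerror(errno));
		return -1;
	}

	/* pid 0 means "the calling process". */
	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
		fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
		munlockall();
		return -1;
	}

	return 0;
}

/*
 * Hypothetical helper: go back to the default time-sharing policy and
 * unlock memory. Returns 0 on success, -1 if either step failed.
 */
static int critical_leave(void)
{
	struct sched_param param = {
		.sched_priority = 0,	/* SCHED_OTHER requires priority 0 */
	};
	int ret = 0;

	if (sched_setscheduler(0, SCHED_OTHER, &param) != 0)
		ret = -1;
	if (munlockall() != 0)
		ret = -1;

	return ret;
}
```

The idea would be to call critical_enter() right before the timing
sensitive part and critical_leave() right after it, keeping the region as
short as possible. Note that MCL_CURRENT only covers pages that are mapped
at the time of the call, so memory allocated inside the region can still
fault.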

Linus

---
arch/x86/mm/fault.c | 30 +++++++++++-------------------
1 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a742d75..ac2ad78 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -645,24 +645,23 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
}


-#ifdef CONFIG_X86_32
- /* It's safe to allow irq's after cr2 has been saved and the vmalloc
- fault has been handled. */
- if (regs->flags & (X86_EFLAGS_IF | X86_VM_MASK))
- local_irq_enable();
-
/*
- * If we're in an interrupt, have no user context or are running in an
- * atomic region then we must not take the fault.
+ * It's safe to allow irq's after cr2 has been saved and the
+ * vmalloc fault has been handled.
+ *
+ * User-mode registers count as a user access even for any
+ * potential system fault or CPU buglet.
*/
- if (in_atomic() || !mm)
- goto bad_area_nosemaphore;
-#else /* CONFIG_X86_64 */
- if (likely(regs->flags & X86_EFLAGS_IF))
+ if (user_mode_vm(regs)) {
+ local_irq_enable();
+ error_code |= PF_USER;
+ } else if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();

+#ifdef CONFIG_X86_64
if (unlikely(error_code & PF_RSVD))
pgtable_bad(address, regs, error_code);
+#endif

/*
* If we're in an interrupt, have no user context or are running in an
@@ -671,14 +670,7 @@ void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (unlikely(in_atomic() || !mm))
goto bad_area_nosemaphore;

- /*
- * User-mode registers count as a user access even for any
- * potential system fault or CPU buglet.
- */
- if (user_mode_vm(regs))
- error_code |= PF_USER;
again:
-#endif
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
