Re: [PATCH 6/6] mm: sigbus instead of abusing oom

From: Wu Fengguang
Date: Tue Nov 10 2009 - 23:35:48 EST


On Wed, Nov 11, 2009 at 10:42:04AM +0800, KOSAKI Motohiro wrote:
> > On Tue, 10 Nov 2009 22:06:49 +0000 (GMT)
> > Hugh Dickins <hugh.dickins@xxxxxxxxxxxxx> wrote:
> >
> > > When do_nonlinear_fault() realizes that the page table must have been
> > > corrupted for it to have been called, it does print_bad_pte() and
> > > returns ... VM_FAULT_OOM, which is hard to understand.
> > >
> > > It made some sense when I did it for 2.6.15, when do_page_fault()
> > > just killed the current process; but nowadays it lets the OOM killer
> > > decide who to kill - so page table corruption in one process would
> > > be liable to kill another.
> > >
> > > Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee
> > > that the process will be killed, but is good enough for such a rare
> > > abnormality, accompanied as it is by the "BUG: Bad page map" message.
> > >
> > > And recent HWPOISON work has copied that code into do_swap_page(),
> > > when it finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
> > >
> > > Signed-off-by: Hugh Dickins <hugh.dickins@xxxxxxxxxxxxx>
> >
> > Thank you !
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> Thank you, me too.
>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>

Thank you!

Reviewed-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>


Some unrelated comments:

We observed that copy_to_user() on a hwpoison page triggers three
(duplicate) late kills (the last three lines below):

early kill:
[ 56.964041] virtual address 7fffcab7d000 found in vma
[ 56.964390] 7fffcab7d000 phys b4365000
[ 58.089254] Triggering MCE exception on CPU 0
[ 58.089563] Disabling lock debugging due to kernel taint
[ 58.089914] Machine check events logged
[ 58.090187] MCE exception done on CPU 0
[ 58.090462] MCE 0xb4365: page flags 0x100000000100068=uptodate,lru,active,mmap,anonymous,swapbacked count 1 mapcount 1
[ 58.091878] MCE 0xb4365: Killing copy_to_user_te:3768 early due to hardware memory corruption
[ 58.092425] MCE 0xb4365: dirty LRU page recovery: Recovered

late kill on copy_to_user():
[ 59.136331] Copy 4096 bytes to 00007fffcab7d000
[ 59.136641] MCE: Killing copy_to_user_te:3768 due to hardware memory corruption fault at 7fffcab7d000
[ 59.137231] MCE: Killing copy_to_user_te:3768 due to hardware memory corruption fault at 7fffcab7d000
[ 59.137812] MCE: Killing copy_to_user_te:3768 due to hardware memory corruption fault at 7fffcab7d001

This patch does not change that behavior, which is somewhat odd but harmless.

Thanks,
Fengguang