PROBLEM: infinite loop do_sparc64_fault with fault_code 2

From: weiqi
Date: Tue Jun 02 2015 - 02:54:41 EST



Hello everyone,

Recently I have been working on a sparc64 machine running linux-2.6.32
(32 cores, SMP), with a 64-bit kernel and a 32-bit userspace.

When I run the LTP test case with the command "./kill10 -c100 -g 1 -n 1",
it occasionally gets trapped in an infinite page-fault loop, and one of
the kill10 processes uses 100% CPU. (This is easy to reproduce: just run
the command again and again.)
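
Since the hang only shows up occasionally, the repeated runs can be automated. Below is a minimal sketch in C, assuming a POSIX system. run_with_timeout() is a hypothetical helper (it is not part of LTP) that starts a command and reports whether it finished within a time limit. A process that does not finish is left running so that it can be inspected.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of LTP.
 * Runs argv[] as a child process and polls for its exit every 100 ms.
 * Returns 0 if the child exited within timeout_sec seconds,
 *         1 if it was still running (its pid is stored in *hung_pid),
 *        -1 if fork() or waitpid() failed.
 */
static int run_with_timeout(char *const argv[], int timeout_sec, pid_t *hung_pid)
{
    const struct timespec poll_interval = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    int waited_ms = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                     /* exec failed */
    }

    for (;;) {
        pid_t r = waitpid(pid, NULL, WNOHANG);

        if (r == pid)
            return 0;                   /* child finished (and was reaped) */
        if (r < 0)
            return -1;
        if (waited_ms >= timeout_sec * 1000)
            break;                      /* still running: treat as a hang */
        nanosleep(&poll_interval, NULL);
        waited_ms += 100;
    }

    if (hung_pid)
        *hung_pid = pid;                /* left alive on purpose for inspection */
    return 1;
}
```

Calling run_with_timeout() in a loop with argv = { "./kill10", "-c100", "-g", "1", "-n", "1", NULL } and a generous timeout should stop at the first run that hangs. kill10 forks its own children, so the spinning process may be one of those rather than the pid reported here.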

After some debugging, I found:

 1) The fault address is always the same and always lies on kill10's
user stack, for example "0xffb0b470".

 2) The fault happens while kill10 is handling a signal, at put_user().
Code path: arch/sparc/kernel/signal32.c: setup_frame32() -->
put_user().

 3) The first fault is handled by do_wp_page() because of COW;
do_wp_page() then finds PageAnon(old_page) and reuses old_page.

 4) It then goes into an infinite fault loop with fault_code 2 (D-TLB
miss). Each fault is handled in handle_pte_fault() and leaves through
the flush_tlb_page() path, which carries this comment:
        /*
         * This is needed only for protection faults but the arch code
         * is not yet telling us if this is a protection fault or not.
         * This still avoids useless tlb flushes for .text page faults
         * with threads.
         */
        if (flags & FAULT_FLAG_WRITE)
                flush_tlb_page(vma, address);
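
Point 1) above can be cross-checked from user space. The following is a minimal sketch, assuming a Linux /proc filesystem that labels the main thread's stack as "[stack]" in the maps file. addr_in_stack() is a hypothetical helper name, not an existing API.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look for addr inside the "[stack]" mapping listed
 * in a /proc/<pid>/maps style file.
 * Returns 1 if addr is inside that mapping, 0 if it is not,
 * and -1 if the file cannot be opened.
 */
static int addr_in_stack(const char *maps_path, unsigned long addr)
{
    char line[512];
    int found = 0;
    FILE *f = fopen(maps_path, "r");

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;

        if (!strstr(line, "[stack]"))
            continue;                   /* only the stack mapping matters */
        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
            continue;                   /* not a "start-end ..." line */
        if (addr >= start && addr < end) {
            found = 1;
            break;
        }
    }

    fclose(f);
    return found;
}
```

For the spinning kill10 process, the path would be /proc/<pid>/maps with that process's pid, and addr the fault address (0xffb0b470 in the example above).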

I've also tested with linux-3.10 and got almost the same result.

I know sparc handles TLB misses in software. do_wp_page() calls
flush_tlb_page() and update_mmu_cache(), but this seems to have no
effect: the D-TLB miss just repeats infinitely at the same address.
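
The sequence described in points 2) to 4) can be reduced to a small user-space scenario: fork, so that the stack pages become copy-on-write, then deliver a signal to the child so that the kernel has to write a signal frame onto the child's stack. The following is a minimal sketch under those assumptions. It is not the kill10 source and has not been verified to trigger the loop; signal_on_cow_stack() and on_usr1() are hypothetical names. Whether the signal frame lands on a page that is still shared depends on the stack layout at the time.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;

static void on_usr1(int sig)
{
    (void)sig;
    got_signal = 1;
}

/*
 * Hypothetical reduced scenario (not taken from kill10).
 * Returns 0 if the child ran its handler and exited cleanly, -1 otherwise.
 */
static int signal_on_cow_stack(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    pid_t pid;
    int status = 0, ret = -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old_sa) < 0)
        return -1;

    /* Keep SIGUSR1 pending until the child is ready to wait for it. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGUSR1);

    pid = fork();                       /* stack pages are now shared COW */
    if (pid == 0) {
        /* Child: delivery makes the kernel write a signal frame here. */
        while (!got_signal)
            sigsuspend(&wait_mask);
        _exit(0);
    }
    if (pid > 0) {
        kill(pid, SIGUSR1);
        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ret = 0;
    }

    /* Restore the caller's signal state. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);
    return ret;
}
```

On a healthy kernel signal_on_cow_stack() returns 0. If the child were caught in the fault loop described above, the waitpid() call would presumably never return.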
