Good luck when RedHat 7.0 comes out (was RE: test5 oops after kswapd)...

From: jeff@ntcor.com
Date: Mon Jul 31 2000 - 14:54:07 EST


A week ago I installed RedHat Rawhide on my machine to check it out.
Everything seems to work fine. Except... Kernel compilations.

Since RedHat 7.0beta just hit the mirrors I'm betting all of my kernel
problems are all of the sudden going to become a wide spread problem.

Here's what I've been struggling with for a week now with very little
progress in terms of successful help...

Keep in mind that Rawhide (and 7.0beta probably) uses GCC 2.96. So if any
of the below problems are bugs in the GCC compiler I would suggest we make
the kernel work around those problems or *Everybody* that goes to 7.0 is
going to freak out like me.

Kernel version 2.2.17pre13: doesn't link completly. aic7xxx complains about
   missing memcpy reference. I've heard this is an issue with GCC 2.96 not
   inlining memcpy properly or something like that None of the patches I've
   seen on the mailing list fixed this. One allowed me to complete the
   compile but the running kernel resulted in *tons* of "failed to load module"
   messages. for modules named (get this)... "A", "B", "C", "D", "E", ... all
   the way through the damn ascii list. Which is more than wonderful when you
   get to Ctrl-N and Ctrl-O.

Thus I thought "Hey, if I can't compile 2.2 maybe they've got it worked out in
2.4..."

Kernel version 2.4-test5: yippie! it compiles fine!. Too bad the kernel
   fails to boot past "Starting kswapd v1.7" which after printing promptly
   produces an oops. No matter how I compile it this happens. Currently I
   compiled a fully non-modular kernel with only the basics to get my system
   up and on the net and this is what I get...

Unable to handle NULL pointer dereference at virtual address 00000000
 printing eip:
c0114b44
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0114b44>]
EFLAGS: 00010092
eax: c14a8000 ebx: 00000023 ecx: 00000000 edx: c0223538
esi: 00000000 edi: c14a8222 ebp: c14a9fcc esp: c14a9fa4
ds: 0018 es: 0018 ss:0018
Process kflushd (pid: 3, stackpage=c14a9000)
Stack: 00000000 c0118fad 00000246 00000000 c14bff48 c14a8000 00000023 c14a8000
       c14a8000 c14a8222 0008e000 c0109300 c14bff3c ffffffff c14a8000 c01bcc27
       00000f00 c14bff70 c012cf70 c0108c2b c14bff3c 00000078 c01f9fd8
Call Trace: [<c0118fad>] [<c0109300>] [<c01bcc27>] [<c012cf70>] [<c0108c2b>]
Code: 8b 01 89 45 dc 21 d8 a9 df ff ff ff 0f 84 f8 00 00 00 8b 55

Which by itself would normally be of great interest and help in finding and
fixing the problem except that ksymoops fails on this. (Yes even ksymoops
2.3.4)...

[root@stingray linux]# ksymoops -v /usr/src/linux/vmlinux -m System.map -O -L -K < /home/jeff/test5plain
ksymoops 2.3.4 on i686 2.2.16-8. Options used
     -v /usr/src/linux/vmlinux (specified)
     -K (specified)
     -L (specified)
     -O (specified)
     -m System.map (specified)

Fatal Error (re_compile): on '^((Stack: )|(Stack from )|([0-9a-fA-F]{4,})|(Call Trace: )|(\[*<([0-9a-fA-F]{4,})>\]* *)|(Version_[0-9]+)|(Trace: )|(Call backtrace:)|(bh:)|<\[([0-9a-fA-F]{4,})\]> *|(\([^ ]+\) *\[*<([0-9a-fA-F]{4,})>\]* *)|([0-9]+ +base=)|(Kernel BackChain)|EBP *EIP|0x([0-9a-fA-F]{4,})
*0x([0-9a-fA-F]{4,}) *|Process .*stackpage=|Code *: |Kernel panic|In swapper task|kmem_free|swapper|Corrupted stack page|invalid operand: |Oops: |Cpu: +[0-9]|current->tss|\*pde +=|EIP: |EFLAGS: |eax: |esi: |ds: |CR0: |wait_on_|irq: |pc[:=]|68060 access|Exception at |d[04]: |Frame format=|wb [0-9]
stat|push data: |baddr=|Arithmetic fault|Instruction fault|Bad unaligned kernel|Forwarding unaligned exception|: unhandled unaligned exception|pc *=|[rvtsa][0-9]+ *=|gp *=|spinlock stuck|tsk->|PSR: |[goli]0: |Instruction DUMP: |spin_lock|TSTATE: |[goli]4: |Caller\[|CPU\[|MSR: |TASK = |last math
|GPR[0-9]+: |\$[0-9 ]+:|epc |Status:|Cause :|Backtrace:|Function entered at|\*pgd =|Internal error|pc :|sp :|r[0-9][0-9 ]:|Flags:|Control:|WARNING:|this_stack:|i:|PSW|cr[0-9]+:|r[0-9]+:|machine check|Exception in |Program Check |System restart |IUCV |unexpected external |Kernel stack |User PSW|User
GPRS|User ACRS|illegal operation|task:)' - Invalid range end

Now granted... I'm quite a kernel internal idiot. But I've been following
everybody's advice on how to trace oops problems and obviously these procedures
aren't working on my system. Which is fine (well for everybody else and the
needs of the many and that sort of stuff.) but I suspect that this will quickly
change as people "Upgrade" to RedHat 7.0 when *nobody's* system is going to
be able to trace oops output properly.

Luckily, I *can* get a decent ksymoops trace if I transfer vmlinux, System.map
and the oops output to a RedHat 6.2 machines (does this help anybody find the
problem? It at least makes me feel like part of the competent crowd again)...

[jeff@www jeff]$ ksymoops -K -L -O -v /home/jeff/vmlinux -m /home/jeff/System.map < oops
ksymoops 0.7c on i686 2.2.17pre14. Options used
     -v /home/jeff/vmlinux (specified)
     -K (specified)
     -L (specified)
     -O (specified)
     -m /home/jeff/System.map (specified)

c0114b44
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0114b44>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010092
eax: c14a8000 ebx: 00000023 ecx: 00000000 edx: c0223538
esi: 00000000 edi: c14a8222 ebp: c14a9fcc esp: c14a9fa4
ds: 0018 es: 0018 ss:0018
Process kflushd (pid: 3, stackpage=c14a9000)
Stack: 00000000 c0118fad 00000246 00000000 c14bff48 c14a8000 00000023 c14a8000
       c14a8000 c14a8222 0008e000 c0109300 c14bff3c ffffffff c14a8000 c01bcc27
       00000f00 c14bff70 c012cf70 c0108c2b c14bff3c 00000078 c01f9fd8
Call Trace: [<c0118fad>] [<c0109300>] [<c01bcc27>] [<c012cf70>] [<c0108c2b>]
Code: 8b 01 89 45 dc 21 d8 a9 df ff ff ff 0f 84 f8 00 00 00 8b 55

>>EIP; c0114b44 <__wake_up+74/280> <=====
Trace; c0118fad <do_softirq+5d/80>
Trace; c0109300 <__up_wakeup+8/c>
Trace; c01bcc27 <stext_lock+3bb/3374>
Trace; c012cf70 <bdflush+0/c0>
Trace; c0108c2b <kernel_thread+2b/40>
Code; c0114b44 <__wake_up+74/280>
00000000 <_EIP>:
Code; c0114b44 <__wake_up+74/280> <=====
   0: 8b 01 mov (%ecx),%eax <=====
Code; c0114b46 <__wake_up+76/280>
   2: 89 45 dc mov %eax,0xffffffdc(%ebp)
Code; c0114b49 <__wake_up+79/280>
   5: 21 d8 and %ebx,%eax
Code; c0114b4b <__wake_up+7b/280>
   7: a9 df ff ff ff test $0xffffffdf,%eax
Code; c0114b50 <__wake_up+80/280>
   c: 0f 84 f8 00 00 00 je 10a <_EIP+0x10a> c0114c4e <__wake_up+17e/280>
Code; c0114b56 <__wake_up+86/280>
  12: 8b 55 00 mov 0x0(%ebp),%edx

So it looks as though the code for wake_up (for interrupt??) is wrong.

Is it a coding error or is GCC producing wrong code?

Could somebody please elaborate on what the above messages mean? I'ld be
willing to track the problem down myself and provide a patch but I don't
really know how to find the offending source code lines. I've tried print
statements (yea I know... lame) but it appears that by this point the
kernel is pretty well multi-threaded and then print statements just get
confusing. How do I find the definition/implementation of the __wake_up
function?

So in summary I perceive that these issues need to be addressed or there
is going to be a lot of problems with the new RedHat series...

a) Is there a source code problem with the 2.4-test5 kernel which is causing
   this oops for me?

   many people seem to be running 2.4-test5 without significant problems, so

b) Is this a problem introduced by GCC 2.96?

c) Can we find out why and produce source code work arounds?

d) Why doesn't ksymoops work on RedHat RawHide installations?
   Keith Owens kindly points out that ksymoops shouldn't break line like it
   did for me and that it did so while compiling a regular expression that
   seems to work for everybody else. This happened both for ksymoops 0.7c
   and 2.3.4, so

e) Is there a problem with some sort of regular expression library that
   ships with RedHat rawhide? ksymoops 0.7c works fine on 6.2 so my guess
   is that yes, there is a library problem on 7.0.

Can anybody help with the above issues? I'ld love to get them cleared up.
Keep in mind that I got Rawhide before it was released as 7.0beta. I haven't
tried the above on 7.0beta. I've got to go download that and install it to
see but that will take me a couple of days just for that and I've fallen
behind in my work schedule just dealing with these kernel problems.

Thanks for your input and help,

- Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Mon Jul 31 2000 - 21:00:34 EST