Re: [lkp-robot] [x86] ed3ce2a917: BUG:unable_to_handle_kernel

From: Fengguang Wu
Date: Wed Mar 08 2017 - 21:34:19 EST

On Thu, Mar 09, 2017 at 10:13:10AM +0800, Ye Xiaolong wrote:
On 03/02, Borislav Petkov wrote:

On Thu, Mar 02, 2017 at 09:09:34AM +0800, kernel test robot wrote:

FYI, we noticed the following commit:

commit: ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f ("x86: Optimize clear_page()")

in testcase: will-it-scale
with following parameters:

test: poll2
cpufreq_governor: performance

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.

thanks for the report, I was able to reproduce.

BUT(!) this report is misleading because it talks about will-it-scale
but your splat happens when you kexec the kernel:

[ 336.340747] LKP: kexec loading...
[ 336.340852]
[ 336.343323] kexec --noefi -l /tmp/cache/pkg/linux/x86_64-rhel-7.2/gcc-6/ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f/vmlinuz-4.9.0-rc6-00134-ged3ce2a --initrd=/tmp/cache/initrd-concatenated
[ 336.343758]
[ 337.893471] --append=ip=::::lkp-ivb-d01::dhcp root=/dev/ram0 user=lkp job=/lkp/scheduled/lkp-ivb-d01/will-it-scale-poll2-performance-debian-x86_64-2016-08-31.cgz-ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f-20170301-28072-1dqjyhl-11.yaml ARCH=x86_64 kconfig=x86_64-rhel-7.2 branch=linux-devel/devel-hourly-2017022612 commit=ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f BOOT_IMAGE=/pkg/linux/x86_64-rhel-7.2/gcc-6/ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f/vmlinuz-4.9.0-rc6-00134-ged3ce2a max_uptime=1500 RESULT_ROOT=/result/will-it-scale/poll2-performance/lkp-ivb-d01/debian-x86_64-2016-08-31.cgz/x86_64-rhel-7.2/gcc-6/ed3ce2a9172457ef7dbaa9f964e63dfde2bdcb5f/11 LKP_SERVER=inn debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_
[ 337.895521]
[ 339.467661] BUG: unable to handle kernel paging request at ffff8803cf2e2008
[ 339.468000] IP: [<ffffffff81061e71>] native_set_pmd+0x1/0x10

Maybe Fengguang has an idea what to do here, maybe something like adding
markers to the log to denote where the test environment is prepared and
when the actual test starts. Then grep for those and generate the report
based on that...

Thanks for the suggestions. We'll keep improving the reports so that they
are less confusing or misleading.

One possible improvement is to provide "lkp qemu" reproduce steps for
kernel oops -- it would be way more convenient and safe to follow than
"lkp run", since the latter risks hanging the physical machine.

As for the test description, the dmesg carries markers for the user
space test start/stop points, so the robot can easily tell whether the
oops happened during the test or before/after it -- the latter may
well (but not always) indicate that the oops is related not to the
testcase but to the regular kernel boot/reboot/kexec process.
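
A rough sketch of that check, to make the idea concrete. The marker
strings TEST_START and TEST_END and the function name classify_oops are
made up for illustration; the actual markers in the LKP dmesg are not
shown in this thread, and this is not the robot's real code.

```python
import re

# Hypothetical marker strings; the real LKP start/stop markers may differ.
TEST_START = "LKP: test start"
TEST_END = "LKP: test end"

# Patterns that commonly introduce a kernel oops or panic in dmesg.
OOPS_RE = re.compile(r"BUG: unable to handle kernel|Oops:|Kernel panic")


def classify_oops(dmesg_lines):
    """Return 'before', 'during' or 'after' for the first oops, relative
    to the test markers, or None if no oops line is found."""
    phase = "before"
    for line in dmesg_lines:
        if TEST_START in line:
            phase = "during"
        elif TEST_END in line:
            phase = "after"
        elif OOPS_RE.search(line):
            return phase
    return None


if __name__ == "__main__":
    # The first two lines are invented; the last two are from the report.
    log = [
        "[  100.000000] LKP: test start",
        "[  300.000000] LKP: test end",
        "[  336.340747] LKP: kexec loading...",
        "[  339.467661] BUG: unable to handle kernel paging request at ffff8803cf2e2008",
    ]
    print(classify_oops(log))
```

An oops classified as "before" or "after" would then be reported as a
boot/reboot/kexec problem rather than under the testcase name, with the
caveat above that this is a hint, not a certainty.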