Re: kernel CI: printk loglevels in kernel boot logs?
From: Fengguang Wu
Date: Tue Nov 21 2017 - 22:27:10 EST
[CC LKML for possible printk improvements]
On Wed, Nov 22, 2017 at 09:56:10AM +0800, Fengguang Wu wrote:
Hi Kevin,
On Tue, Nov 21, 2017 at 12:27:48PM -0800, Kevin Hilman wrote:
Hi Fenguang,
In automated testing, for ease of parsing kernel boot logs (especially
separating warnings and errors from debug, info etc.)
Right now we can get this info from a "dmesg --raw" after bootup, but
it would be really nice in certain automation frameworks to have a
kernel command-line option to enable printing of loglevels in default
boot log.
Agreed.
This is espeically useful when ingesting kernel logs into advanced
search/analytics frameworks (I'm playing with and ELK stack: Elastic
Search, Logstash, Kibana).
Does that sound like a feature you'd be interested in?
Yes, sure.
In kernelCI, we're considering submitting a patch to add a
"show_loglevel" command-line argument to enable that option on kernel
boot.
Thanks for doing that patch! It'll obviously make it easier to catch
various warnings, which will be useful when used with caution,
especially when false warnings (wrt. real problems that should be
fixed) can be effectively filtered out.
[...]
As you may know I'm currently reporting kernel oops in mainline
kernel, hoping to clear noisy oops there -- they obviously hurt
catching and bisecting new oops.
I have to say the warnings are much more noisy than kernel oops
in 2 ways:
1) It's common for a *normal* kernel boot to have a dozen of old
warnings.
2) Many warnings do not necessarily mean something should or could be
fixed -- they may well be mentioning some HW problem, or an alert
message to the user.
So there is a much bigger and messy problem than catching the warnings:
ways to effectively mark or filter real warnings that automated testing
should catch and report.
For filtering, we currently accumulated the below blacklist:
https://github.com/fengguang/lkp-tests/blob/master/etc/kmsg-blacklist
[...]
For the marking part, I wonder if there can be a clear rule that
allows developers to distinguish 2 kind of information for users and
testers:
- "bug" message: indicating a regression that should be reported and fixed
- "fact" message: well there's a problem, but we kernel developers
probably can do nothing about it. It's just to let the user know
about the fact. The fix might be something like replacing a broken
disk drive.
Those message types are orthogonal to severity of the problem (log
levels), so the current log levels are actually not sufficient for
distinguishing these kind of situations.
Here are the list of warning ids waiting for bisect in 0day test farm.
As you may see, only relatively few ones may worth bisecting and
reporting. We've already built a pretty large blacklist, however it's
proved to be awkward and ineffective -- it obviously suffers from both
coverage and expertise problems.
777 hub#-#:#:config_failed,can't_get_hub_status(err#)
328 Firmware_Bug]:TSC_DEADLINE_disabled_due_to_Errata;please_update_microcode_to_version:#(or_later)
31 ACPI_Error:Field[CPB3]at_bit_offset/length#exceeds_size_of_target_Buffer(#bits)(#/dsopcode-#)
30 ACPI_Error:Method_parse/execution_failed~_SB.PMI0._PMC,AE_NOT_EXIST(#/psparse-#)
29 ACPI_Error:Method_parse/execution_failed~_SB.PMI0._GHL,AE_NOT_EXIST(#/psparse-#)
28 ACPI_Error:Region_IPMI(ID=#)has_no_handler(#/exfldio-#)
28 ACPI_Error:No_handler_for_Region[SYSI](#)[IPMI](#/evregion-#)
27 ACPI_Error:Method_parse/execution_failed~_SB._OSC,AE_AML_BUFFER_LIMIT(#/psparse-#)
25 tpm_tpm#:A_TPM_error(#)occurred_attempting_get_random
24 IP-Config:Reopening_network_devices
19 DHCP/BOOTP:Ignoring_fragmented_reply
8 xhci_hcd#:#:#:init#:#:#fail
8 xhci_hcd#:#:#:can't_setup
6 in_atomic():#,irqs_disabled():#,pid:#,name:swapper
4 megaraid_sas#:#:#:Init_cmd_return_status_SUCCESS_for_SCSI_host
4 in_atomic():#,irqs_disabled():#,pid:#,name:kworker
4 drm:drm_atomic_helper_commit_cleanup_done[drm_kms_helper]]*ERROR*[CRTC:#:pipe_A]flip_done_timed_out
2 print_req_error:I/O_error,dev_loop#,sector
2 IP-Config:Failed_to_open_erspan0
2 in_atomic():#,irqs_disabled():#,pid:#,name:systemd-udevd
2 in_atomic():#,irqs_disabled():#,pid:#,name:perf
2 acerhdf:unknown(unsupported)BIOS_version_QEMU/Standard_PC(i440FX+PIIX,#).#-#,please_report,aborting
1 XFS(sda8):metadata_I/O_error:block#(~xfs_readlink_bmap_ilocked~)error#numblks
1 XFS(sda8):Metadata_CRC_error_detected_at_xfs_symlink_read_verify[xfs],xfs_symlink_block
1 XFS(sda8):Metadata_CRC_error_detected_at_xfs_dir3_data_read_verify[xfs],xfs_dir3_data_block
1 XFS(sda8):Metadata_CRC_error_detected_at_xfs_da3_node_read_verify[xfs],xfs_da3_node_block
1 XFS(sda8):Metadata_CRC_error_detected_at_xfs_attr3_leaf_read_verify[xfs],xfs_attr3_leaf_block
1 Memory_failure:#:recovery_action_for_huge_page:Recovered
1 Memory_failure:#:recovery_action_for_clean_LRU_page:Recovered
1 Memory_failure:#:Killing_tinjpage:#due_to_hardware_memory_corruption
1 mce:[Hardware_Error]:TSC#MISC#df87b000d9eff
1 in_atomic():#,irqs_disabled():#,pid:#,name:xargs
1 in_atomic():#,irqs_disabled():#,pid:#,name:triad_loop
1 in_atomic():#,irqs_disabled():#,pid:#,name:tchain_edit
1 in_atomic():#,irqs_disabled():#,pid:#,name:sh
1 in_atomic():#,irqs_disabled():#,pid:#,name:sed
1 in_atomic():#,irqs_disabled():#,pid:#,name:run.sh
1 in_atomic():#,irqs_disabled():#,pid:#,name:perf_test
1 in_atomic():#,irqs_disabled():#,pid:#,name:noploop.sh
1 in_atomic():#,irqs_disabled():#,pid:#,name:jbd2/sda1
1 in_atomic():#,irqs_disabled():#,pid:#,name:grep
1 in_atomic():#,irqs_disabled():#,pid:#,name:fallocate
1 in_atomic():#,irqs_disabled():#,pid:#,name:dmesg
1 in_atomic():#,irqs_disabled():#,pid:#,name:cc1
1 do_IRQ:#No_irq_handler_for_vector
1 Buffer_I/O_error_on_dev_dm-#,logical_block#,async_page_read
We already have a well known answer to "marking bugs", ie. to dump a
pile of call trace that no one can tolerate.
The down side is, for some kind of bugs the developer may not need the
call trace at all. And the call trace might make users unnecessarily
nervous. That's why we have this email thread -- to catch regressions
indicated by some 1-liner kernel warnings. For that kind of situation,
it may be enough to add a common prefix for such messages. For example,
[kernel bug] your warning/error printk message
~~~~~~~~~~~~
Alternatively, some printk messages already clearly stated
"please report" (they seem more targeted for users than testers):
wfg /c/linux% git grep '".*please report'|head
block/elevator.c:645: "(nr_sorted=%u), please report this\n",
drivers/ata/pata_hpt37x.c:867: pr_err("Unknown HPT366 subtype, please report (%d)\n",
drivers/ata/pata_hpt37x.c:908: pr_err("PCI table is bogus, please report (%d)\n", dev->device);
drivers/ata/pata_hpt3x2n.c:532: pr_err("PCI table is bogus, please report (%d)\n", dev->device);
drivers/edac/mce_amd.c:725: " please report on LKML.\n");
drivers/gpu/drm/amd/amdgpu/amdgpu_display.c:395: DRM_INFO(" DDC: no ddc bus - possible BIOS bug - please report to xorg-driver-ati@xxxxxxxxxxx\n");
drivers/gpu/drm/nouveau/nouveau_bios.c:1458: "please report\n");
drivers/gpu/drm/radeon/radeon_display.c:803: DRM_INFO(" DDC: no ddc bus - possible BIOS bug - please report to xorg-driver-ati@xxxxxxxxxxx\n");
drivers/hwmon/ibmaem.c:773: "Unknown AEM v%d; please report this to the maintainer.\n",
drivers/hwmon/w83781d.c:1367: "If reset=1 solved a problem you were having, please report!\n");
Thanks,
Fengguang