Re: perf: fuzzer triggered trouble on AMD, maybe ibs related

From: Stephane Eranian
Date: Wed Oct 28 2015 - 03:05:30 EST


On Thu, Oct 22, 2015 at 6:46 PM, Vince Weaver <vincent.weaver@xxxxxxxxx> wrote:
> Hello
>
> I've been busy but finally had a chance to run perf_fuzzer on current git.
> I am running on an AMD A10 system (my traditional Haswell system is
> otherwise occupied).
>
> I got the following WARNING which was followed by an NMI storm which
> eventually managed to confuse ext4 enough that my / partition was
> remounted read-only? Very alarming.
>
> This is in static void perf_ibs_start(struct perf_event *event, int flags):
>
>         if (WARN_ON_ONCE(!(hwc->state & PERF_HES_STOPPED)))
>                 return;
>
I was able to reproduce a similar warning in the generic x86 code:

[ 2357.625987] WARNING: CPU: 2 PID: 17152 at arch/x86/kernel/cpu/perf_event.c:1209 x86_pmu_start+0xa2/0x100()
[ 2357.635775] Modules linked in: cfg80211 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec kvm_amd kvm snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq crct10dif_pclmul crc32_pclmul snd_seq_device snd_timer aesni_intel snd eeepc_wmi asus_wmi aes_x86_64 sparse_keymap lrw video gf128mul glue_helper edac_mce_amd ablk_helper cryptd shpchp edac_core wmi soundcore i2c_piix4 serio_raw 8250_fintek k10temp fam15h_power mac_hid parport_pc ppdev lp parport autofs4 psmouse r8169 ahci libahci mii
[ 2357.687313] CPU: 2 PID: 17152 Comm: perf_fuzzer Not tainted 4.3.0-rc7+ #1
[ 2357.694212] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 PRO, BIOS 1604 10/16/2012
[ 2357.703829] ffffffff81a9f3e0 ffff88021ec83d80 ffffffff8139bed4 0000000000000000
[ 2357.711636] ffff88021ec83db8 ffffffff81078f26 ffff88021ec8c040 ffff8800c9f85000
[ 2357.719430] 0000000000000001 ffff8802131d4868 ffff8802131d4800 ffff88021ec83dc8
[ 2357.727158] Call Trace:
[ 2357.729657] <IRQ> [<ffffffff8139bed4>] dump_stack+0x44/0x60
[ 2357.735573] [<ffffffff81078f26>] warn_slowpath_common+0x86/0xc0
[ 2357.746342] [<ffffffff8107901a>] warn_slowpath_null+0x1a/0x20
[ 2357.756968] [<ffffffff8102b882>] x86_pmu_start+0xa2/0x100
[ 2357.767071] [<ffffffff81169bd9>] perf_event_task_tick+0x239/0x270
[ 2357.777894] [<ffffffff810a2c2b>] scheduler_tick+0x7b/0xd0
[ 2357.788053] [<ffffffff810efbc0>] ? tick_sched_do_timer+0x30/0x30
[ 2357.798693] [<ffffffff810e0ef1>] update_process_times+0x51/0x60
[ 2357.809102] [<ffffffff810ef5e5>] tick_sched_handle.isra.15+0x25/0x60
[ 2357.819956] [<ffffffff810efc00>] tick_sched_timer+0x40/0x70
[ 2357.829943] [<ffffffff810e1a34>] __hrtimer_run_queues+0xe4/0x200
[ 2357.840398] [<ffffffff810e1e58>] hrtimer_interrupt+0xa8/0x1a0
[ 2357.850522] [<ffffffff8104de58>] local_apic_timer_interrupt+0x38/0x60
[ 2357.861370] [<ffffffff8179cca4>] smp_trace_apic_timer_interrupt+0x44/0xab
[ 2357.872524] [<ffffffff8179afb2>] trace_apic_timer_interrupt+0x82/0x90
[ 2357.883314] <EOI>

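This can be explained if the event is not in cpuc->active_mask, as per
the code in x86_pmu_stop() vs. x86_pmu_start(): x86_pmu_stop() only sets
PERF_HES_STOPPED for an event whose bit is set in active_mask, so the
x86_pmu_start() that follows finds the flag clear and hits the
WARN_ON_ONCE. I am still investigating.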
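
To illustrate the suspected sequence, here is a minimal user-space sketch.
It is not the kernel code: the struct and function names (hw_event,
cpu_events, model_pmu_stop, model_pmu_start, simulate_tick) are made up
for this example, and it assumes that the stop path only sets
PERF_HES_STOPPED when the event's bit is set in active_mask, and that the
start path warns when that flag is clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins; the value and limit are illustrative only. */
#define PERF_HES_STOPPED 0x01
#define MAX_COUNTERS     64

struct hw_event {                   /* hypothetical model of the hwc fields used */
	int idx;
	unsigned int state;
};

struct cpu_events {                 /* hypothetical model of cpuc->active_mask */
	unsigned long long active_mask;
};

/* Model of the stop path: state is only updated if the event was active. */
static void model_pmu_stop(struct cpu_events *cpuc, struct hw_event *hwc)
{
	unsigned long long bit;

	assert(hwc->idx >= 0 && hwc->idx < MAX_COUNTERS);
	bit = 1ULL << hwc->idx;

	if (cpuc->active_mask & bit) {
		cpuc->active_mask &= ~bit;
		hwc->state |= PERF_HES_STOPPED;
	}
	/* Event not in active_mask: hwc->state is left untouched. */
}

/* Model of the start path: returns true if the warning would fire. */
static bool model_pmu_start(struct cpu_events *cpuc, struct hw_event *hwc)
{
	assert(hwc->idx >= 0 && hwc->idx < MAX_COUNTERS);

	if (!(hwc->state & PERF_HES_STOPPED))
		return true;        /* stands in for WARN_ON_ONCE(...); return; */

	hwc->state = 0;
	cpuc->active_mask |= 1ULL << hwc->idx;
	return false;
}

/* One stop/start pair, as issued back to back; returns true on warning. */
static bool simulate_tick(bool event_in_active_mask)
{
	struct hw_event hwc = { .idx = 0, .state = 0 };
	struct cpu_events cpuc = {
		.active_mask = event_in_active_mask ? 1ULL : 0ULL,
	};

	model_pmu_stop(&cpuc, &hwc);
	return model_pmu_start(&cpuc, &hwc);
}
```

In this model simulate_tick(true) completes without a warning, while
simulate_tick(false) reports one, which matches the explanation above. It
says nothing about how the event ends up outside active_mask in the
first place.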


> [ 359.629045] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/perf_event_amd_ibs.c:372 perf_ibs_start+0x43/0x131()
> [ 359.639091] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc nls_utf8 nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi kvm_amd kvm sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 snd_hda_intel ablk_helper cryptd snd_hda_codec lrw snd_hda_core gf128mul glue_helper ppdev snd_hwdep hp_wmi snd_pcm evdev sparse_keymap snd_timer pl2303 radeon ttm drm_kms_helper tpm_infineon pcspkr drm efivars psmouse serio_raw i2c_piix4 i2c_algo_bit usbserial fb_sys_fops shpchp k10temp parport_pc snd syscopyarea i2c_core parport soundcore tpm_tis wmi sysfillrect button tpm sysimgblt acpi_cpufreq processor sg sr_mod cdrom sd_mod ohci_pci ahci libahci tg3 xhci_pci ptp pps_core libata xhci_hcd ohci_hcd ehci_pci libphy ehci_hcd crc32c_intel
> [ 359.711502] scsi_mod usbcore usb_common
> [ 359.714203] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.3.0-rc6+ #12
> [ 359.721804] Hardware name: Hewlett-Packard HP Compaq Pro 6305 SFF/1850, BIOS K06 v02.57 08/16/2013
> [ 359.730808] 0000000000000006 ffffffff8123e6b7 0000000000000000 ffffffff8104519a
> [ 359.738322] ffffffff8102a003 ffff880224098c00 ffffe8ffffc036d0 ffffffff81824ec0
> [ 359.745832] ffff88022ec0f8e0 ffffffff8102a003 ffff880224098c00 ffffe8ffffc06a70
> [ 359.753328] Call Trace:
> [ 359.755793] <IRQ> [<ffffffff8123e6b7>] ? dump_stack+0x40/0x50
> [ 359.761762] [<ffffffff8104519a>] ? warn_slowpath_common+0x94/0xa9
> [ 359.767963] [<ffffffff8102a003>] ? perf_ibs_start+0x43/0x131
> [ 359.773730] [<ffffffff8102a003>] ? perf_ibs_start+0x43/0x131
> [ 359.779495] [<ffffffff810d8842>] ? perf_event_task_tick+0x101/0x1b5
> [ 359.785874] [<ffffffff8109476c>] ? tick_sched_do_timer+0x24/0x24
> [ 359.791990] [<ffffffff81063628>] ? scheduler_tick+0x64/0x7d
> [ 359.797673] [<ffffffff810896fd>] ? update_process_times+0x3b/0x45
> [ 359.803876] [<ffffffff810942d3>] ? tick_sched_handle+0x3e/0x4a
> [ 359.809820] [<ffffffff8109479b>] ? tick_sched_timer+0x2f/0x53
> [ 359.815676] [<ffffffff81089f55>] ? __hrtimer_run_queues+0xb9/0x18b
> [ 359.821967] [<ffffffff8108a1e8>] ? hrtimer_interrupt+0x61/0x101
> [ 359.827995] [<ffffffff8102d417>] ? smp_apic_timer_interrupt+0x20/0x2f
> [ 359.834549] [<ffffffff8141e58f>] ? apic_timer_interrupt+0x7f/0x90
> [ 359.840745] <EOI> [<ffffffff8133f769>] ? cpuidle_enter_state+0xf3/0x145
> [ 359.847579] [<ffffffff8106ebab>] ? cpu_startup_entry+0x170/0x1db
> [ 359.853694] [<ffffffff818eddfd>] ? start_kernel+0x40b/0x413
> [ 359.859371] ---[ end trace 93964ed985254224 ]---
> [ 360.468852] Uhhuh. NMI received for unknown reason 2d on CPU 2.
> [ 360.474790] Do you have a strange power saving mode enabled?
> [ 360.480454] Dazed and confused, but trying to continue
> [ 360.695032] Uhhuh. NMI received for unknown reason 2d on CPU 1.
> [ 360.700985] Do you have a strange power saving mode enabled?
> [ 360.706666] Dazed and confused, but trying to continue
> [ 361.739498] Uhhuh. NMI received for unknown reason 3d on CPU 0.
> [ 361.745438] Do you have a strange power saving mode enabled?
> [ 361.751104] Dazed and confused, but trying to continue
> [ 361.828053] Uhhuh. NMI received for unknown reason 3d on CPU 0.
> [ 361.833989] Do you have a strange power saving mode enabled?
> [ 361.839677] Dazed and confused, but trying to continue
>
> .....
>
> [ 468.763231] Dazed and confused, but trying to continue
> [ 468.794184] Uhhuh. NMI received for unknown reason 2d on CPU 2.
> [ 468.794184] Do you have a strange power saving mode enabled?
> [ 468.794184] Dazed and confused, but trying to continue
> [ 473.190535] sd 0:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
> [ 473.199631] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 39 93 49 d0 00 00 18 00
> [ 473.207789] blk_update_request: I/O error, dev sda, sector 965954000
> [ 473.214857] Aborting journal on device sda2-8.
> [ 473.214868] EXT4-fs (sda2): ext4_writepages: jbd2_start: 7158 pages, ino 27394094; err -30
> [ 473.214880] EXT4-fs (sda2): ext4_writepages: jbd2_start: 7168 pages, ino 27395265; err -30
> [ 473.215802] EXT4-fs (sda2): ext4_writepages: jbd2_start: 7168 pages, ino 27394094; err -30
> [ 473.215806] EXT4-fs (sda2): ext4_writepages: jbd2_start: 7168 pages, ino 27395265; err -30
> [ 473.215811] EXT4-fs (sda2): ext4_writepages: jbd2_start: 7168 pages, ino 27394094; err -30
> [ 473.215814] EXT4-fs (sda2): ext4_writepages: jbd2_start: 7168 pages, ino 27395265; err -30
> [ 473.215849] EXT4-fs (sda2): ext4_writepages: jbd2_start: 9223372036854775807 pages, ino 27394094; err -30
> [ 473.215859] EXT4-fs (sda2): ext4_writepages: jbd2_start: 9223372036854775807 pages, ino 27395265; err -30
> [ 473.409076] EXT4-fs error (device sda2): ext4_journal_check_start:56: Detected aborted journal
> [ 473.419003] EXT4-fs (sda2): Remounting filesystem read-only
>