Re: [PATCH] tracing: Fix oops caused by graph notrace filter

From: Chunyu Hu
Date: Thu Jun 23 2016 - 10:20:41 EST




----- Original Message -----
> From: "Steven Rostedt" <rostedt@xxxxxxxxxxx>
> To: "Chunyu Hu" <chuhu@xxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Sent: Wednesday, June 22, 2016 9:08:14 PM
> Subject: Re: [PATCH] tracing: Fix oops caused by graph notrace filter
>
> On Tue, 21 Jun 2016 18:57:46 +0800
> Chunyu Hu <chuhu@xxxxxxxxxx> wrote:
>
> > wakeup tracer can use function_graph trace when display_graph trace
> > option is setup by user via tracefs, and bypass the set_graph_function
> > and set_graph_notrace. But the bypass of set_graph_notrace is not clean.
> > Although wakeup_graph_entry does most of the bypass, and both the enry
> > and exit event will be submitted to the trace ring buffer, the ret_stack
> > index, which will be assigned to depth field of graph enrty event is not
> > handled. The issue is that the depth is used as the array index of
> > fgraph_cpu_data and can cause an oops when it's negative. irqsoff tracer
> > has same issue. To see the oops:
> >
> > echo 1 > options/display_graph
> > echo schedule > set_graph_notrace
> > echo wakeup > current_tracer
> > cat trace
> > cat trace
>
> I'm not able to trigger an oops, but I do see them being traced.

Actually this needs to 'cat trace' for more than two times, after collecting
three graph entry events at least, as the print showed the depth was
-65534. One example call trace is like this, it's usually the same call trace:

[73710.161617] BUG: unable to handle kernel paging request at ffffe8fdf79836e0
[73710.168616] IP: [<ffffffff81161254>] print_graph_entry+0x3c4/0x3e0
[73710.174815] PGD 272759067 PUD 272758067 PMD 0
[73710.179313] Oops: 0002 [#1] SMP
[73710.182454] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ocrdma ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper be2iscsi ipmi_ssif cryptd ipmi_devintf cdc_ether iscsi_boot_sysfs libiscsi sg ipmi_si scsi_transport_iscsi iTCO_wdt usbnet pcspkr iTCO_vendor_support mii ipmi_msghandler ioatdma shpchp dca lpc_ich i2c_i801 i7core_edac mfd_core edac_core acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_generic pata_acpi drm mptsas ata_piix scsi_transport_sas mptscsih libata crc32c_intel serio_raw be2net mptbase i2c_core bnx2 dm_mirror dm_region_hash dm_log dm_mod
[73710.269323] CPU: 8 PID: 31611 Comm: cat Not tainted 4.7.0-rc3 #1
[73710.275325] Hardware name: IBM System x3550 M3 -[7944I21]-/69Y4438 , BIOS -[D6E157AUS-1.15]- 06/13/2012
[73710.285054] task: ffff88007f213d80 ti: ffff88007f238000 task.ti: ffff88007f238000
[73710.292528] RIP: 0010:[<ffffffff81161254>] [<ffffffff81161254>] print_graph_entry+0x3c4/0x3e0
[73710.301149] RSP: 0018:ffff88007f23bc68 EFLAGS: 00010286
[73710.306458] RAX: ffffe8fdf7a036c0 RBX: ffff88007f2f0000 RCX: ffffffff810ef390
[73710.313585] RDX: ffffffffffff0002 RSI: 0000000000000282 RDI: ffffffff81c81de0
[73710.320713] RBP: ffff88007f23bca8 R08: 0000000000000003 R09: ffff880444a69ea0
[73710.327839] R10: 0000000000000001 R11: 000000000000000b R12: ffff88007f23bce0
[73710.334966] R13: ffff88007f2f1098 R14: ffff880444a69ea0 R15: 0000000000000038
[73710.342094] FS: 00007fda5c740740(0000) GS:ffff880277b00000(0000) knlGS:0000000000000000
[73710.350177] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[73710.355918] CR2: ffffe8fdf79836e0 CR3: 0000000033c32000 CR4: 00000000000006e0
[73710.363044] Stack:
[73710.365059] ffff8802730ce400 ffff880444a69ea0 000000007f23bcc8 ffff88007f2f0000
[73710.372524] ffff88007f2f1098 0000000000000000 0000000000000038 ffff88027557698c
[73710.379991] ffff88007f23bd28 ffffffff81161329 0000000000000aa8 0000000000000994
[73710.387456] Call Trace:
[73710.389906] [<ffffffff81161329>] print_graph_function_flags+0xb9/0x500
[73710.396514] [<ffffffff810ef390>] ? rcu_sched_qs+0x60/0x60
[73710.401998] [<ffffffff8115ea10>] wakeup_print_line+0x20/0x30
[73710.407740] [<ffffffff81158524>] print_trace_line+0x54/0x510
[73710.413481] [<ffffffff81159389>] s_show+0x29/0x150
[73710.418357] [<ffffffff8123ddbe>] seq_read+0x22e/0x370
[73710.423495] [<ffffffff81218b67>] __vfs_read+0x37/0x150
[73710.428716] [<ffffffff81218b35>] ? __vfs_read+0x5/0x150
[73710.434027] [<ffffffff812c9413>] ? security_file_permission+0xa3/0xc0
[73710.440550] [<ffffffff81218b35>] ? __vfs_read+0x5/0x150
[73710.445860] [<ffffffff8121912e>] vfs_read+0x8e/0x140
[73710.450908] [<ffffffff8121a625>] SyS_read+0x55/0xc0
[73710.455872] [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[73710.461356] [<ffffffff816c65a1>] entry_SYSCALL64_slow_path+0x25/0x25
[73710.467791] Code: 42 18 49 89 41 38 48 8b 42 20 49 89 41 40 48 8b 42 28 49 89 41 48 4c 89 e0 e9 93 fd ff ff 89 50 08 e9 95 fc ff ff 49 8b 4c 24 08 <48> 89 4c d0 10 e9 2f ff ff ff 4c 89 e0 e9 74 fd ff ff 66 2e 0f
[73710.487945] RIP [<ffffffff81161254>] print_graph_entry+0x3c4/0x3e0
[73710.494229] RSP <ffff88007f23bc68>
[73710.497714] CR2: ffffe8fdf79836e0


> >
> > Making ftrace_graph_notrace_addr always return false when tracers need
> > to bypass it is a proposed fix.
>
> There's a lot of things wrong with this patch, but before we go there,

Thanks for the attention. Agree :/. , I found one wrong, I read the code
lines again, the graph_ret event won't be added to ring buffer in this case.
And the trying to bypass is not so wise, I was thinking why filters were
ignored in this tracer? overhead reason? Looks like it's not so.

> why should the latency tracers ignore set_graph_notrace?

There are two directions. I went to the wrong. If we resolve this through
enabling this graph notrace filter, it's also providing a small extra function
of this tracer.

> What about this patch?
>
> -- Steve
>
> diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
> index 03cdff84d026..68fcee5097d2 100644
> --- a/kernel/trace/trace_irqsoff.c
> +++ b/kernel/trace/trace_irqsoff.c
> @@ -175,6 +175,9 @@ static int irqsoff_graph_entry(struct ftrace_graph_ent
> *trace)
> int ret;
> int pc;
>
> + if (trace->depth < 0)
> + return 0;
> +

I tried the patch, if just for resolving the oops issue, it's indeed enough.
But if we care about just this filter, and compare this with the trace_graph_entry
filter handling, I found that, the depth 0 item will be submitted to the ring buffer, not
respecting the filter. I still use the same steps. It's possible to show schedule entry:

419.456410 | <idle>-0 | .N.. | schedule_preempt_disabled() {
419.456410 | <idle>-0 | .N.. | schedule() {
------------------------------------------
3) <idle>-0 => rngd-1979
------------------------------------------

419.456425 | rngd-1979 | d... 0.405 us | finish_task_switch();
419.456428 | rngd-1979 | .... | __fdget() {
419.456428 | rngd-1979 | .... 0.913 us | __fget_light();
419.456430 | rngd-1979 | .... 2.103 us | }
419.456431 | rngd-1979 | .... 0.241 us | random_poll();
419.456433 | rngd-1979 | .... | poll_schedule_timeout() {
419.456433 | rngd-1979 | .... | schedule_hrtimeout_range() {
419.456434 | rngd-1979 | .... | schedule_hrtimeout_range_clock() {
419.456435 | rngd-1979 | .... | schedule() {
419.456444 | rngd-1979 | d... 0.000 us | __schedule();
419.456445 | rngd-1979 | d... | /* 1979:120:S ==> [003] 1494:100:R kworker/3:1H */

And compared with trace_graph_entry function, don't we need both the two check?

+ if (trace->depth < 0)
+ return 0;
+ if (ftrace_graph_notrace_addr(trace->func))
+ return 1;



> if (!func_prolog_dec(tr, &data, &flags))
> return 0;
>
> diff --git a/kernel/trace/trace_sched_wakeup.c
> b/kernel/trace/trace_sched_wakeup.c
> index 9d4399b553a3..ce9889bff8b4 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -239,6 +239,9 @@ static int wakeup_graph_entry(struct ftrace_graph_ent
> *trace)
> unsigned long flags;
> int pc, ret = 0;
>
> + if (trace->depth < 0)
> + return 0;
> +

Do't we need this check?
+ if (ftrace_graph_notrace_addr(trace->func))
+ return 1;

> if (!func_prolog_preempt_disable(tr, &data, &pc))
> return 0;
>
>

--
Regards,
Chunyu Hu