Re: [PATCH] perf record: Delete file if a failure occurs writingthe perf data file

From: David Ahern
Date: Tue Nov 12 2013 - 09:51:27 EST


Ingo:

On 11/11/13, 7:43 AM, David Ahern wrote:
On 11/11/13, 2:37 AM, Ingo Molnar wrote:

* David Ahern <dsahern@xxxxxxxxx> wrote:

If perf fails to write data to the data file (e.g., ENOSPC error) it
fails
with the message:
failed to write perf data, error: No space left on device

and stops â killing the workload too. The file is an unknown state.
Trying to read it (e.g., perf report) fails with a SIGBUS error.

Ouch - guys please first investiage that SIGBUS, we should not behave
unexpectedly on _any_ (read: random) perf.data file contents. The SIGBUS
likely suggests that the parsing isn't robust enough.

If you agree with the below summary then any further objections to deleting the file on write failure?

David


I think we know why the SIGBUS is happening. From 'man mmap':


From man mmap:
SIGBUS Attempted access to a portion of the buffer that
does not correspond to the file (for example, beyond
the end of the file, ...


With regards to perf-record, on a write() failure the header is not
updated. From a recent change we try to proceed even though the data
size is 0 - parsing the events we can. We finally hit upon an event that
is only partially in the file (eg., header, but no data for event).
Trying to read the event data leads to the SIGBUS:

Running perf-report in gdb:

WARNING: The /tmp/mnt/perf.data file's data size field is 0 which is
unexpected.
Was the 'perf record' command properly terminated?


Program received signal SIGBUS, Bus error.
perf_evsel__parse_sample (evsel=0x94eec0, event=0x7ffff7ed9d80,
data=0x7fffffffd260)
at util/evsel.c:1242
1242 u16 max_size = event->header.size;
(gdb) bt
#0 perf_evsel__parse_sample (evsel=0x94eec0, event=0x7ffff7ed9d80,
data=0x7fffffffd260)
at util/evsel.c:1242
#1 0x000000000047c9ce in flush_sample_queue (s=0x94e2b0,
tool=0x7fffffffde80)
at util/session.c:542
#2 0x000000000047e2d4 in __perf_session__process_events (session=0x94e2b0,
data_offset=<optimized out>, data_size=<optimized out>,
file_size=1048576, tool=0x7fffffffde80)
at util/session.c:1388
#3 0x000000000042993c in __cmd_report (rep=0x7fffffffde80) at
builtin-report.c:509
#4 cmd_report (argc=0, argv=0x7fffffffe370, prefix=<optimized out>) at
builtin-report.c:967
#5 0x000000000041b063 in run_builtin (p=0x7cdf28, argc=4,
argv=0x7fffffffe370) at perf.c:319
#6 0x000000000041a8e3 in handle_internal_command (argv=0x7fffffffe370,
argc=4) at perf.c:376
#7 run_argv (argv=0x7fffffffe180, argcp=0x7fffffffe18c) at perf.c:420
#8 main (argc=4, argv=0x7fffffffe370) at perf.c:521


Fix by deleting the file on a failure.

That only works around the issue - if the same data file is produced by
some other method (or maliciously) then perf report will still SIGBUS ...

We could handle SIGBUS in the analysis commands too. See the suggestion
I had for handling the output failure using the mmap output option which
uses lngjmp.

David

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/