Re: ftrace: trace_pipe_raw interface broken

From: David Sharp
Date: Wed Dec 22 2010 - 18:47:52 EST


On Wed, Dec 22, 2010 at 3:23 PM, David Sharp <dhsharp@xxxxxxxxxx> wrote:
> On Mon, Dec 20, 2010 at 9:59 PM, David Sharp <dhsharp@xxxxxxxxxx> wrote:
>> On Mon, Dec 20, 2010 at 7:52 PM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>>> Hi David,
>>>
>>> Thanks for the report.
>>>
>>> On Mon, 2010-12-20 at 18:43 -0800, David Sharp wrote:
>>>> trace_pipe_raw doesn't work correctly, I think because
>>>> tracing_buffers_read() uses ring_buffer_read_page() incorrectly.
>>>>
>>>> In particular, if you read less than one page at a time, then you get
>>>> much more data than expected from the ring buffer:
>>>>
>>>> $ ftrace=/sys/kernel/debug/tracing
>>>> $ echo 0 > $ftrace/tracing_enabled
>>>> $ echo 1 > $ftrace/events/enable
>>>> $ echo > $ftrace/trace  # reset
>>>> $ echo 100 > $ftrace/buffer_size_kb
>>>> $ # generate lots of events:
>>>> $ echo 1 > $ftrace/tracing_enabled && ls -laR /proc/ > /dev/null 2>&1
>>>> && echo 0 > $ftrace/tracing_enabled
>>>> $ dd bs=1024 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>>> 675840
>>>>
>>>> That's significantly more data than the buffer could possibly contain.
>>>> Further shrinking bs increases the amount of data output even more.
>>>
>>> Yep, that's a side effect. What happens is that we always read a "full
>>> page" even if we do not read in page-size chunks. If we read 100 bytes,
>>> it will read a header and all the events that fit inside those 100
>>> bytes, and then return a full page: the header, (100 bytes minus the
>>> header size) worth of events, and the rest of the page filled with zeros.
>>>
>>> Do a:
>>>
>>>        hexdump -C outputfile
>>>
>>> and you'll see a lot of "zeros". The reason for this is to keep all
>>> reads in page-size chunks. If you read only 100 bytes of data, I still
>>> need to keep that page aligned in the output. The reason I do this is in
>>> case there are two readers. I don't want one to read part of a page that
>>> includes the header and then have the other read the rest of the page. The
>>> header tells you how much data is on that page.
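
(A minimal user-space sketch of that rule, just for illustration: read whole
pages and let the page header say how much of each page is real data. The
path, the 4096-byte page size, and the 16-byte header layout (u64 timestamp
followed by a u64 "commit" byte count on 64-bit) are assumptions on my part,
not something this code verifies.)

/*
 * Rough sketch, not the kernel's own reader: pull trace_pipe_raw one full
 * page at a time and print how much of each page is data vs. zero padding.
 * Assumes a 64-bit kernel where the page header is a u64 timestamp followed
 * by a u64 commit byte count.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096

int main(void)
{
        unsigned char page[PAGE_SZ];
        int fd = open("/sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe_raw",
                      O_RDONLY);
        ssize_t n;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        while ((n = read(fd, page, PAGE_SZ)) >= 16) {
                uint64_t ts, commit;

                memcpy(&ts, page, 8);           /* native-endian, like the dump */
                memcpy(&commit, page + 8, 8);
                printf("read %zd bytes: ts=%llu, %llu data bytes, rest padding\n",
                       n, (unsigned long long)ts, (unsigned long long)commit);
        }
        close(fd);
        return 0;
}
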
>>
>> aha, I see. My assumption that the page headers weren't a needed part
>> of the trace output was throwing off my analysis.
>>
>>> I guess I could have had the info->spare just read the full page, and
>>> then other readers would just read the next page.
>>
>> I don't think multiple readers are a use case we're concerned about, so
>> either way is fine as far as I'm concerned.
>>
>>>> Even when reading >=4096 bytes, something is funny:
>>>>
>>>> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>>> 118784
>>>>
>> That's 116 kB. I expect 100 kB to allocate (102400+4087)/4088 = 26
>> pages, or 104 kB worth of pages. 116 kB is 29 pages.
>>>
>>> Hmm, isn't page data 4080, not 4088, since it has two 8-byte items in it?
>>> Timestamp and atomic (on 64-bit). 4096 - 16 = 4080. But still, the numbers
>>> are not much different.
>>
>> oops, yeah. I rounded 2*8 = 8 ;)
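
(For the record, with 4080 data bytes per page the arithmetic barely changes:
ceil(102400 / 4080) = 26 pages, and 26 * 4096 = 106496 bytes, i.e. about
104 kB of raw output expected.)
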
>>
>>> Note, when I did it, I got:
>>>
>>> $ dd if=/debug/tracing/per_cpu/cpu0/trace_pipe_raw of=/tmp/dat bs=4096
>>> 27+0 records in
>>> 27+0 records out
>>> 110592 bytes (111 kB) copied, 0.00302891 s, 36.5 MB/s
>>>
>>> $ cat /debug/tracing/buffer_size_kb
>>> 100
>>>
>>> Ah, running hexdump -C, I got this at the end:
>>>
>>> 00019c80  62 57 00 00 22 d1 09 81  ff ff ff ff 76 d1 09 81  |bW.."......v...|
>>> 00019c90  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> 00019ca0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 0001a000  d8 32 b6 7e ed 53 05 00  24 00 00 00 00 00 00 00  |.2.~.S..$.......|
>>> 0001a010  a8 36 00 00 01 00 00 00  ca 0b 00 00 ff ff ff ff  |.6..............|
>>> 0001a020  00 00 00 00 6a d1 09 81  ff ff ff ff 68 26 16 81  |....j.......h&..|
>>> 0001a030  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> 0001a040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 0001b000
>>>
>>> I don't know why it added that last page. I'll have to look into this.
>>
>> Thanks for taking a look!
>>
>> I'm actually seeing up to 3 extra pages (29 records) frequently,
>> although it varies. I even just got 30 pages.
>>
>> aha, check this out though, looking at all the page headers:
>>
>> $ hexdump -C /tmp/trace_raw | grep '^[0-9a-f]*000 '
>> 00000000  7d 9a 19 9a fd 01 00 00  74 0d 00 00 00 00 00 00  |}.......t.......|
>> 00001000  12 22 4a 71 01 02 00 00  38 00 00 00 00 00 00 00  |."Jq....8.......|
>> 00002000  62 99 60 d7 01 02 00 00  b8 0f 00 c0 ff ff ff ff  |b.`.............|
>> # negative commit count? or, notice this is the third page.
>> # Maybe there's garbage at the beginning, rather than the end.
>> 00003000  e4 cb e2 dc 01 02 00 00  f0 0f 00 00 00 00 00 00  |................|
>> # <snip>
>> 00019000  7a 79 01 ef 01 02 00 00  dc 0f 00 00 00 00 00 00  |zy..............|
>> 0001a000  c7 51 05 ef 01 02 00 00  c0 0f 00 00 00 00 00 00  |.Q..............|
>> 0001b000  af fe 1b ef 01 02 00 00  34 08 00 00 00 00 00 00  |........4.......|
>> 0001c000  49 f3 1f ef 01 02 00 00  2c 00 00 00 00 00 00 00  |I.......,.......|
>>
>> Interestingly, they all have valid-looking timestamps (in that they
>> are monotonic), even before the negative commit count.
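
(Here is roughly the same check done programmatically on the saved dump
instead of with hexdump | grep -- a sketch only; the file name and the
16-byte header layout (u64 timestamp, u64 commit) are assumptions:)

/*
 * Walk a saved trace_pipe_raw dump (e.g. /tmp/trace_raw) page by page and
 * print each page header, mirroring the hexdump | grep output above.
 * Assumes 4096-byte pages with a 16-byte header: u64 timestamp, u64 commit.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        unsigned char page[4096];
        FILE *f = fopen(argc > 1 ? argv[1] : "/tmp/trace_raw", "rb");
        long off = 0;

        if (!f) {
                perror("fopen");
                return 1;
        }
        while (fread(page, 1, sizeof(page), f) == sizeof(page)) {
                uint64_t ts, commit;

                memcpy(&ts, page, 8);
                memcpy(&commit, page + 8, 8);
                /* a "negative" commit shows up here as a huge value,
                 * e.g. 0xffffffffc0000fb8 */
                printf("%#08lx  ts=%llu  commit=%#llx\n",
                       off, (unsigned long long)ts, (unsigned long long)commit);
                off += sizeof(page);
        }
        fclose(f);
        return 0;
}
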
>>
>> Here's the one where I got 30 pages:
>>
>> 00000000  1d f5 1f ef 01 02 00 00  68 07 00 00 00 00 00 00  |........h.......|
>> 00000770  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 00001000  bf 04 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |..Y.J...(.......|
>> 00001030  00 00 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |................|
>> 00001040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 00002000  6b 0b 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |k.Y.J...(.......|
>> 00002030  ff ff ff ff 1d 00 00 00  00 00 00 00 00 00 00 00  |................|
>> 00002040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 00003000  df b5 6d 68 4a 02 00 00  d4 0f 00 c0 ff ff ff ff  |..mhJ...........|
>> # negative commit count in the 4th page this time. I see a pattern here...
>> # also, lots of zeroes in the first 3 pages.
>> 00004000  94 9b be 6c 4a 02 00 00  d4 0f 00 00 00 00 00 00  |...lJ...........|
>> # <snip>
>> 00019000  88 69 4f 91 4a 02 00 00  cc 0f 00 00 00 00 00 00  |.iO.J...........|
>> 0001a000  6f 4d ab 97 4a 02 00 00  e4 0f 00 00 00 00 00 00  |oM..J...........|
>> 0001b000  4c 98 2d 9e 4a 02 00 00  d4 0f 00 00 00 00 00 00  |L.-.J...........|
>> 0001c000  f0 92 57 a3 4a 02 00 00  6c 0e 00 00 00 00 00 00  |..W.J...l.......|
>> 0001d000  6e f9 24 a4 4a 02 00 00  2c 00 00 00 00 00 00 00  |n.$.J...,.......|
>>
>> I don't know what this means yet. Is a negative commit count ever
>> expected? ring_buffer_read_page does add 1<<30 and 1<<31 when there
>> are events missing, but I don't think that would result in these bit
>> patterns.
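
(One more thought on that, for what it's worth: if those flags really are
added as plain signed-int constants, i.e. literally (1 << 31) and (1 << 30),
into a 64-bit commit field, then the sign extension of (1 << 31) would
produce exactly this kind of bit pattern. A quick user-space check of just
that arithmetic -- this is not the kernel code, only a sanity test of the
idea:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        int64_t commit = 0xfb8;         /* a plausible real byte count */
        int32_t stored = 1 << 30;       /* 0x40000000 */
        int32_t missed = INT32_MIN;     /* what (1 << 31) is as a 32-bit int */

        commit += stored;
        commit += missed;               /* sign-extends to 0xffffffff80000000 */
        printf("%#llx\n", (unsigned long long)commit);
        /* prints 0xffffffffc0000fb8 -- the same pattern as in the dump */
        return 0;
}
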
>>
>> That's enough for today. I'll investigate more tomorrow.
>
> I did a test that yielded some interesting insights:
>
> $ # write the current time to trace_mark continuously from cpu0:
> $ taskset -c 0 sh -c 'while true ; do echo "time: $(date +%s.%N)" \
>   > /sys/kernel/debug/tracing/trace_marker ; done' &
> $ # Let the buffer fill, and note the start and end times:
> $ echo "start: $(date +%s.%N)" ; echo 1 > $ftrace/tracing_enabled ; sleep 12 ; \
>   echo 0 > $ftrace/tracing_enabled ; echo "end:   $(date +%s.%N)"
> start: 1293051826.630452621
> end:   1293051838.634052658
> $ # gather result
> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw of=/tmp/trace_raw
> 30+0 records in
> 30+0 records out
> 122880 bytes (123 kB) copied, 0.000300114 s, 409 MB/s
>
> Okay, 30 pages. 4 too many, but I've seen that before.
>
> Now, the data. Page 1:
>
> 00000000  61 17 63 55 ab 82 00 00  ac 0f 00 00 00 00 00 00  |a.cU............|
> 00000010  1e 6c b4 f7 04 01 00 00  0c 00 00 00 05 00 00 00  |.l..............|
> 00000020  66 44 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |fD..............|
> 00000030  74 69 6d 65 3a 20 31 32  39 33 30 35 31 38 32 36  |time: 1293051826|
> 00000040  2e 36 33 31 32 34 31 31  36 39 0a 00 0c 00 00 00  |.631241169......|
>
> okay, a small time lag (789 microsec) before the trace starts. (If
> you're asking why you're seeing the beginning of the trace and how
> there could have been no overflow, you're in the advanced class. I
> didn't realize this until Page 4...)
>
> btw, the trace_marks are occurring on the order of 1ms apart.
>
> Page 1 ends at 0fbc, which fits the 0fac value for "commit" in the page header.
>
> Now, the whole of Page 2:
>
> 00001000  70 25 e0 81 b3 82 00 00  34 00 00 00 00 00 00 00  |p%......4.......|
> 00001010  6c df ff 01 05 00 00 00  66 44 00 00 ff ff ff ff  |l.......fD......|
> 00001020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
> 00001030  39 33 30 35 31 38 32 36  2e 37 31 32 31 31 34 34  |93051826.7121144|
> 00001040  36 30 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |60..............|
>
> Only a single event; nothing remarkable about it other than that it is
> on a separate page.
>
> Page 3 has only a padding event in it:
>
> 00002000  6b 24 f0 81 b3 82 00 00  10 00 00 00 00 00 00 00  |k$..............|
> 00002010  3d 00 00 00 0c 00 00 00  00 00 00 00 00 00 00 00  |=...............|
>
> So 0x0fac + 0x34 + 0x10 = 0xff0, plus 16 bytes of header makes one
> page. So, it seems that was the real first page of the ring buffer.
> The question is, why did it get split up?
>
> Here's where it gets more interesting. Here's Page 4:
>
> 00003000  25 56 1a cb b5 82 00 00  d8 0f 00 c0 ff ff ff ff  |%V..............|
> 00003010  0c 00 00 00 05 00 00 00  66 44 00 00 ff ff ff ff  |........fD......|
> 00003020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
> 00003030  39 33 30 35 31 38 33 36  2e 35 34 34 36 31 30 31  |93051836.5446101|
> 00003040  37 39 0a 00 8c c7 0f 02  05 00 00 00 66 44 00 00  |79..........fD..|
>
> Remember the last event on page 2 was at 1293051826.712114460. There
> are 10 seconds missing!
>
> Also, still seeing that strange value for "commit" (0xffffffffc0000fd8).
>
> Between the first event on Page 4 and the last event, the events seem
> to be contiguous, very roughly 1 msec apart.
>
> The last event of the trace is the only event on the last page:
>
> 0001d000  50 dc 41 47 b6 82 00 00  34 00 00 00 00 00 00 00  |P.AG....4.......|
> 0001d010  2c b7 1a 02 05 00 00 00  66 44 00 00 ff ff ff ff  |,.......fD......|
> 0001d020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
> 0001d030  39 33 30 35 31 38 33 38  2e 36 33 31 38 37 38 37  |93051838.6318787|
> 0001d040  39 34 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |94..............|
>
> That's at 1293051838.631878794, compared to the recorded "end" time
> 1293051838.634052658 (diff is 2,173 microsec). So, that very likely is
> the real end of the trace.
>
> This leads to a theory that this has something to do with overflow:
>
> $ cat per_cpu/cpu0/stats
> entries: 0
> overrun: 16770
> commit overrun: 0
>
> 16770 events, each on the order of 1 msec apart, roughly account for
> the 10 seconds of missing time.
>
> So, it seems that the ring buffer started out writing to what
> eventually became the initial reader page. But then, as overflow
> occurred, that page did not get overwritten.
>
> There are three outstanding questions in my mind:
> 1) Why are we seeing the "first" page, which either should have been
> overwritten, or should not be seen at all?
> 2) As before: Why the strange "commit" value in the page header for
> the first contiguous-event page (Page 4)?
> 3) Why is that first page getting split up?
>
> I think I have an idea about (3):
>
> In ring_buffer_read_page(), there are three conditions for reading out
> a page one event at a time. One of them is "reader_page ==
> commit_page". That seems plausible for a full ring buffer, but I
> haven't fully grokked it.
>
> But, within that path, it seems it should still copy out the whole
> page. However, the loop condition for copying events is "while (len >
> size)", where len is the remaining available space in the
> caller-supplied page and size is the size of the next event in the
> ring buffer. So that's off by one: if len == size, then there's just
> enough space.
>
> Does this fit? 0x1000 - 0xfbc = 68 bytes for len, but the event in
> Page 2 is 52 bytes... So no. Hmm, another question.
>
> But, size is set to "rb_event_ts_length(event)" to ensure space for
> time-extend events to be paired with the events after them. Then in the
> next loop, "size" bytes are memcpy'd to the caller-supplied page.
> HOWEVER, the reader is then advanced with rb_advance_reader(), which
> *does not* advance twice for time-extend events. So, the memcpy will
> have copied two events, and "pos" is advanced by "size", and "len" is
> decremented by "size", but then the reader is advanced by *only* one
> event! This means that time-extend events should be followed by
> duplicate events. Sure enough, the first event on Page 1 is a time
> extend, and it is followed by two identical events (I only show the
> first word of the duplicate above, though).
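
(To convince myself, a toy user-space model of that copy loop -- made-up
event sizes and names, none of this is the real ring-buffer code -- showing
how copying the paired length while advancing only one event repeats the
event that follows a time extend:)

/*
 * Toy model, not kernel code: simulate the copy-one-event-at-a-time path.
 * The copy length comes from the "paired" length (time extend + following
 * event), but the reader advances by only one event, so the event after a
 * time extend gets copied a second time.
 */
#include <stdio.h>

struct ev { int len; int is_time_extend; const char *name; };

/* made-up event stream: a time extend paired with marker A, then B and C */
static struct ev evs[] = {
        { 12, 1, "TIME_EXTEND" },
        { 52, 0, "marker A" },
        { 52, 0, "marker B" },
        { 52, 0, "marker C" },
};
static const int nr_evs = 4;
static int rd;                          /* reader position (event index) */

static int event_ts_length(int i)
{
        /* like rb_event_ts_length(): a time extend is measured together
         * with the event that follows it */
        if (evs[i].is_time_extend && i + 1 < nr_evs)
                return evs[i].len + evs[i + 1].len;
        return evs[i].len;
}

static void advance_reader(void)
{
        rd++;                           /* one event only, never two */
}

int main(void)
{
        int len = 4080;                 /* space left in the output page */

        while (rd < nr_evs && len > event_ts_length(rd)) {
                int size = event_ts_length(rd);

                printf("copy %2d bytes starting at \"%s\"\n",
                       size, evs[rd].name);
                len -= size;
                advance_reader();
        }
        /* The 64-byte copy already included marker A, yet the next
         * iteration copies marker A again: the duplicate seen on Page 1. */
        return 0;
}
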
>
> A patch for these two problems is incoming.
>
> Sorry for these long messages. If this is too much detail, I can scale it back.
>

Okay. Now I've tested with the patch (sent separately).

- Extra pages are only seen when I don't reset the buffer before
running the test again.
- So far, when I've seen extra pages, I've only seen one extra page.
- When there's an extra page, it's still full of pre-overflow data, as
opposed to, say, data from the previous test.