Re: ftrace: trace_pipe_raw interface broken

From: David Sharp
Date: Wed Dec 22 2010 - 19:37:49 EST


On Wed, Dec 22, 2010 at 3:47 PM, David Sharp <dhsharp@xxxxxxxxxx> wrote:
> On Wed, Dec 22, 2010 at 3:23 PM, David Sharp <dhsharp@xxxxxxxxxx> wrote:
>> On Mon, Dec 20, 2010 at 9:59 PM, David Sharp <dhsharp@xxxxxxxxxx> wrote:
>>> On Mon, Dec 20, 2010 at 7:52 PM, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>>>> Hi David,
>>>>
>>>> Thanks for the report.
>>>>
>>>> On Mon, 2010-12-20 at 18:43 -0800, David Sharp wrote:
>>>>> trace_pipe_raw doesn't work correctly, I think because
>>>>> tracing_buffers_read() uses ring_buffer_read_page() incorrectly.
>>>>>
>>>>> In particular, if you read less than one page at a time, then you get
>>>>> much more data than expected from the ring buffer:
>>>>>
>>>>> $ ftrace=/sys/kernel/debug/tracing
>>>>> $ echo 0 > $ftrace/tracing_enabled
>>>>> $ echo 1 > $ftrace/events/enable
>>>>> $ echo > $ftrace/trace  # reset
>>>>> $ echo 100 > $ftrace/buffer_size_kb
>>>>> $ # generate lots of events:
>>>>> $ echo 1 > $ftrace/tracing_enabled && ls -laR /proc/ > /dev/null 2>&1
>>>>> && echo 0 > $ftrace/tracing_enabled
>>>>> $ dd bs=1024 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>>>> 675840
>>>>>
>>>>> That's significantly more data than the buffer could possibly contain.
>>>>> Further shrinking bs increases the amount of data output even more.
>>>>
>>>> Yep that's a side effect. What happens is that we always read a "full
>>>> page" even if we do not read in page size chunks. If we read 100 bytes,
>>>> it will read a header, and all the events that fit inside that 100
>>>> bytes, and then return a full page with the header, the 100 bytes -
>>>> header size, and the rest of the page filled with zeros.
>>>>
>>>> Do a:
>>>>
>>>>        hexdump -C outputfile
>>>>
>>>> and you'll see a lot of "zeros". The reason for this is to keep all
>>>> reads in pagesize chunks. If you read only 100 bytes of data, I still
>>>> need to keep that page aligned in the output. The reason I do this is in
>>>> case there's two readers. I don't want one to read part of a page that
>>>> includes the header, and then the other reads the rest of the page. The
>>>> header tells you how much data is on that page.
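[To make the behavior Steven describes concrete, here's a toy model, not the actual tracing_buffers_read()/ring_buffer_read_page() code; the helper name, the 'H' stand-in header bytes, and the constants are invented for the sketch: a short read still produces a full page -- header first, then the events that fit, then zero padding to the page boundary.]

```c
#include <string.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096
#define MODEL_HDR_SIZE  16   /* timestamp + commit, per the discussion */

/*
 * Toy model: a read of `len` bytes still fills a whole page --
 * header, then the event bytes that fit in `len`, then zeros out
 * to the page boundary.  Returns the number of bytes produced.
 */
static size_t read_page_model(const char *events, size_t avail,
                              size_t len, char *page)
{
    size_t data = len > MODEL_HDR_SIZE ? len - MODEL_HDR_SIZE : 0;

    if (data > avail)
        data = avail;

    memset(page, 0, MODEL_PAGE_SIZE);   /* zero-fill the tail */
    memset(page, 'H', MODEL_HDR_SIZE);  /* stand-in for the header */
    memcpy(page + MODEL_HDR_SIZE, events, data);
    return MODEL_PAGE_SIZE;             /* always a full page out */
}
```

[Output built this way shows exactly the long runs of zeros that hexdump -C collapses into "*" lines.]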
>>>
>>> aha, I see. My assumption that the page headers weren't a needed part
>>> of the trace output was throwing off my analysis.
>>>
>>>> I guess I could have had the info->spare just read the full page, and
>>>> then other readers would just read the next page.
>>>
>>> I don't think multiple readers are a use case we're concerned about, so
>>> either way is fine as far as I'm concerned.
>>>
>>>>> Even when reading >=4096 bytes, something is funny:
>>>>>
>>>>> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>>>> 118784
>>>>>
>>>>> That's 116kB. I expect 100kb to allocate (102400+4087)/4088 = 26
>>>>> pages, or 104kb-worth of pages. 116kB is about 28 pages.
>>>>
>>>> Hmm, isn't page data 4080, not 4088, since it has two 8-byte items in
>>>> it: the timestamp and the commit field (on 64-bit). 4096 - 16 = 4080.
>>>> But still, the numbers are not much different.
>>>
>>> oops, yeah. I rounded 2*8 = 8  ;)
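[For concreteness, the corrected page arithmetic as a small sketch; the macro and function names are made up:]

```c
/* Page arithmetic from the discussion: 4096-byte pages, each with a
   16-byte header (timestamp + commit on 64-bit), leaving 4080 bytes
   of event data per page. */
#define MODEL_PAGE_SIZE 4096
#define MODEL_HDR_SIZE  16
#define DATA_PER_PAGE   (MODEL_PAGE_SIZE - MODEL_HDR_SIZE)   /* 4080 */

/* Pages needed to hold `kb` kilobytes of event data (ceiling). */
static int pages_for_kb(int kb)
{
    return (kb * 1024 + DATA_PER_PAGE - 1) / DATA_PER_PAGE;
}
```

[So buffer_size_kb = 100 comes out to 26 pages, i.e. 26 * 4096 = 106496 bytes (104kB) of page-aligned output.]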
>>>
>>>> Note, when I did it, I got:
>>>>
>>>> $ dd if=/debug/tracing/per_cpu/cpu0/trace_pipe_raw of=/tmp/dat bs=4096
>>>> 27+0 records in
>>>> 27+0 records out
>>>> 110592 bytes (111 kB) copied, 0.00302891 s, 36.5 MB/s
>>>>
>>>> $ cat /debug/tracing/buffer_size_kb
>>>> 100
>>>>
>>>> Ah, running hexdump -C, I got this at the end:
>>>>
>>>> 00019c80  62 57 00 00 22 d1 09 81  ff ff ff ff 76 d1 09 81  |bW..".......v...|
>>>> 00019c90  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>>> 00019ca0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>>> *
>>>> 0001a000  d8 32 b6 7e ed 53 05 00  24 00 00 00 00 00 00 00  |.2.~.S..$.......|
>>>> 0001a010  a8 36 00 00 01 00 00 00  ca 0b 00 00 ff ff ff ff  |.6..............|
>>>> 0001a020  00 00 00 00 6a d1 09 81  ff ff ff ff 68 26 16 81  |....j.......h&..|
>>>> 0001a030  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>>> 0001a040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>>> *
>>>> 0001b000
>>>>
>>>> I don't know why it added that last page. I'll have to look into this.
>>>
>>> Thanks for taking a look!
>>>
>>> I'm actually seeing up to 3 extra pages (29 records) frequently,
>>> although it varies. I even just got 30 pages.
>>>
>>> aha, check this out though, looking at all the page headers:
>>>
>>> $ hexdump -C /tmp/trace_raw | grep '^[0-9a-f]*000 '
>>> 00000000  7d 9a 19 9a fd 01 00 00  74 0d 00 00 00 00 00 00  |}.......t.......|
>>> 00001000  12 22 4a 71 01 02 00 00  38 00 00 00 00 00 00 00  |."Jq....8.......|
>>> 00002000  62 99 60 d7 01 02 00 00  b8 0f 00 c0 ff ff ff ff  |b.`.............|
>>> # negative commit count? or, notice this is the third page.
>>> # Maybe there's garbage at the beginning, rather than the end.
>>> 00003000  e4 cb e2 dc 01 02 00 00  f0 0f 00 00 00 00 00 00  |................|
>>> # <snip>
>>> 00019000  7a 79 01 ef 01 02 00 00  dc 0f 00 00 00 00 00 00  |zy..............|
>>> 0001a000  c7 51 05 ef 01 02 00 00  c0 0f 00 00 00 00 00 00  |.Q..............|
>>> 0001b000  af fe 1b ef 01 02 00 00  34 08 00 00 00 00 00 00  |........4.......|
>>> 0001c000  49 f3 1f ef 01 02 00 00  2c 00 00 00 00 00 00 00  |I.......,.......|
>>>
>>> Interestingly, they all have valid-looking timestamps (in that they
>>> are monotonic), even before the negative commit count.
>>>
>>> Here's the one where I got 30 pages:
>>>
>>> 00000000  1d f5 1f ef 01 02 00 00  68 07 00 00 00 00 00 00  |........h.......|
>>> 00000770  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 00001000  bf 04 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |..Y.J...(.......|
>>> 00001030  00 00 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |................|
>>> 00001040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 00002000  6b 0b 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |k.Y.J...(.......|
>>> 00002030  ff ff ff ff 1d 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> 00002040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 00003000  df b5 6d 68 4a 02 00 00  d4 0f 00 c0 ff ff ff ff  |..mhJ...........|
>>> # negative commit count in the 4th page this time. I see a pattern here...
>>> # also, lots of zeroes in the first 3 pages.
>>> 00004000  94 9b be 6c 4a 02 00 00  d4 0f 00 00 00 00 00 00  |...lJ...........|
>>> # <snip>
>>> 00019000  88 69 4f 91 4a 02 00 00  cc 0f 00 00 00 00 00 00  |.iO.J...........|
>>> 0001a000  6f 4d ab 97 4a 02 00 00  e4 0f 00 00 00 00 00 00  |oM..J...........|
>>> 0001b000  4c 98 2d 9e 4a 02 00 00  d4 0f 00 00 00 00 00 00  |L.-.J...........|
>>> 0001c000  f0 92 57 a3 4a 02 00 00  6c 0e 00 00 00 00 00 00  |..W.J...l.......|
>>> 0001d000  6e f9 24 a4 4a 02 00 00  2c 00 00 00 00 00 00 00  |n.$.J...,.......|
>>>
>>> I don't know what this means yet. Is a negative commit count ever
>>> expected? ring_buffer_read_page does add 1<<30 and 1<<31 when there
>>> are events missing, but I don't think that would result in these bit
>>> patterns.
>>>
>>> That's enough for today. I'll investigate more tomorrow.
>>
>> I did a test that yielded some interesting insights:
>>
>> $ # write the current time to trace_mark continuously from cpu0:
>> $ taskset -c 0 sh -c 'while true ; do echo "time: $(date +%s.%N)" \
>>   > /sys/kernel/debug/tracing/trace_marker ; done' &
>> $ # Let the buffer fill, and note the start and end times:
>> $ echo "start: $(date +%s.%N)" ; echo 1 > $ftrace/tracing_enabled ; sleep 12 ; \
>>   echo 0 > $ftrace/tracing_enabled ; echo "end:   $(date +%s.%N)"
>> start: 1293051826.630452621
>> end:   1293051838.634052658
>> $ # gather result
>> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw of=/tmp/trace_raw
>> 30+0 records in
>> 30+0 records out
>> 122880 bytes (123 kB) copied, 0.000300114 s, 409 MB/s
>>
>> Okay, 30 pages. 4 too many, but I've seen that before.
>>
>> Now, the data. Page 1:
>>
>> 00000000  61 17 63 55 ab 82 00 00  ac 0f 00 00 00 00 00 00  |a.cU............|
>> 00000010  1e 6c b4 f7 04 01 00 00  0c 00 00 00 05 00 00 00  |.l..............|
>> 00000020  66 44 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |fD..............|
>> 00000030  74 69 6d 65 3a 20 31 32  39 33 30 35 31 38 32 36  |time: 1293051826|
>> 00000040  2e 36 33 31 32 34 31 31  36 39 0a 00 0c 00 00 00  |.631241169......|
>>
>> okay, a small time lag (789 microsec) before the trace starts. (If
>> you're asking why you're seeing the beginning of the trace and how
>> there could have been no overflow, you're in the advanced class. I
>> didn't realize this until Page 4...)
>>
>> btw, the trace_marks are occurring on the order of 1ms apart.
>>
>> Page 1 ends at 0fbc, which fits the 0fac value for "commit" in the page header.
>>
>> Now, the whole of Page 2:
>>
>> 00001000  70 25 e0 81 b3 82 00 00  34 00 00 00 00 00 00 00  |p%......4.......|
>> 00001010  6c df ff 01 05 00 00 00  66 44 00 00 ff ff ff ff  |l.......fD......|
>> 00001020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
>> 00001030  39 33 30 35 31 38 32 36  2e 37 31 32 31 31 34 34  |93051826.7121144|
>> 00001040  36 30 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |60..............|
>>
>> Only a single event; nothing remarkable about it other than that it is
>> on a separate page.
>>
>> Page 3 has only a padding event in it:
>>
>> 00002000  6b 24 f0 81 b3 82 00 00  10 00 00 00 00 00 00 00  |k$..............|
>> 00002010  3d 00 00 00 0c 00 00 00  00 00 00 00 00 00 00 00  |=...............|
>>
>> So 0x0fac + 0x34 + 0x10 = 0xff0, plus 16 bytes of header makes one
>> page. So, it seems that was the real first page of the ring buffer.
>> The question is, why did it get split up?
>>
>> Here's where it gets more interesting. Here's Page 4:
>>
>> 00003000  25 56 1a cb b5 82 00 00  d8 0f 00 c0 ff ff ff ff  |%V..............|
>> 00003010  0c 00 00 00 05 00 00 00  66 44 00 00 ff ff ff ff  |........fD......|
>> 00003020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
>> 00003030  39 33 30 35 31 38 33 36  2e 35 34 34 36 31 30 31  |93051836.5446101|
>> 00003040  37 39 0a 00 8c c7 0f 02  05 00 00 00 66 44 00 00  |79..........fD..|
>>
>> Remember the last event on page 2 was at 1293051826.712114460. There
>> are 10 seconds missing!
>>
>> Also, still seeing that strange value for "commit" (0xffffffffc0000fd8).
>>
>> Between the first event on Page4 and the last event, the events seem
>> to be contiguous, very roughly 1msec apart.
>>
>> The last event of the trace is the only event on the last page:
>>
>> 0001d000  50 dc 41 47 b6 82 00 00  34 00 00 00 00 00 00 00  |P.AG....4.......|
>> 0001d010  2c b7 1a 02 05 00 00 00  66 44 00 00 ff ff ff ff  |,.......fD......|
>> 0001d020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
>> 0001d030  39 33 30 35 31 38 33 38  2e 36 33 31 38 37 38 37  |93051838.6318787|
>> 0001d040  39 34 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |94..............|
>>
>> That's at 1293051838.631878794, compared to the recorded "end" time
>> 1293051838.634052658 (diff is 2,173 microsec). So, that very likely is
>> the real end of the trace.
>>
>> This leads to a theory that this has something to do with overflow:
>>
>> $ cat per_cpu/cpu0/stats
>> entries: 0
>> overrun: 16770
>> commit overrun: 0
>>
>> 16770 events where each event is on the order of 1msec apart roughly
>> accounts for 10 seconds of missing time.
>>
>> So, it seems that the ring buffer started out writing to what
>> eventually became the initial reader page. But then as overflow
>> occurs, that page does not get overwritten.
>>
>> There are three outstanding questions in my mind:
>> 1) Why are we seeing the "first" page, which either should have been
>> overwritten, or should not be seen at all?
>> 2) As before: Why the strange "commit" value in the page header for
>> the first contiguous-event page (Page 4)?
>> 3) Why is that first page getting split up?
>>
>> I think I have an idea about (3):
>>
>> in ring_buffer_read_page(), there are three conditions for reading out
>> a page one event at a time. One of them is "reader_page ==
>> commit_page". That seems plausible for a full ring buffer, but I
>> haven't fully grokked that.
>>
>> But, within that path, it seems it should still copy out the whole
>> page. However, the loop condition for copying events is "while (len >
>> size)", where len is the remaining available space in the
>> caller-supplied page, and size is the size of the next event in the
>> ring buffer. That's an off-by-one: if len == size, then there's just
>> enough space.
>>
>> Does this fit? 0x1000 - 0xfbc = 68 bytes for len, but the event in
>> Page 2 is 52 bytes... So no. hmm, another question.
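[A minimal sketch of the off-by-one -- a model of the loop's shape only, not the actual ring_buffer_read_page() code; the function name and the `fixed` switch are invented:]

```c
#include <stddef.h>

/*
 * Model of the copy loop: `sizes` holds the lengths of the events
 * waiting in the ring buffer, `len` is the space remaining in the
 * caller-supplied page.  With the strict `len > size` test, an
 * event that fits exactly (len == size) is left behind.
 */
static int events_copied(const size_t *sizes, int n, size_t len, int fixed)
{
    int copied = 0;

    for (int i = 0; i < n; i++) {
        size_t size = sizes[i];
        int fits = fixed ? (len >= size)   /* corrected test */
                         : (len > size);   /* current, off-by-one */
        if (!fits)
            break;
        len -= size;
        copied++;
    }
    return copied;
}
```

[With 68 bytes left and a 52-byte event followed by a 16-byte event, the strict test stops after the first event even though the second fits exactly.]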
>>
>> But, size is set to "rb_event_ts_length(event)", to ensure space for
>> time extend events to be paired with the event after it. Then in the
>> next loop, "size" bytes are memcpy'd to the caller-supplied page.
>> HOWEVER, the reader is then advanced with rb_advance_reader(), which
>> *does not* advance twice for time-extend events. So, the memcpy will
>> have copied two events, and "pos" is advanced by "size", and "len" is
>> decremented by "size", but then the reader is advanced by *only* one
>> event! This means that time-extend events should be followed by
>> duplicate events. Sure enough, the first event on Page 1 is a time
>> extend, and it is followed by two identical events (I only show the
>> first word of the duplicate above, though).
>>
>> A patch for these two problems is incoming.
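[To make the double-copy concrete, here's a toy model of the mismatch just described; the struct, field names, and helper are invented, and the time-extend pairing stands in for the rb_event_ts_length() behavior:]

```c
#include <string.h>
#include <stddef.h>

struct model_event {
    int  time_extend;  /* 1 if this is a time-extend event */
    char id;           /* stand-in for the event payload */
};

/*
 * Model of the bug: the copy size covers the time extend AND the
 * event paired with it, but the reader then advances past only one
 * event -- so the paired event is copied again on the next pass.
 */
static size_t copy_events(const struct model_event *buf, int n, char *out)
{
    size_t pos = 0;
    int reader = 0;

    while (reader < n) {
        int count = (buf[reader].time_extend && reader + 1 < n) ? 2 : 1;

        for (int i = 0; i < count; i++)
            out[pos++] = buf[reader + i].id;

        reader += 1;   /* bug: should advance by `count` */
    }
    out[pos] = '\0';
    return pos;
}
```

[A time extend 'T' paired with event 'A' comes out as "TA" followed by "A" again, then 'B' -- the duplicate observed on Page 1.]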
>>
>> Sorry for these long messages. If this is too much detail, I can scale it back.
>>
>
> Okay. Now I've tested with the patch (sent separately).
>
> - Extra pages are only seen when I don't reset the buffer before
> running the test again.
> - So far, when I've seen extra pages, I've only seen one extra page.
> - When there's an extra page, it's still full of pre-overflow data, as
> opposed to, say, data from the previous test.
>

Regarding the strange "commit" values:

I noticed just now that the low 16 bits appear to be the correct value
for "commit", bits 16-31 are c000, and bits 32-63 are ffffffff.

ring_buffer_read_page() can store the number of missed events at the
end of the page if there is room. It signals that it has done so with
two bits in "commit": bits 30 and 31, i.e. c0000000. This points to
the ffffffff being a signed-math problem, because these bits are added
using local_add().

Here's how the bits are defined:

/* Flag when events were overwritten */
#define RB_MISSED_EVENTS (1 << 31)
/* Missed count stored at end */
#define RB_MISSED_STORED (1 << 30)

Well, those come out as signed int, and 1<<31 is 0x80000000, aka
INT_MIN. When passed to local_add(), which takes a signed long, that
gets sign-extended to 0xffffffff80000000.
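Here's a minimal demonstration of that sign extension (a sketch: the flag values are the definitions quoted above, while the function name and the 64-bit counter standing in for local_t are made up):

```c
#include <stdint.h>

/* Flag definitions as quoted above: signed-int arithmetic. */
#define RB_MISSED_EVENTS (1 << 31)
#define RB_MISSED_STORED (1 << 30)

/* Model of the local_add() calls on a 64-bit commit counter. */
static int64_t mark_missed(int64_t commit)
{
    commit += RB_MISSED_EVENTS;  /* INT_MIN, sign-extends to ffffffff80000000 */
    commit += RB_MISSED_STORED;  /* 0x40000000 */
    return commit;
}
```

Feeding in a real commit of 0xfd8 reproduces the 0xffffffffc0000fd8 pattern seen in the page headers above.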

Well, mystery solved at least. Now, how should it be fixed? Or is this
intended behavior?

By my count that leaves only one mystery: why are we seeing this extra
page at the beginning with pre-overflow data?