Re: Using ftrace to identify source of excessive latency of USB write
From: Mason
Date: Tue May 06 2014 - 05:23:15 EST
Mason wrote:
> I'm using Linux on a embedded system similar in spec to a desktop PC
> from 15 years ago (256 MB RAM, 800-MHz CPU, USB). The system's primary
> use is recording high-definition digital television programs.
>
> Typically, the storage sub-system consists of a recent hard-disk drive
> connected over USB (Hi-Speed, effective throughput ~20-30 MB/s), using
> a single ext4 partition (journal disabled), mounted noexec+noatime
> (trying to minimize metadata interference).
>
> Typical bit-rate for this HD content ~1-3 MB/s
>
> Data is accumulated in two 800-kB buffers; when one buffer is full,
> it is written to file (using write(2)), which was opened O_SYNC.
> (Note to self: try O_DSYNC instead of O_SYNC)
>
> If I plot the latency of the write(2) operation, 99% of them complete
> in under 80 ms. However, in rare cases, there is a huge latency spike
> (up to 800 ms). If several of these rare outliers occur in a row,
> the recording is messed up.
>
> I am trying to figure out the source of these latency spikes.
>
> It could be the OS, the USB controller, the HDD controller... I was
> hoping I could use ftrace to determine whether the problem came from
> the OS itself. Is that the best tool for the job?
>
> Any recommendations on how to proceed?
A few people suggested other tools, which I'll give a try; but
I wanted to share my current measurements, in case someone sees
something obvious I am missing.
First of all, I'm using kernel 3.1.10 with "vendor patches"
(which may or may not alter performance and/or behavior).
I wrote a simple test app (attached) which times and ftraces a
single synchronized write of 786432 bytes.
The program reported a latency of 275 ms for the operation
(run in SCHED_FIFO -50 on an otherwise idle system)
I used a USB analyzer to "spy" on the bus during the transaction.
According to my analyzer, the data write was split into 7 transactions
(6*122880 + 49512)
- The first write command was issued at 18.307930
- The first write transaction started at 18.310230
- The first write transaction ended at 18.313567
(3337 µs to transfer 122880 bytes, i.e. 36.8 MB/s)
- The last write transaction ended at 18.368911
(58681 µs to transfer 786432 bytes, i.e. 13.4 MB/s not great, but OK)
- Then one more write (4096 bytes, metadata probably) ending at 18.374458
=> so, from the perspective of the bus, the entire transaction
took 66.5 ms, yet from the perspective of the OS, it took 275 ms.
I've attached the (xz compressed) ftrace (trace.one.xz)
If I'm reading the trace right (which is unlikely) the system spent
64 ms in __filemap_fdatawrite_range
(which is close to the "actual" latency of the bus transactions)
212 ms (??) in schedule
0) | schedule() {
0) | __schedule() {
0) | add_preempt_count() {
0) 0.000 us | } /* add_preempt_count */
0) | rcu_note_context_switch() {
0) 0.000 us | } /* rcu_note_context_switch */
0) | _raw_spin_lock_irq() {
0) | add_preempt_count() {
0) 0.000 us | } /* add_preempt_count */
0) 0.000 us | } /* _raw_spin_lock_irq */
0) | put_prev_task_idle() {
0) 0.000 us | } /* put_prev_task_idle */
0) | pick_next_task_fair() {
0) 0.000 us | } /* pick_next_task_fair */
0) | pick_next_task_stop() {
0) 0.000 us | } /* pick_next_task_stop */
0) | pick_next_task_rt() {
0) 0.000 us | } /* pick_next_task_rt */
0) | pick_next_task_fair() {
0) 0.000 us | } /* pick_next_task_fair */
0) | pick_next_task_idle() {
0) 0.000 us | } /* pick_next_task_idle */
0) | _raw_spin_unlock_irq() {
0) | sub_preempt_count() {
0) 0.000 us | } /* sub_preempt_count */
0) 0.000 us | } /* _raw_spin_unlock_irq */
0) | sub_preempt_count() {
0) 0.000 us | } /* sub_preempt_count */
0) 0.000 us | } /* __schedule */
0) ! 212000.0 us | } /* schedule */
Am I reading this right? 212 ms doing nothing?
212+64 is close to the 275 reported by my application.
How can I "drill down" to understand what is happening
inside schedule?
[I would be grateful if you could CC me in your replies. Thanks!]
--
Regards.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sched.h>
#include <time.h>
#define ENABLE_TRACING 1
#define BUFSIZE (192 << 12)
static void clear_trace(void)
{
#if ENABLE_TRACING
int fd = open("/sys/kernel/debug/tracing/trace", O_WRONLY);
if (fd < 0) perror("open trace");
int res = write(fd, &fd, sizeof fd);
if (res != sizeof fd) perror("clear trace");
close(fd);
#endif
}
static void dump_trace(char *save)
{
#if ENABLE_TRACING
int fd = open("/sys/kernel/debug/tracing/trace", O_RDONLY);
if (fd < 0) perror("open trace");
int fd_out = open(save, O_WRONLY | O_CREAT | O_TRUNC, 0600);
if (fd_out < 0) perror("open save");
while ( 1 )
{
char buf[8000];
int res = read(fd, buf, sizeof buf);
if (res <= 0) break;
write(fd_out, buf, res);
}
close(fd_out); close(fd);
#endif
}
static void toggle_tracing(char val)
{
(void)val;
#if ENABLE_TRACING
int fd = open("/sys/kernel/debug/tracing/tracing_on", O_WRONLY);
if (fd < 0) perror("open toggle");
int res = write(fd, &val, 1);
if (res != 1) perror(__func__);
close(fd);
#endif
}
static int64_t now(void)
{
struct timespec spec;
clock_gettime(CLOCK_MONOTONIC, &spec);
return spec.tv_sec*1000000000LL + spec.tv_nsec;
}
int main(void)
{
struct sched_param param = { .sched_priority = 50 };
int err = sched_setscheduler(0, SCHED_FIFO, ¶m);
printf("sched_setscheduler=%d\n", err);
char *buf = malloc(BUFSIZE);
if (buf == NULL) return 2;
memset(buf, '+', BUFSIZE);
int flags = O_WRONLY | O_CREAT | O_TRUNC | O_SYNC;
int fd = open("foob", flags, 0600);
if (fd < 0) return 3;
clear_trace();
strcpy(buf, "WARM UP");
write(fd, buf, 4096);
strcpy(buf, "SHOW TIME");
int64_t t0 = now();
toggle_tracing('1');
write(fd, buf, BUFSIZE);
toggle_tracing('0');
int64_t t1 = now();
close(fd);
printf("LAT = %9lld\n", t1-t0);
dump_trace("/tmp/trace.one");
return 0;
}
Attachment:
trace.one.xz
Description: Binary data