Re: Fix powerTOP regression with 2.6.39-rc5

From: Steven Rostedt
Date: Fri May 06 2011 - 16:20:59 EST


On Fri, 2011-05-06 at 13:08 -0700, Arjan van de Ven wrote:
> From c2db3ab8cff0d1cfb9582fc149df2794978a332e Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
> Date: Thu, 5 May 2011 23:55:18 -0400
> Subject: [PATCH] Regression: partial revert "tracing: Remove lock_depth
> from event entry"
>
> This patch partially reverts commit
> e6e1e2593592a8f6f6380496655d8c6f67431266.
> That commit changed the structure layout of the trace structure, which in
> turn broke PowerTOP (1.9x generation) quite badly.
>
> I appreciate not wanting to expose the variable in question, and
> PowerTOP was
> not using it, so I've replaced the variable with just a padding field....
> ... that way if in the future a new field is needed it can just use this
> padding
> variable.
>

I strongly NACK this!

We can not be locked down in the format of the trace events.

THAT'S WHY I EXPORTED THE FORMATS IN THE FIRST PLACE

User tools that do not parse the formats, are broken. Look at this
patch. It adds nothing but padding, wasted space in the ring buffer.
Why??? Because a tool read raw binary data without using the correct
means to extract it. I already have a user space library that does the
work for you. Perf currently uses it (although, an older version of it,
and it is hardcoded into perf).

Sure, perhaps event names should not be modfied, nor should some fields
without true reason. But come on, lockdepth? This was added by Frederic
to help with the BKL removal. I think he only used it for a very short
time. It probably should have never been there in the first place. But I
do not want to waste 4 bytes of the ring buffer for every event because
of a broken tool.

Note, coming soon, we will probably be removing preempt count, flags,
and even the pid. What then? Is this broken tool going to prevent us
from moving forward?

I could understand this if we did not give you the means to parse the
data. But we did. It's been there as long as the trace events
themselves.

The format of the event format files are the ABI, not the raw data of
the events themselves.

-- Steve

> Signed-off-by: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
> ---
> include/linux/ftrace_event.h | 1 +
> kernel/trace/trace.c | 1 +
> kernel/trace/trace_events.c | 1 +
> 3 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index 22b32af1..b5a550a 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -37,6 +37,7 @@ struct trace_entry {
> unsigned char flags;
> unsigned char preempt_count;
> int pid;
> + int padding;
> };
>
> #define FTRACE_MAX_EVENT \
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index d38c16a..1cb49be 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1110,6 +1110,7 @@ tracing_generic_entry_update(struct trace_entry
> *entry, unsigned long flags,
>
> entry->preempt_count = pc & 0xff;
> entry->pid = (tsk) ? tsk->pid : 0;
> + entry->padding = 0;
> entry->flags =
> #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
> (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index e88f74f..2fe1103 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -116,6 +116,7 @@ static int trace_define_common_fields(void)
> __common_field(unsigned char, flags);
> __common_field(unsigned char, preempt_count);
> __common_field(int, pid);
> + __common_field(int, padding);
>
> return ret;
> }


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/