Re: [PATCH] Perf: Correct Assumptions about Sample Timestamps inPasses

From: Joseph Schuchart
Date: Thu Nov 21 2013 - 09:55:49 EST


Sorry for my delayed reply, it's been a busy week and I really wanted to
give Ingo's idea below some thought. Please find my comments below.

>>> If done that way then AFAICS we could even eliminate the
>>> ->max_timestamps[NR_CPUS] array.
>>
>> I can understand your performance concerns. However, I am not
>> sure how we can determine the minimal max_timestamp of all cpus
>> without storing the information on a per-cpu basis first.
>> Accumulating it on the fly would only lead to a global
>> max_timestamp. [...]
>
> Ok. So this:
>
> +static inline void set_next_flush(struct perf_session *session)
> +{
> + int i;
> + u64 min_max_timestamp = session->ordered_samples.max_timestamps[0];
> + for (i = 1; i < MAX_NR_CPUS; i++) {
> + if (min_max_timestamp > session->ordered_samples.max_timestamps[i])
> + min_max_timestamp = session->ordered_samples.max_timestamps[i];
> + }
> + session->ordered_samples.next_flush = min_max_timestamp;
> +}
>
> which should IMHO be written in a bit clearer form as:
>
> static inline void set_next_flush(struct perf_session *session)
> {
> u64 *timestamps = session->ordered_samples.max_timestamps;
> u64 min_timestamp = timestamps[0];
> int i;
>
> for (i = 1; i < MAX_NR_CPUS; i++) {
> if (min_timestamp > timestamps[i])
> min_timestamp = timestamps[i];
> }
>
> session->ordered_samples.next_flush = min_timestamp;
> }

Agreed.

>
> calculates the minimum of the max_timestamps[] array, right?
>
> Now, the max_timestamps[] array gets modified only in a single
> place, from the sample timestamps, via:
>
> os->max_timestamps[sample->cpu] = timestamp;
>
> My suggestion was an identity transformation: to calculate the
> minimum of the array when the max_timestamps[] array is modified.
> A new minimum happens if the freshly written value is smaller
> than the current minimum.

I am really not sure how this would work since I don't see where we
could make progress while flushing if the max_timestamp is only replaced
with a smaller one but is never increased. IMO, it is necessary to
distinguish between timestamps of different cpus to determine the
next_flush timestamp across all cpus in each pass. I have tried to come
up with a working implementation of your idea but do not see a way to
safely increase the value of max_timestamp (without making assumptions
about the order of timestamps between cpus and passes).

Thinking about it a little more I see two catches in my solution:
(1) a cpu is only considered in determining the next_flush timestamp if
at least one event of this cpu has been processed already. Otherwise the
cpu is unknown and its first timestamp could be below the already
flushed timeslice timestamp. This can also happen in the current
implementation.
(2) Flush progress is only guaranteed while events from all cpus are
read. As soon as there are no events left for a cpu which already had
events in this session, the next_flush will remain at the timestamp of
the last pass which had events from all cpus until the final flush. I am
not sure how likely this is. This does not happen in the current
implementation.

As for David's suggestion to read perf_clock() at the start of every
round as a flush timestamp: from looking at the comments in
kernel/sched/clock.c, local_clock() (which is called by perf_clock) is
not guaranteed to be monotonic across cpus so there could still be
earlier timestamps coming late. Or am I missing something there?

Simply increasing the window to more than two passes in the current
implementation might lower the chance of hitting this problem but does
not guarantee success since passes can be very short (only a few events
from a few cpus) or even be empty.

Cheers
Joseph
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/