Re: [Bug #12465] KVM guests stalling on 2.6.28 (bisected)

From: Frederic Weisbecker
Date: Tue Mar 17 2009 - 20:20:25 EST


On Tue, Mar 17, 2009 at 09:25:37AM +1030, Kevin Shanahan wrote:
> On Mon, 2009-03-16 at 21:07 +0100, Frederic Weisbecker wrote:
> > I've looked a bit at your traces.
> > I think it's probably too wide to find something inside.
> > Latest -tip is provided with a new set of events tracing, meaning
> > that you will be able to produce function graph traces with various
> > sched events included.
> >
> > Another thing, is it possible to reproduce it with only one ping?
> > Or testing perioding pings and keep only one that raised a relevant
> > threshold of latency? I think we could do a script that can do that.
> > It would make the trace much clearer.
>
> Yeah, I think that should be possible. If you can come up with such a
> script, that would be great.

Ok, I've made a small script based on yours which could do this job.
You will just have to set yourself a threshold of latency
that you consider as buggy. I don't remember the latency you observed.
About 5 secs right?

It's the "thres" variable in the script.

The resulting trace should be a mixup of the function graph traces
and scheduler events which look like this:

gnome-screensav-4691 [000] 6716.774277: 4691:120:S ==> [000] 0:140:R <idle>
xfce4-terminal-4723 [001] 6716.774303: 4723:120:R + [001] 4289:120:S Xorg
xfce4-terminal-4723 [001] 6716.774417: 4723:120:S ==> [001] 4289:120:R Xorg
Xorg-4289 [001] 6716.774427: 4289:120:S ==> [001] 0:140:R <idle>

+ is a wakeup and ==> is a context switch.


The script will loop trying some pings and will only keep the trace that matches
the latency threshold you defined.

Tell if the following script work for you.

You will need to pull the latest -tip tree and enable the following:

CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_EVENT_TRACER=y

Thanks!

Ah and you will need python too (since bash can't do floating point
operation, it uses python here).

#!/bin/bash

# Switch off all CPUs except for one to simplify the trace
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online


# Make sure debugfs has been mounted
if [ ! -d /sys/kernel/debug/tracing ]; then
mount -t debugfs debugfs /sys/kernel/debug
fi

# Set up the trace parameters
pushd /sys/kernel/debug/tracing || exit 1
echo 0 > tracing_enabled
echo function_graph > current_tracer
echo funcgraph-abstime > trace_options
echo funcgraph-proc > trace_options

# Set here the kvm IP addr
addr=""

# Set here a threshold of latency in sec
thres="5"
found="False"
lat=0
prefix=/sys/kernel/debug/tracing

echo 1 > $prefix/events/sched/sched_wakeup/enable
echo 1 > $prefix/events/sched/sched_switch/enable

while [ "$found" != "True" ]
do
# Flush the previous buffer
echo nop > $prefix/current_tracer

# Reset the function_graph tracer
echo function_graph > $prefix/current_tracer

echo 1 > $prefix/tracing_enabled
lat=$(ping -c 1 $addr | grep rtt | grep -Eo " [0-9]+.[0-9]+")
echo 0 > $prefix/tracing_enabled

found=$(python -c "print float(str($lat).strip()) > $thres")
sleep 0.01
done

echo 0 > $prefix/events/sched/sched_wakeup/enable
echo 0 > $prefix/events/sched/sched_switch/enable


echo "Found buggy latency: $lat"
echo "Please send the trace you will find on $prefix/trace"



>
> > Just wait a bit, I'm looking at which event could be relevant to enable
> > and I come back to you with a set of commands to test.
>
> Excellent! Thanks for looking into this.
>
> Cheers,
> Kevin.
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/