Re: [PATCH bpf-next] bpf: introduce helper bpf_raw_read_cpu_clock

From: Marcelo Tosatti
Date: Thu Oct 07 2021 - 05:42:24 EST


Hi Peter, Song,

On Thu, Oct 07, 2021 at 09:18:56AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 06, 2021 at 02:37:09PM -0700, Song Liu wrote:
> > On Wed, Oct 6, 2021 at 10:52 AM Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> > >
> > >
> > >
> > > Add bpf_raw_read_cpu_clock helper, to read architecture specific
> > > CPU clock. In x86's case, this is the TSC.
> > >
> > > This is necessary to synchronize bpf traces from host and guest bpf-programs
> > > (after subtracting guest tsc-offset from guest timestamps).
> >
> > Trying to understand the use case. So in a host-guest scenario,
> > bpf_ktime_get_ns()
> > will return different values in host and guest, but rdtsc() will give
> > the same value.
> > Is this correct?
>
> No, it will not.

No, but we can find out the delta between host and guest TSCs.

On x86, you can read the offset through debugfs file:

debugfs_create_file("tsc-offset", 0444, debugfs_dentry, vcpu,
&vcpu_tsc_offset_fops);

Other architectures can expose that offset.

> Also, please explain if any of this stands a chance of
> working for anything other than x86.

Yes, the same pattern repeats

ARM:

With offset between guest and host:
https://developer.arm.com/documentation/ddi0595/2020-12/AArch64-Registers/CNTVCT-EL0--Counter-timer-Virtual-Count-register?lang=en

Without offset:
commit 051ff581ce70e822729e9474941f3c206cbf7436

PPC:
https://yhbt.net/lore/all/5f267a8aec5b8199a580c96ab2b1a3c27de4eb09.camel@xxxxxxxxx/T/

(Time Base Register is read through mftb instruction).

> Or even on x86 in the face of
> guest migration.

It won't, but honestly we don't care about tracing at this level across
migration.

> Also, please explain, again, what's wrong with dumping snapshots of
> CLOCK_MONOTONIC{,_RAW} from host and guest and correlating time that
> way?

You can't read the guest and the host clock at the same time (there will always
be some variable delay between reading the two clocks). And that delay
is not fixed, but variable (depending on scheduling of the guest vcpus,
for example). So you will need an algorithm to estimate their differences,
with non zero error bounds:

"
Add a driver with gettime method returning hosts realtime clock.
This allows Chrony to synchronize host and guest clocks with
high precision (see results below).

chronyc> sources
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
#* PHC0 0 3 377 6 +4ns[ +4ns] +/- 3ns
"

Now with the hardware clock (which is usually the base for CLOCK_MONOTONIC_RAW),
there are no errors (offset will be 0 ns, rather than 3/4ns).

> And also explain why BPF needs to do this differently than all the other
> tracers.

For x86 we use:

echo "x86-tsc" > /sys/kernel/debug/tracing/trace_clock

For this purpose, on x86, so its not like anything different is being
done?