Re: [REGRESSION] x86, perf: counter freezing breaks rr

From: Kyle Huey
Date: Tue Nov 20 2018 - 16:35:15 EST


On Tue, Nov 20, 2018 at 1:19 PM Stephane Eranian <eranian@xxxxxxxxxx> wrote:
>
> On Tue, Nov 20, 2018 at 12:53 PM Kyle Huey <me@xxxxxxxxxxxx> wrote:
> >
> > On Tue, Nov 20, 2018 at 12:11 PM Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:
> > >
> > > > > > Given that we're already at rc3, and that this renders rr unusable,
> > > > > > we'd ask that counter freezing be disabled for the 4.20 release.
> > > > >
> > > > > The boot option should be good enough for the release?
> > > >
> > > > I'm not entirely sure what you mean here. We want you to flip the
> > > > default boot option so this feature is off for this release. i.e. rr
> > > > should work by default on 4.20 and people should have to opt into the
> > > > inaccurate behavior if they want faster PMI servicing.
> > >
> > > I don't think it's inaccurate, it's just different
> > > than what you are used to.
> > >
> > > For profiling including the kernel it's actually far more accurate
> > > because the count is stopped much earlier near the sampling
> > > point. Otherwise there is a considerable over count into
> > > the PMI handler.
> > >
> > > In your case you limit the count to ring 3 so it's always cut off
> > > at the transition point into the kernel, while with freezing
> > > it's at the overflow point.
> >
> > I suppose that's fair that it's better for some use cases. The flip
> > side is that it's no longer possible to get exactly accurate counts
> > from user space if you're using the PMI (because any events between
> > the overflow itself and the transition to the PMI handler are
> > permanently lost) which is catastrophically bad for us :)
> >
> Let me make sure I got this right. During recording, you count on
> retired-cond-branch
> and you record the value of the PMU counter at specific locations,
> e.g., syscalls.
> During replay, you program the branch-conditional-retired to overflow
> on interrupt at
> each recorded values. So if you sampled the event at 1,000,000 and
> then at 1,500,000.
> Then you program the event with a period of 1,000,000 first, on
> overflow the counter interrupts
> and you get a signal. Then, you reprogram the event for a new period
> of 500,000. During recording
> and replay the event is limited to ring 3 (user level). Am I
> understanding this right?

This is largely correct, except that we only program the interrupt for
events that we would not naturally stop at during the course of
execution such as asynchronous signals or context switch points. At
events that we would naturally stop at (i.e. we can stop at syscalls
via ptrace) we simply check that the counters match to find any
discrepancies faster, before they affect an async signal delivery.
Let's say I have the following event sequence:

1. alarm syscall at rbc=1000
2. SIGALARM delivery at rbc=8000
3. exit syscall at rbc=9000

During replay, we begin the program and run to the syscall via a
PTRACE_SYSCALL ptrace. When the replayed process stops, we check that
the value of the rbc counter is 1000 (we also check that all registers
match what we recorded) and then we emulate the effects of the syscall
on the replayed process's registers and memory.

Then we see that the next event is an asynchronous signal, and we
program the rbc counter to interrupt after an additional (8000 - 1000
- SKID_SIZE) events (where SKID_SIZE has been chosen by
experimentation to ensure that the PMU interrupt is not delivered
*after* the point in the program we care about. For Skylake this value
is 100). We then resume the program with a PTRACE_CONT ptrace and wait
for the PMI to stop the replayed tracee. We advance the program to the
exact point that we care about through a combination of breakpoints
and singlestepping, and then deliver the SIGALARM.

Once that is done, we see that the next event is the exit syscall, and
we again do a PTRACE_SYSCALL ptrace to get to it. Once there we check
the rbc counter value and registers match what were recorded, and
perform the syscall.

Our counters are always restricted to ring 3 in both recording and replay.

- Kyle