Re: perf: fuzzer crashes immediately on AMD system

From: Vince Weaver
Date: Mon Aug 22 2016 - 21:10:49 EST


On Mon, 22 Aug 2016, Huang Rui wrote:

> Hi Peter, Vince
>
> On Fri, Aug 19, 2016 at 12:01:30PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 18, 2016 at 10:46:31AM -0400, Vince Weaver wrote:
> > > On Thu, 18 Aug 2016, Vince Weaver wrote:
> > >
> > > > Tried the perf_fuzzer on my A10 fam15h/model13h system with 4.8-rc2 and it
> > > > falls over more or less immediately.
> > > >
> > > > This maps to variable_test_bit()
> > > > called by ctx = find_get_context(pmu, task, event);
> > > > in kernel/events/core.c:9467
> > > >
> > > > It happens quickly enough I can probably track down the exact event that
> > > > causes this, if needed.
> > >
> > > I have a one line reproducer:
> > >
> > > perf stat -a -e amd_nb/config=0x37,config1=0x20/ /bin/ls
> >
> > OK, cannot reproduce on my fam15h/model1h. I'll go dig through the
> > various manuals to see if I can spot the fail.
> >
> > Huang could you either prod someone at AMD or do yourself, audit the AMD
> > perf code for all the various new models?
>
> Actually, there might be some NBPMC event changes between model 0h-fh and
> model 10h-1fh. Below are the documents of these two processors:
>
> http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
> http://support.amd.com/TechDocs/42300_15h_Mod_10h-1Fh_BKDG.pdf
>
> In section 3.16, it describes usage of NB Performance Counter Events.

I don't think it's the hardware that's causing the problem.

I've wasted a lot more time on it, and finally figured out how the "bt"
instruction works, so the assembly more or less makes sense.

The problem is that the per-cpu amd_uncore struct is being overwritten
with kernel memory addresses.

This makes uncore[0]->cpu a large number (it's often, but not always, the
per-cpu address of uncore[1]->cpu) which leads to the GPF.
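
To illustrate why a bogus cpu value ends in a fault: a test_bit-style
lookup (which is what "bt" with a memory operand does) picks the word to
read from the bit number itself, so a huge bit number means a read far
outside the mask. A rough userspace sketch, not the kernel code;
DEMO_NR_CPUS, the helper names and the example value in the comment are
made up for illustration:

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_NR_CPUS  64UL   /* hypothetical size of the cpu mask, in bits */

/* Index of the unsigned long that a test_bit-style lookup would read. */
static size_t demo_bit_to_word(unsigned long nr)
{
    return nr / BITS_PER_LONG;
}

/* Nonzero if the bit number still falls inside the demo mask. */
static int demo_bit_in_mask(unsigned long nr)
{
    return nr < DEMO_NR_CPUS;
}

/*
 * A sane cpu number such as 3 stays in word 0 of the mask. An
 * address-like value such as 0xffff8800 (made up here) selects a word
 * tens of millions of longs past the end of the mask, which is the
 * kind of access that can end in a GPF.
 */
```

Nothing here dereferences anything; it only shows how far the lookup
would reach.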

I can't figure out what piece of code is overwriting things though.

And to complicate things further, I think the
amd_uncore_find_online_sibling() function is broken. The code could
really use more comments, but I think it is designed so that all siblings
share a single amd_uncore structure. In practice it looks like this
doesn't work, due to the way the list iterator works.
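
For reference, this is roughly the sharing scheme I think the code is
aiming for, as a userspace sketch. It is my guess at the intent and not
the kernel implementation; every name below (demo_uncore, demo_slots,
demo_cpu_online, and so on) is invented for illustration:

```c
#include <stddef.h>

#define DEMO_MAX_CPUS 8

struct demo_uncore {
    int id;      /* which node / compute unit this struct describes */
    int cpu;     /* cpu that owns the events */
    int refcnt;  /* number of cpus sharing this struct */
};

/* One slot per cpu; siblings should end up pointing at the same struct. */
static struct demo_uncore *demo_slots[DEMO_MAX_CPUS];

/*
 * Walk the other cpus looking for one that already has a struct with
 * the same id. If found, share it; otherwise keep our own.
 */
static struct demo_uncore *demo_find_sibling(struct demo_uncore *this_one,
                                             int this_cpu)
{
    for (int cpu = 0; cpu < DEMO_MAX_CPUS; cpu++) {
        struct demo_uncore *that = demo_slots[cpu];

        if (cpu == this_cpu || that == NULL || that == this_one)
            continue;
        if (that->id == this_one->id) {
            that->refcnt++;
            return that;
        }
    }
    this_one->refcnt++;
    return this_one;
}

/* Called as each cpu comes online with a freshly set up struct. */
static struct demo_uncore *demo_cpu_online(struct demo_uncore *fresh, int cpu)
{
    demo_slots[cpu] = demo_find_sibling(fresh, cpu);
    return demo_slots[cpu];
}
```

With this, two cpus with the same id end up with one struct and a
refcnt of 2. If the iteration never actually finds the sibling, each cpu
keeps its own copy, which is what I suspect is happening.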

Vince