Re: [PATCH 1/2] x86/arch_prctl: add ARCH_SET_{COMPAT,NATIVE} to change compatible mode
From: Andy Lutomirski
Date: Thu Apr 21 2016 - 19:46:39 EST
On Thu, Apr 21, 2016 at 4:27 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Thu, Apr 21, 2016 at 1:12 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> On Thu, Apr 21, 2016 at 12:39:42PM -0700, Andy Lutomirski wrote:
>>> On Wed, Apr 20, 2016 at 12:05 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>> > On Wed, Apr 20, 2016 at 08:40:23AM -0700, Andy Lutomirski wrote:
>>
>>> >> >> Peter, I got lost in the code that calls this. Are regs coming from
>>> >> >> the overflow interrupt's regs, current_pt_regs(), or
>>> >> >> perf_get_regs_user?
>>> >> >
>>> >> > So get_perf_callchain() will get regs from:
>>> >> >
>>> >> > - interrupt/NMI regs
>>> >> > - perf_arch_fetch_caller_regs()
>>> >> >
>>> >> > And when user && !user_mode(), we'll use:
>>> >> >
>>> >> > - task_pt_regs() (which arguably should maybe be perf_get_regs_user())
>>> >>
>>> >> Could you point me to this bit of the code?
>>> >
>>> > kernel/events/callchain.c:198
>>>
>>> But that only applies to the callchain code, right?
>>
>> Yes, which is what I thought you were after..
>>
>>> AFAICS the PEBS
>>> code is invoked through the x86_pmu NMI handler and always gets the
>>> IRQ regs. Except for this case:
>>>
>>> static inline void intel_pmu_drain_pebs_buffer(void)
>>> {
>>> 	struct pt_regs regs;
>>>
>>> 	x86_pmu.drain_pebs(&regs);
>>> }
>>>
>>> which seems a bit confused.
>>
>> Yes, so that only gets used with 'large' pebs, which requires no other
>> flags than PEBS_FREERUNNING_FLAGS, which precludes the regs set from
>> being used.
>>
>> Could definitely use a comment.
>>
>>> I don't suppose we could arrange to pass something consistent into the
>>> PEBS handlers...
>>>
>>> Or is the PEBS code being called from the callchain code somehow?
>>
>> No. I think we were/are slightly talking past one another.
>>
>>> >> One call to perf_get_regs_user per interrupt shouldn't be too bad --
>>> >> certainly much better than one per PEBS record. One call to get user
>>> >> ABI per overflow would be even less bad, but at that point, folding it
>>> >> in to the PEBS code wouldn't be so bad either.
>>> >
>>> > Right; although note that the whole fixup_ip() thing requires a single
>>> > record per interrupt (for we need the LBR state for each record in order
>>> > to rewind).
>>>
>>> So do earlier PEBS events not get rewound? Or do we just program the
>>> thing to only ever give us one event at a time?
>>
>> The latter; we program PEBS such that it can hold but a single record
>> and thereby assure we get an interrupt for each record.
>>
>>> > The problem here is that the overflow stuff is designed for a single
>>> > 'event' per interrupt, so passing it data for multiple events is
>>> > somewhat icky.
>>>
>>> It also seems that there's a certain amount of confusion as to exactly
>>> what "regs" means in various contexts. Or at least I'm confused by
>>> it.
>>
>> Yes, there's too much regs.
>>
>> Typically 'regs' is the 'interrupt'/'event' regs, that is the register
>> set at eventing time. For sampling hardware PMUs this is NMI/IRQ like
>> things, for software events this ends up being
>> perf_arch_fetch_caller_regs().
>>
>> Then there's PERF_SAMPLE_REGS_USER|PERF_SAMPLE_STACK_USER, which, for
>> each event with it set, use perf_get_regs_user() to dump the thing into
>> our ringbuffer as part of the event record.
>>
>> And then there's the callchain code, which first unwinds kernel space if
>> the 'interrupt'/'event' reg set points into the kernel, and then uses
>> task_pt_regs() (which I think we agree should be perf_get_regs_user())
>> to obtain the user regs to continue with the user stack unwind.
>>
>> Finally there's PERF_SAMPLE_REGS_INTR, which dumps whatever
>> 'interrupt/event' regs we get into the ringbuffer sample record.
>>
>>
>> Did that help? Or did I confuse you moar?
>>
>
> I think I'm starting to get it. What if we rearrange slightly, like this:
>
I started fiddling to see what's involved, then I got to this:

	if (sample_type & PERF_SAMPLE_REGS_INTR) {
		u64 abi = data->regs_intr.abi;
		/*
		 * If there are no regs to dump, notice it through
		 * first u64 being zero (PERF_SAMPLE_REGS_ABI_NONE).
		 */
		perf_output_put(handle, abi);

		if (abi) {
			u64 mask = event->attr.sample_regs_intr;

			perf_output_sample_regs(handle,
						data->regs_intr.regs,
						mask);
		}
	}

regs_intr.abi comes from perf_regs_abi(current), which, on x86_64 or
arm64, may indicate 32-bit regs, but the actual regs are always
64-bit. Am I just confused or is this a bug?
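
To make the question concrete, here is a minimal user-space sketch of the
suspected mismatch. It is not kernel code: every model_* name is invented
for illustration, and only the three ABI values are meant to mirror
enum perf_sample_regs_abi. It assumes, as described above, that the ABI tag
is derived from the task alone while the dumped registers are always
64-bit slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same numeric values as enum perf_sample_regs_abi in the perf UAPI. */
enum model_regs_abi {
	MODEL_REGS_ABI_NONE = 0,
	MODEL_REGS_ABI_32 = 1,
	MODEL_REGS_ABI_64 = 2,
};

/* Hypothetical stand-in for a task; only records whether it is 32-bit. */
struct model_task {
	bool compat;
};

/* Hypothetical sample header: the ABI tag plus the width of each dumped reg. */
struct model_sample {
	uint64_t abi;
	size_t reg_bytes;
};

/* Models an ABI tag chosen from the task alone, ignoring the regs. */
static uint64_t model_abi_from_task(const struct model_task *task)
{
	return task->compat ? MODEL_REGS_ABI_32 : MODEL_REGS_ABI_64;
}

/*
 * Models the dump path under the assumption stated above: the tag follows
 * the task, but each register is written as a 64-bit slot regardless.
 */
static struct model_sample model_emit_regs_intr(const struct model_task *task)
{
	struct model_sample s;

	assert(task != NULL);
	s.abi = model_abi_from_task(task);
	s.reg_bytes = sizeof(uint64_t);
	return s;
}

/* True when the tag claims 32-bit registers but the payload is 64-bit. */
static bool model_sample_mismatched(const struct model_sample *s)
{
	return s->abi == MODEL_REGS_ABI_32 && s->reg_bytes == sizeof(uint64_t);
}
```

In this model a compat task always produces a sample tagged as 32-bit with
a 64-bit payload, while a native task does not. Whether the real code
behaves the same way is exactly the open question.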
--Andy