RE: One (possible) x86 get_user_pages bug

From: Kaushik Barde
Date: Mon Jan 31 2011 - 15:10:08 EST


<< I'm not sure I follow you here. The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its tlb - but
since it was sleeping there were no stale TLB entries to flush.>>

That's what I was trying to understand: what is "sleep" here? Is it ACPI sleep
or some internal scheduling state? If vCPUs are asynchronous to the pCPU in
terms of ACPI sleep state, then they need to be synced up. That's where the
entire ACPI modeling needs to be considered. That's where KVM may not see this
issue. Maybe I am missing something here.

<< A "few hundred uSecs" is really very slow - that's nearly a
millisecond. It's worth spending some effort to avoid those kinds of
delays.>>

Actually, I just checked: IPIs are usually 1000-1500 cycles long (comparable to
a VMEXIT). My point is that the ideal solution is one where virtual platform
behavior stays closer to bare metal: interrupts, memory, CPU state, etc. How to
do it? Well, that's what needs to be figured out :-)
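
To put the two figures in this thread on the same scale, here is a rough
conversion between cycles and microseconds. The 2 GHz clock rate is purely an
assumption for illustration (it is not stated anywhere in the thread), and the
helper name cycles_to_usec is hypothetical.

```python
def cycles_to_usec(cycles, clock_hz):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6


# Assumed clock rate for illustration only; real hardware varies.
ASSUMED_CLOCK_HZ = 2e9

for cycles in (1000, 1500):
    usec = cycles_to_usec(cycles, ASSUMED_CLOCK_HZ)
    print(f"{cycles} cycles at 2 GHz is about {usec:.2f} uSecs")
```

Under that assumption, 1000-1500 cycles comes to well under one microsecond,
which is a very different scale from a "few hundred uSecs".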

-Kaushik


-----Original Message-----
From: Jeremy Fitzhardinge [mailto:jeremy@xxxxxxxx]
Sent: Monday, January 31, 2011 10:05 AM
To: Kaushik Barde
Cc: 'Avi Kivity'; 'Jan Beulich'; 'Xiaowei Yang'; 'Nick Piggin'; 'Peter
Zijlstra'; fanhenglong@xxxxxxxxxx; 'Kenneth Lee'; 'linqaingmin';
wangzhenguo@xxxxxxxxxx; 'Wu Fengguang'; xen-devel@xxxxxxxxxxxxxxxxxxx;
linux-kernel@xxxxxxxxxxxxxxx; 'Marcelo Tosatti'
Subject: Re: One (possible) x86 get_user_pages bug

On 01/30/2011 02:21 PM, Kaushik Barde wrote:
> I agree, i.e. deviating from the underlying arch is not a good idea.
>
> Also agreed: the hypervisor knows which page entries are ready for a TLB
> flush across vCPUs.
>
> But using the above knowledge together with an IPI-based TLB flush is a
> better solution. The key is its ability to synchronize with the pCPU-based
> IPI and TLB flush across vCPUs.

I'm not sure I follow you here. The issue with TLB flush IPIs is that
the hypervisor doesn't know the purpose of the IPI and ends up
(potentially) waking up a sleeping VCPU just to flush its tlb - but
since it was sleeping there were no stale TLB entries to flush.

Xen's TLB flush hypercalls can optimise that case by only sending a real
IPI to PCPUs which are actually running target VCPUs. In other cases,
where a PCPU is known to have stale entries but it isn't running a
relevant VCPU, it can just mark a deferred TLB flush which gets executed
before the VCPU runs again.
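
The decision described above can be sketched as a small model. This is only an
illustration of the idea, not Xen's implementation: the PCPU class, its fields,
and the flush_tlbs function are all hypothetical names made up for this
example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class PCPU:
    """Hypothetical model of a physical CPU as seen by the hypervisor."""
    ident: int
    running_vcpu: Optional[int] = None                # VCPU currently running here
    stale_for: Set[int] = field(default_factory=set)  # VCPUs with possibly stale entries
    deferred_flush: bool = False                      # flush before next VCPU run


def flush_tlbs(pcpus: List[PCPU], target_vcpus: Set[int]) -> List[int]:
    """Return the PCPU ids that need a real IPI; defer the flush elsewhere."""
    ipi_targets = []
    for p in pcpus:
        if p.running_vcpu in target_vcpus:
            # A target VCPU is running right now: a real IPI is needed.
            ipi_targets.append(p.ident)
        elif p.stale_for & target_vcpus:
            # Stale entries exist but no relevant VCPU is running:
            # mark a deferred flush instead of interrupting the PCPU.
            p.deferred_flush = True
        # Otherwise nothing to do: no IPI and no deferred flush.
    return ipi_targets


pcpus = [
    PCPU(0, running_vcpu=1),
    PCPU(1, running_vcpu=None, stale_for={1}),
    PCPU(2, running_vcpu=5),
]
print(flush_tlbs(pcpus, {1}))
```

In this toy setup only PCPU 0 would receive an IPI, PCPU 1 would get a deferred
flush, and PCPU 2 would be left alone.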

In other words, Xen can take significant advantage of getting a
higher-level call ("flush these TLBs") compared to just a simple IPI.

Are you suggesting that the hypervisor should export some kind of "known
dirty TLB" table to the guest, and have the guest work out which VCPUs
need IPIs sent to them? How would that work?

> IPIs themselves should be a few hundred uSecs in terms of latency. Also, why
> should a pCPU be in a sleep state for an active vCPU's scheduled page workload?

A "few hundred uSecs" is really very slow - that's nearly a
millisecond. It's worth spending some effort to avoid those kinds of
delays.

J

> -Kaushik
>
> -----Original Message-----
> From: Avi Kivity [mailto:avi@xxxxxxxxxx]
> Sent: Sunday, January 30, 2011 5:02 AM
> To: Jeremy Fitzhardinge
> Cc: Jan Beulich; Xiaowei Yang; Nick Piggin; Peter Zijlstra;
> fanhenglong@xxxxxxxxxx; Kaushik Barde; Kenneth Lee; linqaingmin;
> wangzhenguo@xxxxxxxxxx; Wu Fengguang; xen-devel@xxxxxxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Marcelo Tosatti
> Subject: Re: One (possible) x86 get_user_pages bug
>
> On 01/27/2011 08:27 PM, Jeremy Fitzhardinge wrote:
>> And even just considering virtualization, having non-IPI-based tlb
>> shootdown is a measurable performance win, since a hypervisor can
>> optimise away a cross-VCPU shootdown if it knows no physical TLB
>> contains the target VCPU's entries. I can imagine the KVM folks could
>> get some benefit from that as well.
> It's nice to avoid the IPI (and waking up a cpu if it happens to be
> asleep) but I think the risk of deviating too much from the baremetal
> arch is too large, as demonstrated by this bug.
>
> (well, async page faults is a counterexample, I wonder if/when it will
> bite us)
>
