Re: Populating multiple ptes at fault time
From: Jeremy Fitzhardinge
Date: Thu Sep 25 2008 - 14:32:55 EST
Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>> Avi Kivity wrote:
>>
>>>> The only direct use of pte_young() is in zap_pte_range, within a
>>>> mmu_lazy region. So syncing the A bit state on entering lazy mmu mode
>>>> would work fine there.
>>>>
>>>>
>>> Ugh, leaving lazy pte.a mode when entering lazy mmu mode?
>>>
>>
>> Well, sort of but not quite. The kernel's announcing it's about to start
>> processing a batch of ptes, so the hypervisor can take the opportunity
>> to update their state before processing. "Lazy-mode" is from the
>> perspective of the kernel lazily updating some state the hypervisor
>> might care about, and the sync happens when leaving that mode.
>>
>> The flip-side is when the hypervisor is lazily updating some state the
>> kernel cares about, so it makes sense that the sync happens when the
>> kernel enters its lazy mode. But the analogy isn't very good because we don't
>> really have an explicit notion of "hypervisor lazy mode", or a formal
>> handoff of shared state between the kernel and hypervisor. But in this
>> case the behaviour isn't too bad.
>>
>>
>
> Handwavy. I think the two notions are separate <insert handwavy
> counter-arguments>.
Perhaps this helps:
Context switches between guest<->hypervisor are relatively expensive.
The more work we can make each context switch perform, the better,
because we can amortize the cost. Rather than synchronously switching
between the two every time one wants to express a state change to the
other, we batch those changes up and only sync when it's important.
While there are outstanding batched changes on one side, the other will
have a somewhat out-of-date view of the state. At this level, the idea
of batching is completely symmetrical.
One of the ways we amortize the cost of guest->hypervisor transitions is
by batching multiple pagetable updates together. This works at two
levels: explicitly, within arch_enter/leave_lazy_mmu lazy regions, and
implicitly, because batching is analogous to the architectural
requirement that you must flush the tlb before an update "really"
happens.
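To make the batching concrete, here's a minimal sketch (the helper name
and the surrounding assumptions are mine; arch_enter/leave_lazy_mmu_mode,
set_pte_at and friends are the existing interfaces):

#include <linux/mm.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Sketch only: clear the accessed bit over a range of present ptes,
 * with the individual updates batched under the lazy mmu hooks.
 * Under Xen the set_pte_at() calls queue up and are issued to the
 * hypervisor as one batch at arch_leave_lazy_mmu_mode(); on bare
 * hardware the hooks are no-ops and the stores happen immediately.
 * Assumes the caller holds the pagetable lock and that ptep covers
 * [start, end); a real implementation would use
 * ptep_test_and_clear_young() rather than this non-atomic
 * read-modify-write, but that would hide the batching being shown.
 */
static void clear_young_range(struct mm_struct *mm, unsigned long start,
			      unsigned long end, pte_t *ptep)
{
	unsigned long addr;

	arch_enter_lazy_mmu_mode();
	for (addr = start; addr < end; addr += PAGE_SIZE, ptep++)
		set_pte_at(mm, addr, ptep, pte_mkold(*ptep));
	arch_leave_lazy_mmu_mode();

	/* as with any pte change, it only "really" takes effect once
	 * stale tlb entries are flushed */
	flush_tlb_mm(mm);
}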
KVM - and other shadow pagetable implementations - has the additional
problem of transmitting A/D state updates from the shadow pagetable into
the guest pagetable. Doing this synchronously has the costs we've been
discussing in this thread (namely, extra faults we would like to
avoid). Doing this in a deferred or batched way is awkward because
there's no analogous architectural asynchrony in updating these pte
flags, and we don't have any existing mechanisms or hooks to support
this kind of deferred update.
However, given that we're talking about cleaning up the pagetable api
anyway, there's no reason we couldn't incorporate this kind of deferred
update in a more formal way. It definitely makes sense when you have
shadow pagetables, and it probably makes sense on other architectures too.
Very few places actually care about the state of the A/D bits; would it
be expensive to make those places explicitly ask for synchronization
before testing the bits (or, alternatively, to have an explicit query
operation rather than just poking about in the ptes)? Martin, does this
help with s390's per-page (vs per-pte) A/D state?
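Something along these lines, for instance (entirely hypothetical:
pte_query_young() and the arch_sync_pte_state() hook are made-up names
to show the shape of an explicit query operation; pte_young() is the
existing test):

/*
 * Hypothetical sketch: an explicit query operation for the accessed
 * bit.  arch_sync_pte_state() is a made-up hook where a shadow
 * pagetable implementation (or s390's per-page reference state)
 * could fold any deferred A/D information into the pte, or answer
 * the question directly, before generic code looks at the bit.  On
 * bare x86 it would compile away to nothing.
 */
static inline int pte_query_young(struct mm_struct *mm,
				  unsigned long addr, pte_t *ptep)
{
	arch_sync_pte_state(mm, addr, ptep);	/* hypothetical hook */
	return pte_young(*ptep);
}

Callers that currently test pte_young() directly (zap_pte_range being
the main one) would switch to something like this, giving the
hypervisor one well-defined point at which to catch up.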
J