Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

From: Edgecombe, Rick P
Date: Fri Sep 13 2024 - 15:19:44 EST


On Fri, 2024-09-13 at 16:36 +0800, Yan Zhao wrote:
>
Thanks Yan, this is great!

> This is a lock status report of TDX module for current SEAMCALL retry issue
> based on code in TDX module public repo
> https://github.com/intel/tdx-module.git
> branch TDX_1.5.05.
>
> TL;DR:
> - tdh_mem_track() can contend with tdh_vp_enter().
> - tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
> - tdg_mem_page_accept() can contend with other tdh_mem*().
>
> Proposal:
> - Return -EAGAIN directly in ops link_external_spt/set_external_spte when
>   tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.

Regarding the sept entry contention with the guest, I think KVM might not be
guaranteed to retry the same path and clear the sept entry host priority bit.
What if the first failure exited to userspace because of a pending signal or
something? Then the vcpu could reenter the guest, handle an NMI and go off in
another direction, never to trigger the EPT violation again. This would leave
the SEPT entry locked to the guest.

That is a convoluted scenario that could probably be considered a buggy guest,
but what I am sort of pondering is that the retry solution that loop outside the
fault handler guts will have more complex failure modes around the host priority
bit. The N local retries solution really is a brown paper bag design, but the
more proper looking solution actually has two downsides compared to it:
1. It is based on locking behavior that is not in the spec (yes we can work with
TDX module folks to keep it workable)
2. Failure modes get complex

I think I'm still onboard. Just trying to stress the design a bit.

(BTW it looks like Linux guest doesn't actually retry accept on host priority
busy, so they won't spin on it anyway. Probably any contention here would be a
buggy guest for Linux TDs at least.)

> - Kick off vCPUs at the beginning of page removal path, i.e. before the
>   tdh_mem_range_block().
>   Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
>   (one possible optimization:
>    since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
>    do not kick off vCPUs in normal conditions.
>    When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
>    TD enter until page removal completes.)
>
> Below are detailed analysis:
>
> === Background ===
> In TDX module, there are 4 kinds of locks:
> 1. sharex_lock:
>    Normal read/write lock. (no host priority stuff)
>
> 2. sharex_hp_lock:
>    Just like normal read/write lock, except that host can set host priority
> bit
>    on failure.
>    when guest tries to acquire the lock and sees host priority bit set, it
> will
>    return "busy host priority" directly, letting host win.
>    After host acquires the lock successfully, host priority bit is cleared.
>
> 3. sept entry lock:
>    Lock utilizing software bits in SEPT entry.
>    HP bit (Host priority): bit 52
>    EL bit (Entry lock): bit 11, used as a bit lock.
>
>    - host sets HP bit when host fails to acquire EL bit lock;
>    - host resets HP bit when host wins.
>    - guest returns "busy host priority" if HP bit is found set when guest
> tries
>      to acquire EL bit lock.
>
> 4. mutex lock:
>    Lock with only 2 states: free, lock.
>    (not the same as linux mutex, not re-scheduled, could pause() for
> debugging).
>
> ===Resources & users list===
>
> Resources              SHARED  users              EXCLUSIVE users
> ------------------------------------------------------------------------
> (1) TDR                tdh_mng_rdwr               tdh_mng_create
>                        tdh_vp_create              tdh_mng_add_cx
>                        tdh_vp_addcx               tdh_mng_init
>                        tdh_vp_init                tdh_mng_vpflushdone
>                        tdh_vp_enter               tdh_mng_key_config
>                        tdh_vp_flush               tdh_mng_key_freeid
>                        tdh_vp_rd_wr               tdh_mr_extend
>                        tdh_mem_sept_add           tdh_mr_finalize
>                        tdh_mem_sept_remove        tdh_vp_init_apicid
>                        tdh_mem_page_aug           tdh_mem_page_add
>                        tdh_mem_page_remove
>                        tdh_mem_range_block
>                        tdh_mem_track
>                        tdh_mem_range_unblock
>                        tdh_phymem_page_reclaim

In pamt_walk() it calls promote_sharex_lock_hp() with the lock type passed into
pamt_walk(), and tdh_phymem_page_reclaim() passed TDX_LOCK_EXCLUSIVE. So that is
an exclusive lock. But we can ignore it because we only do reclaim at TD tear
down time?

Separately, I wonder if we should try to add this info as comments around the
SEAMCALL implementations. The locking is not part of the spec, but never-the-
less the kernel is being coded against these assumptions. So it can sort of be
like "the kernel assumes this" and we can at least record what the reason was.
Or maybe just comment the parts that KVM assumes.