Re: [PATCH 09/21] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

From: Yan Zhao
Date: Fri Sep 13 2024 - 04:40:07 EST


This is a lock status report of TDX module for current SEAMCALL retry issue
based on code in TDX module public repo https://github.com/intel/tdx-module.git
branch TDX_1.5.05.

TL;DR:
- tdh_mem_track() can contend with tdh_vp_enter().
- tdh_vp_enter() contends with tdh_mem*() when 0-stepping is suspected.
- tdg_mem_page_accept() can contend with other tdh_mem*().

Proposal:
- Return -EAGAIN directly in ops link_external_spt/set_external_spte when
tdh_mem_sept_add()/tdh_mem_page_aug() returns BUSY.
- Kick off vCPUs at the beginning of page removal path, i.e. before the
tdh_mem_range_block().
Set a flag and disallow tdh_vp_enter() until tdh_mem_page_remove() is done.
(one possible optimization:
since contention from tdh_vp_enter()/tdg_mem_page_accept should be rare,
do not kick off vCPUs in normal conditions.
When SEAMCALL BUSY happens, retry for once, kick off vCPUs and do not allow
TD enter until page removal completes.)

Below are detailed analysis:

=== Background ===
In TDX module, there are 4 kinds of locks:
1. sharex_lock:
Normal read/write lock. (no host priority stuff)

2. sharex_hp_lock:
Just like normal read/write lock, except that host can set host priority bit
on failure.
when guest tries to acquire the lock and sees host priority bit set, it will
return "busy host priority" directly, letting host win.
After host acquires the lock successfully, host priority bit is cleared.

3. sept entry lock:
Lock utilizing software bits in SEPT entry.
HP bit (Host priority): bit 52
EL bit (Entry lock): bit 11, used as a bit lock.

- host sets HP bit when host fails to acquire EL bit lock;
- host resets HP bit when host wins.
- guest returns "busy host priority" if HP bit is found set when guest tries
to acquire EL bit lock.

4. mutex lock:
Lock with only 2 states: free, lock.
(not the same as linux mutex, not re-scheduled, could pause() for debugging).

===Resources & users list===

Resources SHARED users EXCLUSIVE users
------------------------------------------------------------------------
(1) TDR tdh_mng_rdwr tdh_mng_create
tdh_vp_create tdh_mng_add_cx
tdh_vp_addcx tdh_mng_init
tdh_vp_init tdh_mng_vpflushdone
tdh_vp_enter tdh_mng_key_config
tdh_vp_flush tdh_mng_key_freeid
tdh_vp_rd_wr tdh_mr_extend
tdh_mem_sept_add tdh_mr_finalize
tdh_mem_sept_remove tdh_vp_init_apicid
tdh_mem_page_aug tdh_mem_page_add
tdh_mem_page_remove
tdh_mem_range_block
tdh_mem_track
tdh_mem_range_unblock
tdh_phymem_page_reclaim
------------------------------------------------------------------------
(2) KOT tdh_phymem_cache_wb tdh_mng_create
tdh_mng_vpflushdone
tdh_mng_key_freeid
------------------------------------------------------------------------
(3) TDCS tdh_mng_rdwr
tdh_vp_create
tdh_vp_addcx
tdh_vp_init
tdh_vp_init_apicid
tdh_vp_enter
tdh_vp_rd_wr
tdh_mem_sept_add
tdh_mem_sept_remove
tdh_mem_page_aug
tdh_mem_page_remove
tdh_mem_range_block
tdh_mem_track
tdh_mem_range_unblock
------------------------------------------------------------------------
(4) TDVPR tdh_vp_rd_wr tdh_vp_create
tdh_vp_addcx
tdh_vp_init
tdh_vp_init_apicid
tdh_vp_enter
tdh_vp_flush
------------------------------------------------------------------------
(5) TDCS epoch tdh_vp_enter tdh_mem_track
------------------------------------------------------------------------
(6) secure_ept_lock tdh_mem_sept_add tdh_vp_enter
tdh_mem_page_aug tdh_mem_sept_remove
tdh_mem_page_remove tdh_mem_range_block
tdh_mem_range_unblock
------------------------------------------------------------------------
(7) SEPT entry tdh_mem_sept_add
tdh_mem_sept_remove
tdh_mem_page_aug
tdh_mem_page_remove
tdh_mem_range_block
tdh_mem_range_unblock
tdg_mem_page_accept

Current KVM interested SEAMCALLs:
------------------------------------------------------------------------
SEAMCALL Lock Name Lock Type Resource
tdh_mng_create sharex_hp_lock EXCLUSIVE TDR
sharex_lock EXCLUSIVE KOT

tdh_mng_add_cx sharex_hp_lock EXCLUSIVE TDR
sharex_hp_lock EXCLUSIVE page to add

tdh_mng_init sharex_hp_lock EXCLUSIVE TDR
sharex_hp_lock NO_LOCK TDCS

tdh_mng_vpflushdone sharex_hp_lock EXCLUSIVE TDR
sharex_lock EXCLUSIVE KOT

tdh_mng_key_config sharex_hp_lock EXCLUSIVE TDR

tdh_mng_key_freeid sharex_hp_lock EXCLUSIVE TDR
sharex_lock EXCLUSIVE KOT

tdh_mng_rdwr sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS

tdh_mr_extend sharex_hp_lock EXCLUSIVE TDR
sharex_hp_lock NO_LOCK TDCS

tdh_mr_finalize sharex_hp_lock EXCLUSIVE TDR
sharex_hp_lock NO_LOCK TDCS

tdh_vp_create sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_hp_lock EXCLUSIVE TDVPR

tdh_vp_addcx sharex_hp_lock EXCLUSIVE TDVPR
sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_hp_lock EXCLUSIVE page to add

tdh_vp_init sharex_hp_lock EXCLUSIVE TDVPR
sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS

tdh_vp_init_apicid sharex_hp_lock EXCLUSIVE TDVPR
sharex_hp_lock EXCLUSIVE TDR
sharex_hp_lock SHARED TDCS

tdh_vp_enter(*) sharex_hp_lock EXCLUSIVE TDVPR
sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_lock SHARED TDCS epoch_lock
sharex_lock EXCLUSIVE TDCS secure_ept_lock

tdh_vp_flush sharex_hp_lock EXCLUSIVE TDVPR
sharex_hp_lock SHARED TDR

tdh_vp_rd_wr sharex_hp_lock SHARED TDVPR
sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS

tdh_mem_sept_add sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_lock SHARED TDCS secure_ept_lock
sept entry lock HOST,EXCLUSIVE SEPT entry to modify
sharex_hp_lock EXCLUSIVE page to add

tdh_mem_sept_remove sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_lock EXCLUSIVE TDCS secure_ept_lock
sept entry lock HOST,EXCLUSIVE SEPT entry to modify
sharex_hp_lock EXCLUSIVE page to remove

tdh_mem_page_add sharex_hp_lock EXCLUSIVE TDR
sharex_hp_lock NO_LOCK TDCS
sharex_lock NO_LOCK TDCS secure_ept_lock
sharex_hp_lock EXCLUSIVE page to add

tdh_mem_page_aug sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_lock SHARED TDCS secure_ept_lock
sept entry lock HOST,EXCLUSIVE SEPT entry to modify
sharex_hp_lock EXCLUSIVE page to aug

tdh_mem_page_remove sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_lock SHARED TDCS secure_ept_lock
sept entry lock HOST,EXCLUSIVE SEPT entry to modify
sharex_hp_lock EXCLUSIVE page to remove

tdh_mem_range_block sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_lock EXCLUSIVE TDCS secure_ept_lock
sept entry lock HOST,EXCLUSIVE SEPT entry to modify

tdh_mem_track sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_lock EXCLUSIVE TDCS epoch_lock

tdh_mem_range_unblock sharex_hp_lock SHARED TDR
sharex_hp_lock SHARED TDCS
sharex_lock EXCLUSIVE TDCS secure_ept_lock
sept entry lock HOST,EXCLUSIVE SEPT entry to modify

tdh_phymem_page_reclaim sharex_hp_lock EXCLUSIVE page to reclaim
sharex_hp_lock SHARED TDR

tdh_phymem_cache_wb mutex_lock per package wbt_entries
sharex_lock SHARED KOT

tdh_phymem_page_wbinvd sharex_hp_lock SHARED page to be wbinvd


Current KVM interested TDCALLs:
------------------------------------------------------------------------
tdg_mem_page_accept sept entry lock GUEST SEPT entry to modify

TDCALLs like tdg_mr_rtmr_extend(), tdg_servtd_rd_wr(), tdg_mem_page_attr_wr()
tdg_mem_page_attr_rd() are not included.

*:(a) tdh_vp_enter() holds shared TDR lock and exclusive TDVPR lock, the two
locks are released when exiting to VMM.
(b) tdh_vp_enter() holds shared TDCS lock and shared TDCS epoch_lock lock,
releases them before entering non-root mode.
(c) tdh_vp_enter() holds shared epoch lock, contending with tdh_mem_track().
(d) tdh_vp_enter() only holds EXCLUSIVE secure_ept_lock when 0-stepping is
suspected, i.e. when last_epf_gpa_list is not empty.
When a EPT violation happens, TDX module checks if the guest RIP equals
to the guest RIP of last TD entry. Only when this is true for 6 continuous
times, the gpa will be recorded in last_epf_gpa_list. The list will be
reset once guest RIP of a EPT violation and last TD enter RIP are unequal.


=== Summary ===
For the 8 kinds of common resources protected in TDX module:

(1) TDR:
There are only shared accesses to TDR during runtime (i.e. after TD is
finalized and before TD tearing down), if we don't need to support calling
tdh_vp_init_apicid() at runtime (e.g. for vCPU hotplug).
tdh_vp_enter() holds shared TDR lock until exiting to VMM.
TDCALLs do not acquire the TDR lock.

(2) KOT (Key Ownership Table)
Current KVM code should have avoided contention to this resource.

(3) TDCS:
Shared accessed or access with no lock when TDR is exclusively locked.
Seamcalls in runtime (after TD finalized and before TD tearing down) do not
contend with each other on TDCS.
tdh_vp_enter() holds shared TDCS lock and releases it before entering
non-root mode.
Current TDCALLs for basic TDX do not acquire this lock.

(4) TDVPR:
Per-vCPU exclusive accessed except for tdh_vp_rd_wr().
tdh_vp_enter() holds exclusive TDVPR lock until exiting to VMM.
TDCALLs do not acquire the TDVPR lock.

(5) TDCS epoch:
tdh_mem_track() requests exclusive access, and tdh_vp_enter() requests
shared access.
tdh_mem_track() can contend with tdh_vp_enter().

(6) SEPT tree:
Protected by secure_ept_lock (sharex_lock).
tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() holds shared
lock; tdh_mem_sept_remove()/tdh_mem_range_block()/tdh_mem_range_unblock()
holds exclusive lock.
tdh_vp_enter() requests exclusive access when 0-stepping is suspected,
contending with all other tdh_mem*().
Guest does not acquire this lock.

So,
kvm mmu_lock has prevented contentions between
tdh_mem_sept_add()/tdh_mem_page_aug()/tdh_mem_page_remove() and
tdh_mem_sept_remove()/tdh_mem_range_block().
Though tdh_mem_sept_add()/tdh_mem_page_aug() races with tdh_vp_enter(),
returning -EAGAIN directly is fine for them.
The remaining issue is the contention between tdh_vp_enter() and
tdh_mem_page_remove()/tdh_mem_sept_remove()/tdh_mem_range_block().

(7) SEPT entry:
All exclusive access.
tdg_mem_page_accept() may contend with other tdh_mem*() on a specific SEPT
entry.

(8) PAMT entry for target pages (e.g. page to add/aug/remove/reclaim/wbinvd):
Though they are all exclusively locked, no race should be met as long as
they belong to different pamt entries.

Conclusion:
Current KVM code should have avoided contentions of resources (1)-(4),(8), while
(5),(6),(7) are still possible to meet contention.
- tdh_mem_track() can contend with tdh_vp_enter() for (5)
- tdh_vp_enter() contends with tdh_mem*() for (6) when 0-stepping is suspected.
- tdg_mem_page_accept() can contend with other tdh_mem*() for (7).