Re: [PATCH v11 19/20] x86/kvmclock: Skip kvmclock when Secure TSC is available
From: Nikunj A. Dadhania
Date: Fri Sep 20 2024 - 01:15:41 EST
On 9/18/2024 5:37 PM, Sean Christopherson wrote:
> On Mon, Sep 16, 2024, Nikunj A. Dadhania wrote:
>> On 9/13/2024 11:00 PM, Sean Christopherson wrote:
>>>> Signed-off-by: Nikunj A Dadhania <nikunj@xxxxxxx>
>>>> Tested-by: Peter Gonda <pgonda@xxxxxxxxxx>
>>>> ---
>>>> arch/x86/kernel/kvmclock.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
>>>> index 5b2c15214a6b..3d03b4c937b9 100644
>>>> --- a/arch/x86/kernel/kvmclock.c
>>>> +++ b/arch/x86/kernel/kvmclock.c
>>>> @@ -289,7 +289,7 @@ void __init kvmclock_init(void)
>>>> {
>>>> u8 flags;
>>>>
>>>> - if (!kvm_para_available() || !kvmclock)
>>>> + if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SECURE_TSC))
>>>
>>> I would much prefer we solve the kvmclock vs. TSC fight in a generic way. Unless
>>> I've missed something, the fact that the TSC is more trusted in the SNP/TDX world
>>> is simply what's forcing the issue, but it's not actually the reason why Linux
>>> should prefer the TSC over kvmclock. The underlying reason is that platforms that
>>> support SNP/TDX are guaranteed to have a stable, always running TSC, i.e. that the
>>> TSC is a superior timesource purely from a functionality perspective. That it's
>>> more secure is icing on the cake.
>>
>> Are you suggesting that whenever the guest is either SNP or TDX, kvmclock
>> should be disabled assuming that timesource is stable and always running?
>
> No, I'm saying that the guest should prefer the raw TSC over kvmclock if the TSC
> is stable, irrespective of SNP or TDX. This is effectively already done for the
> timekeeping base (see commit 7539b174aef4 ("x86: kvmguest: use TSC clocksource if
> invariant TSC is exposed")), but the scheduler still uses kvmclock thanks to the
> kvm_sched_clock_init() code.
The kvm-clock and tsc-early both are having the rating of 299. As they are of
same rating, kvm-clock is being picked up first.
Is it fine to drop the clock rating of kvmclock to 298 ? With this tsc-early will
be picked up instead.
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index fafdbf813ae3..1982cee74354 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -289,7 +289,7 @@ void __init kvmclock_init(void)
{
u8 flags;
- if (!kvm_para_available() || !kvmclock || cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
+ if (!kvm_para_available() || !kvmclock)
return;
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
@@ -342,7 +342,7 @@ void __init kvmclock_init(void)
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
!check_tsc_unstable())
- kvm_clock.rating = 299;
+ kvm_clock.rating = 298;
clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
pv_info.name = "KVM";
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000001] kvm-clock: using sched offset of 6630881179920185 cycles
[ 0.001266] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.005347] tsc: Detected 1996.247 MHz processor
[ 0.263100] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x398caa9ddcb, max_idle_ns: 881590739785 ns
[ 0.980456] clocksource: Switched to clocksource tsc-early
[ 1.059332] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x398caa9ddcb, max_idle_ns: 881590739785 ns
[ 1.062094] clocksource: Switched to clocksource tsc
> The other aspect of this to consider is wallclock. If I'm reading the code
> correctly, _completely_ disabling kvmclock will case the kernel to keep using the
> RTC for wallclock. Using the RTC is an amusingly bad decision for SNP and TDX
> (and regular VMs), as the RTC is a slooow emulation path and it's still very much
> controlled by the untrusted host.
Right, this is not expected.
> Unless you have a better idea for what to do with wallclock, I think the right
> approach is to come up a cleaner way to prefer TSC over kvmclock for timekeeping
> and the scheduler, but leave wallclock as-is. And then for SNP and TDX, "assert"
> that the TSC is being used instead of kvmclock. Presumably, all SNP and TDX
> hosts provide a stable TSC, so there's probably no reason for the guest to even
> care if the TSC is "secure".
>
> Note, past me missed the wallclock side of things[*], so I don't think hiding
> kvmclock entirely is the best solution.
>
> [*] https://lore.kernel.org/all/ZOjF2DMBgW%2FzVvL3@xxxxxxxxxx
Regards
Nikunj