On 4/26/23 10:51, Tom Lendacky wrote:
>> +	/*
>> +	 * native_stop_other_cpus() will write to @stop_cpus_count after
>> +	 * observing that it went down to zero, which will invalidate the
>> +	 * cacheline on this CPU.
>> +	 */
>> +	atomic_dec(&stop_cpus_count);
>
> This is probably going to pull in a cache line and cause the problem the
> native_wbinvd() is trying to avoid.

Is one _more_ cacheline really the problem?
Or is having _any_ cacheline pulled in a problem? What about the text
page containing the WBINVD? How about all the page table pages that are
needed to resolve %RIP to a physical address?
What about the mds_idle_clear_cpu_buffers() code that snuck into
native_halt()?

ffffffff810ede4c:  0f 09                 wbinvd
ffffffff810ede4e:  8b 05 e4 3b a7 02     mov    0x2a73be4(%rip),%eax   # ffffffff83b61a38 <mds_idle_clear>
ffffffff810ede54:  85 c0                 test   %eax,%eax
ffffffff810ede56:  7e 07                 jle    ffffffff810ede5f <stop_this_cpu+0x9f>
ffffffff810ede58:  0f 00 2d b1 75 13 01  verw   0x11375b1(%rip)        # ffffffff82225410 <ds.6688>
ffffffff810ede5f:  f4                    hlt
ffffffff810ede60:  eb ec                 jmp    ffffffff810ede4e <stop_this_cpu+0x8e>
ffffffff810ede62:  e8 59 40 1a 00        callq  ffffffff81291ec0 <trace_hardirqs_off>
ffffffff810ede67:  eb 85                 jmp    ffffffff810eddee <stop_this_cpu+0x2e>
ffffffff810ede69:  0f 1f 80 00 00 00 00  nopl   0x0(%rax)
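
FWIW, to make the handshake in the quoted hunk concrete, here is a rough
userspace model using C11 atomics. Only stop_cpus_count comes from the
patch; arm_stop(), ack_stop() and wait_for_stop() are names made up for
illustration, and the final store is just one reading of the patch
comment, not the actual native_stop_other_cpus() code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the counter named in the quoted hunk. */
static atomic_int stop_cpus_count;

/* Hypothetical: the initiator arms the counter before requesting the stop. */
static void arm_stop(int nr_other_cpus)
{
	atomic_store(&stop_cpus_count, nr_other_cpus);
}

/*
 * Hypothetical: what each stopping CPU does per the hunk. The decrement
 * is a read-modify-write, so this CPU must own the line to perform it.
 */
static void ack_stop(void)
{
	atomic_fetch_sub(&stop_cpus_count, 1);
}

/*
 * Hypothetical: the initiator polls until every CPU has acked or the
 * poll budget runs out. The store after observing zero models the
 * write that the patch comment says invalidates the stopped CPU's copy.
 */
static bool wait_for_stop(unsigned long max_polls)
{
	while (max_polls--) {
		if (atomic_load(&stop_cpus_count) == 0) {
			atomic_store(&stop_cpus_count, 0);
			return true;
		}
	}
	return false;
}
```

The point of the model is only that the ack is a locked RMW on a shared
line, so the stopping CPU has to pull that line in before it reaches the
halt loop, which is the concern raised above.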