Re: [GIT PULL] x86/mm changes for v5.8
From: Linus Torvalds
Date: Tue Jun 02 2020 - 15:15:08 EST
On Tue, Jun 2, 2020 at 11:29 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> It's trivial enough to fix. We have a static key already which is
> telling us whether SMT scheduling is active.
.. but should we do it here, in switch_mm() in the first place?
Should it perhaps just return an error if user land tries to set the
"flush L1$" thing on an SMT system? And no, I don't think we care at
all if people then start playing games and enabling/disabling SMT
dynamically while applications are working. At that point the admin
gets to keep both of the broken pieces.
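The check being suggested — fail the prctl up front instead of testing a static key in switch_mm() — could look roughly like this. sched_smt_active() is the real kernel helper (backed by a static key); here it is stubbed so the logic can run in userspace, and the enable/disable interface is only illustrative, not the patch's actual prctl name:

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the kernel's static-key-backed sched_smt_active(). */
static bool smt_active;

static bool sched_smt_active(void)
{
    return smt_active;
}

/*
 * Sketch of rejecting the "flush L1$" request at prctl time:
 * with SMT on, the sibling can repopulate L1D immediately, so the
 * flush buys nothing and the request is refused outright.
 */
static int l1d_flush_prctl_set(bool enable)
{
    if (enable && sched_smt_active())
        return -EINVAL;
    /* ... record the per-task flush flag here ... */
    return 0;
}
```

If the admin later flips SMT on under a running application, that application simply keeps its now-pointless flag — per the point above, that is the admin's problem, not switch_mm()'s.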
Also, see my other point about how "switch_mm()" really isn't even a
protection domain switch to begin with. We're still in the exact same
protection domain we used to be in, with the exact same direct access
to L1D$.
Why would we flush the caches on a random and irrelevant event in
kernel space? switch_mm() simply isn't relevant for caches (well,
unless you have fully virtual ones, but that's a completely different
issue).
Wouldn't it be more sensible to treat it more like TIF_NEED_FPU_LOAD -
have a per-cpu "I need to flush the cache" variable, and then the only
thing a context switch does is to see if the user changed (or
whatever) and then set the bit, and set TIF_NOTIFY_RESUME in the
thread.
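The TIF_NEED_FPU_LOAD-style scheme described above can be modeled as: context switch only records that a flush is owed, and the flush itself runs on the way back to user space. All names below are illustrative stand-ins for the per-cpu variable and TIF_NOTIFY_RESUME machinery, not the kernel's actual implementation:

```c
#include <stdbool.h>

static bool cpu_needs_l1d_flush;  /* models a per-cpu "flush owed" flag */
static bool tif_notify_resume;    /* models TIF_NOTIFY_RESUME */
static int  flushes_done;         /* counts flushes, for illustration */

/* Modeled context switch: cheap bookkeeping only, no cache flushing. */
static void switch_mm_check(int prev_mm, int next_mm, bool next_wants_flush)
{
    if (prev_mm != next_mm && next_wants_flush) {
        cpu_needs_l1d_flush = true;
        tif_notify_resume = true;
    }
}

/* Modeled exit to user space: this is where the flush actually runs. */
static void exit_to_user_mode(void)
{
    if (tif_notify_resume && cpu_needs_l1d_flush) {
        flushes_done++;           /* the real L1D$ flush would go here */
        cpu_needs_l1d_flush = false;
    }
    tif_notify_resume = false;
}
```

The design point this models: the expensive operation is deferred to the one boundary that actually matters for the threat model (kernel-to-user), and a switch to a kernel thread or back to the same mm never pays for it.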
Because the L1D$ flush isn't a kernel issue, it's a "don't let user
space try to attack it" issue. The kernel can already read it if it
wants to.
And that's just the "big picture" issues I see. In the big picture,
doing this when SMT is enabled is unbelievably stupid. And in the big
picture, context switch really isn't a security domain change wrt the
L1D$.
The more I look at those patches, the more I go "that's just wrong" on
some of the "small picture" implementation details.
Here are just a few cases I reacted to:
Actual low-level flushing code:
(1) the SW fallback
(a) is only defined on Intel, and initializing the silly cache
flush pages on any other vendor will fail.
(b) seems to assume that 16 pages (order-4) is sufficient and
necessary. Probably "good enough", but it's also an example of "yeah,
that's expensive".
(c) and if I read the code correctly, trying to flush the L1D$ on
non-Intel without the HW support causes a WARN_ON_ONCE()! WTF?
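For reference, the SW fallback being criticized in (1) is a displacement flush: write through a buffer large enough to evict everything in L1D. A userspace model of it, using the sizes the patch hardcodes (16 pages, i.e. an order-4 allocation), looks like this — and whether 64 KiB is actually sufficient depends on the L1D size and associativity of the particular CPU, which is exactly the "assumes sufficient and necessary" complaint:

```c
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NR_PAGES  16    /* the order-4 allocation the patch uses */
#define CACHELINE 64

static unsigned char *l1d_flush_pages;

/* Allocate the flush buffer; returns 0 on success, -1 on failure. */
static int l1d_flush_sw_init(void)
{
    l1d_flush_pages = malloc((size_t)NR_PAGES * PAGE_SIZE);
    return l1d_flush_pages ? 0 : -1;
}

/* One store per cache line, so every L1D line gets displaced. */
static void l1d_flush_sw(void)
{
    for (size_t i = 0; i < (size_t)NR_PAGES * PAGE_SIZE; i += CACHELINE)
        l1d_flush_pages[i] = (unsigned char)i;
}
```

The "that's expensive" point falls out directly: this is 1024 stores plus the cache misses they cause, on every invocation.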
(2) the HW case is done for any vendor, if the CPU reports the "I
have the MSR" capability bit
(3) the VMX support certainly has various sanity checks like "oh, CPU
doesn't have X86_BUG_L1TF, then I won't do this even if there was some
kernel command line to say I should". But the new prctl doesn't have
anything like that. It just enables that L1D$ thing mindlessly,
thinking that user-land software somehow knows what it's doing. BS.
(4) what does L1D_FLUSH_POPULATE_TLB mean?
That "option" makes zero sense. It pre-populates the TLB before
doing the accesses to the L1D$ pages in the SW case, but nothing at
all explains why that is needed. It's clearly not needed for the
caller, since the TLB population only happens for the SW fallback, not
for the HW one.
No documentation, no nothing. It's enabled for the VMX case, not
for the non-vmx case, which makes me suspect it's some crazy "work
around vm monitor page faults, because we know our SW flush fallback
is just random garbage".
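For what it's worth, what L1D_FLUSH_POPULATE_TLB appears to mean mechanically is a read pass that touches each flush page once (filling the TLB, and taking any faults) before the write pass does the displacement. Why the VMX caller would need that is exactly what the patches never document; the sketch below is only a guess at the mechanism, modeled in userspace, not a justification for it:

```c
#include <stddef.h>

#define PAGE_SIZE 4096
#define NR_PAGES  16

/*
 * Touch each page once so its translation is resident before the
 * SW flush's write pass runs. Returns the number of pages touched,
 * purely so the behavior is observable in this model.
 */
static size_t l1d_flush_populate_tlb(const unsigned char *pages)
{
    volatile unsigned char sink;
    size_t touched = 0;

    for (size_t i = 0; i < (size_t)NR_PAGES * PAGE_SIZE; i += PAGE_SIZE) {
        sink = pages[i];    /* one read per page */
        touched++;
    }
    (void)sink;
    return touched;
}
```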
In other words, there are those big "high-level design" questions, but
also several oddities in just the implementation details.
I really get the feeling that this feature just isn't ready.
Ingo - would you mind sending me a pull request for the (independent)
TLB cleanups part of that x86/mm tree? Because everything up to and
including commit bd1de2a7aace ("x86/tlb/uv: Add a forward declaration
for struct flush_tlb_info") looks sane.
It's only this L1D$ flushing thing at the end of that branch that I
think isn't fully cooked.
Linus