Re: [PATCH] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.

From: Ingo Molnar
Date: Tue Sep 10 2019 - 02:18:30 EST

* Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:

> On Fri, Sep 06, 2019 at 04:29:50PM -0500, Steve Wahl wrote:
> > Our hardware (UV aka Superdome Flex) has address ranges marked
> > reserved by the BIOS. These ranges can cause the system to halt if
> > accessed.
> >
> > During kernel initialization, the processor was speculating into
> > reserved memory causing system halts. The processor speculation is
> > enabled because the reserved memory is being mapped by the kernel.
> >
> > The page table level2_kernel_pgt is 1 GiB in size, and had all pages
> > initially marked as valid, and the kernel is placed anywhere in this
> > range depending on the virtual address selected by KASLR. Later on in
> > the boot process, the valid area gets trimmed back to the space
> > occupied by the kernel.
> >
> > But during the interval of time when the full 1 GiB space was marked
> > as valid, if the kernel physical address chosen by KASLR was close
> > enough to our reserved memory regions, the valid pages outside the
> > actual kernel space were allowing the processor to issue speculative
> > accesses to the reserved space, causing the system to halt.
> >
> > This was encountered somewhat rarely on a normal system boot, and
> > somewhat more often when starting the crash kernel if
> > "crashkernel=512M,high" was specified on the command line (because
> > this heavily restricts the physical address of the crash kernel,
> > usually to within 1 GiB of our reserved space).
> >
> > The answer is to invalidate the pages of this table outside the
> > address range occupied by the kernel before the page table is
> > activated. This patch has been validated to fix this problem on our
> > hardware.
>
> If the goal is to avoid *any* mapping of the reserved region to stop
> speculation, I don't think this patch will do the job. We still (likely)
> have the same memory mapped as part of the identity mapping. And it
> happens in at least two places: here and earlier, in the decompression
> stage.

Yeah, this really needs a fix at the KASLR level: it should only ever map
into regions that are fully RAM backed.

Is the problem that the 1 GiB mapping is a direct mapping, which can be
speculated into? I presume KASLR won't accidentally map the kernel into
the reserved region, right?
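
For reference, the trimming described in the patch changelog can be sketched
in user-space C. This is only an illustration under stated assumptions, not
the code from the patch: it models level2_kernel_pgt as 512 entries of 2 MiB
each (covering 1 GiB), and trim_pmd_to_range() plus the local PMD_* and
PAGE_PRESENT macros are hypothetical names defined here. It clears the
present bit on every entry that lies wholly outside the kernel image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMD_SHIFT     21                          /* each entry maps 2 MiB */
#define PMD_SIZE      (UINT64_C(1) << PMD_SHIFT)
#define PTRS_PER_PMD  512                         /* 512 * 2 MiB = 1 GiB */
#define PAGE_PRESENT  UINT64_C(1)                 /* bit 0: entry is valid */

/*
 * Hypothetical helper: clear the present bit of every entry whose 2 MiB
 * page lies entirely outside [start, end). Both bounds are byte offsets
 * from the base of the 1 GiB window. Returns the number of entries that
 * remain valid.
 */
static size_t trim_pmd_to_range(uint64_t pmd[PTRS_PER_PMD],
                                uint64_t start, uint64_t end)
{
    assert(start < end && end <= PTRS_PER_PMD * PMD_SIZE);

    size_t first = (size_t)(start >> PMD_SHIFT);               /* round down */
    size_t last = (size_t)((end + PMD_SIZE - 1) >> PMD_SHIFT); /* round up, exclusive */
    size_t valid = 0;

    for (size_t i = 0; i < PTRS_PER_PMD; i++) {
        if (i < first || i >= last)
            pmd[i] &= ~PAGE_PRESENT;  /* outside the image: invalidate */
        else if (pmd[i] & PAGE_PRESENT)
            valid++;                  /* inside the image: leave mapped */
    }
    return valid;
}
```

Note that this only models the one table. As Kirill points out above, the
same physical memory may still be reachable through the identity mapping,
which the sketch does not address.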