Re: [PATCH v3 6/6] x86, mm: Support huge KVA mappings on x86

From: Toshi Kani
Date: Wed Mar 04 2015 - 11:24:19 EST


On Wed, 2015-03-04 at 01:00 +0000, Andrew Morton wrote:
> On Tue, 03 Mar 2015 16:14:32 -0700 Toshi Kani <toshi.kani@xxxxxx> wrote:
>
> > On Tue, 2015-03-03 at 14:44 -0800, Andrew Morton wrote:
> > > On Tue, 3 Mar 2015 10:44:24 -0700 Toshi Kani <toshi.kani@xxxxxx> wrote:
> > :
> > > > +
> > > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > > > +int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> > > > +{
> > > > + u8 mtrr;
> > > > +
> > > > + /*
> > > > + * Do not use a huge page when the range is covered by non-WB type
> > > > + * of MTRRs.
> > > > + */
> > > > + mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);
> > > > + if ((mtrr != MTRR_TYPE_WRBACK) && (mtrr != 0xFF))
> > > > + return 0;
> > >
> > > It would be good to notify the operator in some way when this happens.
> > > Otherwise the kernel will run more slowly and there's no way of knowing
> > > why. I guess slap a pr_info() in there. Or maybe pr_warn()?
> >
> > We only use 4KB mappings today, so this case will not make it run
> > slowly, i.e. it will be the same as today.
>
> Yes, but it would be slower than it would be if the operator fixed the
> mtrr settings! How do we let the operator know this?
>
> > Also, adding a message here
> > can generate a lot of messages when MTRRs cover a large area.
>
> Really? This is only going to happen when a device driver requests a
> huge io mapping, isn't it? That's rare. We could emit a warning,
> return an error code and fall all the way back to the top-level ioremap
> code which can then retry with 4k mappings. Or something similar -
> somehow record the fact that this warning has been emitted or use
> printk ratelimiting (bad option).

Yes, an IO device with a huge MMIO space that is covered by MTRRs is a
rare case. BIOS does not need to specify how MMIO of each card needs to
be accessed with MTRRs (or BIOS should not do it since an MMIO address
is configurable on each card).

However, PCIe has the MMCONFIG space (the PCIe config space), which is
also memory-mapped and must be accessed with UC. The PCI subsystem calls
ioremap_nocache() to map the entire MMCONFIG space, which covers the
PCIe config space of all possible cards. Here are the boot messages on
my test system.

:
PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0xc0000000-0xcfffffff] (base 0xc0000000)
PCI: MMCONFIG at [mem 0xc0000000-0xcfffffff] reserved in E820
:

And MTRRs cover this MMCONFIG space with UC to ensure that the range is
always accessed with UC.

# cat /proc/mtrr
reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable

So, if we add a message to the code, it will be displayed many times
during this ioremap_nocache() call from PCI.
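
To put a rough number on "many times", here is a minimal user-space
sketch. It assumes the ioremap path attempts one huge mapping per
aligned huge-page-sized chunk of the requested range; that call pattern
is an assumption for illustration, and huge_chunks() is a hypothetical
helper, not a function from the patch. The sizes are the usual x86-64
values (2 MiB for a PMD mapping, 1 GiB for a PUD mapping).

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE (2ULL << 20)   /* 2 MiB, x86-64 PMD mapping size */
#define PUD_SIZE (1ULL << 30)   /* 1 GiB, x86-64 PUD mapping size */

/*
 * Hypothetical helper: count the aligned chunks of huge_size bytes that
 * fit entirely inside [start, end). huge_size must be a power of two.
 */
static uint64_t huge_chunks(uint64_t start, uint64_t end, uint64_t huge_size)
{
    assert(start < end);

    /* Round start up and end down to the huge-page boundary. */
    uint64_t first = (start + huge_size - 1) & ~(huge_size - 1);
    uint64_t last = end & ~(huge_size - 1);

    return last > first ? (last - first) / huge_size : 0;
}
```

For the MMCONFIG range above, huge_chunks(0xc0000000, 0xd0000000,
PMD_SIZE) gives 128 and the PUD_SIZE variant gives 0. Under the stated
assumption, an unconditional message in pmd_set_huge() would therefore
print on the order of a hundred times for this one mapping.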

Ideally, pud_set_huge() and pmd_set_huge() should allow using a huge
page mapping when the entire map range is covered by a single MTRR
entry, which is the case with MMCONFIG. But I did not include such
handling in the patch because a UC mapping is slow by itself, MMCONFIG
is only accessed at boot time, and mtrr_type_lookup() does not provide
the level of information necessary.
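
As a sketch of what "the level of information necessary" could look
like, the user-space model below returns the memory type together with
a flag saying whether one entry contains the whole range. Everything in
it is hypothetical: struct mtrr_model, model_lookup() and the TYPE_*
names are invented for illustration and are not the kernel's
mtrr_type_lookup() interface. It also ignores fixed-range MTRRs and
overlapping entries, which a real lookup would have to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TYPE_UC   0x00  /* uncachable */
#define TYPE_WB   0x06  /* write-back */
#define TYPE_NONE 0xFF  /* no entry covers the range */

/* Hypothetical model of one variable-range entry: [base, base + size). */
struct mtrr_model {
    uint64_t base;
    uint64_t size;
    uint8_t type;
};

/*
 * Return the type of the first entry overlapping [start, end), or
 * TYPE_NONE. *single is true when that entry contains the whole range
 * (or when nothing overlaps), so a caller could still pick a huge page.
 */
static uint8_t model_lookup(const struct mtrr_model *m, size_t n,
                            uint64_t start, uint64_t end, bool *single)
{
    assert(start < end);
    *single = true;

    for (size_t i = 0; i < n; i++) {
        uint64_t lo = m[i].base;
        uint64_t hi = m[i].base + m[i].size;

        if (end <= lo || start >= hi)
            continue;               /* no overlap with this entry */

        *single = (start >= lo && end <= hi);
        return m[i].type;
    }
    return TYPE_NONE;
}
```

With the single 1024MB UC entry at 0xc0000000 shown above, a 2 MiB
chunk inside MMCONFIG would report TYPE_UC with *single set, while a
chunk straddling 0xc0000000 would report TYPE_UC with *single cleared.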

Thanks,
-Toshi
