Re: [PATCH] x86/pat: Fix off-by-one bugs in interval tree search

From: Mariusz Ceier
Date: Sun Dec 01 2019 - 11:09:31 EST


Your patch fixes the performance issue on my system, and afterwards the
/sys/kernel/debug/x86/pat_memtype_list contents are:

PAT memtype list:
write-back @ 0x55ba4000-0x55ba5000
write-back @ 0x5e88c000-0x5e8b5000
write-back @ 0x5e8b4000-0x5e8b5000
write-back @ 0x5e8b4000-0x5e8b8000
write-back @ 0x5e8b7000-0x5e8bb000
write-back @ 0x5e8ba000-0x5e8bc000
write-back @ 0x5e8bb000-0x5e8be000
write-back @ 0x5e8bd000-0x5e8bf000
write-back @ 0x5e8be000-0x5e8c2000
write-back @ 0x5ef3c000-0x5ef3d000
write-back @ 0x5ef6c000-0x5ef6d000
write-back @ 0x5ef6f000-0x5ef70000
write-back @ 0x5ef72000-0x5ef73000
write-back @ 0x5f5b3000-0x5f5b5000
uncached-minus @ 0xe3f00000-0xe3f10000
uncached-minus @ 0xec000000-0xec040000
uncached-minus @ 0xec002000-0xec003000
uncached-minus @ 0xec110000-0xec111000
uncached-minus @ 0xec200000-0xec240000
uncached-minus @ 0xec260000-0xec264000
uncached-minus @ 0xec300000-0xec320000
uncached-minus @ 0xec326000-0xec327000
uncached-minus @ 0xf0000000-0xf8000000
uncached-minus @ 0xf0000000-0xf0001000
uncached-minus @ 0xfdc43000-0xfdc44000
uncached-minus @ 0xfe000000-0xfe001000
uncached-minus @ 0xfed00000-0xfed01000
uncached-minus @ 0xfed10000-0xfed16000
uncached-minus @ 0xfed90000-0xfed91000
write-combining @ 0x2000000000-0x2100000000
write-combining @ 0x2000000000-0x2100000000
uncached-minus @ 0x2100000000-0x2100001000
uncached-minus @ 0x2100001000-0x2100002000
uncached-minus @ 0x2ffff10000-0x2ffff20000
uncached-minus @ 0x2ffff20000-0x2ffff24000

It's very similar to the pat_memtype_list contents after reverting the 4
x86/mm/pat patches affecting performance:

@@ -1,8 +1,8 @@
PAT memtype list:
write-back @ 0x55ba4000-0x55ba5000
write-back @ 0x5e88c000-0x5e8b5000
-write-back @ 0x5e8b4000-0x5e8b8000
write-back @ 0x5e8b4000-0x5e8b5000
+write-back @ 0x5e8b4000-0x5e8b8000
write-back @ 0x5e8b7000-0x5e8bb000
write-back @ 0x5e8ba000-0x5e8bc000
write-back @ 0x5e8bb000-0x5e8be000
@@ -21,8 +21,8 @@
uncached-minus @ 0xec260000-0xec264000
uncached-minus @ 0xec300000-0xec320000
uncached-minus @ 0xec326000-0xec327000
-uncached-minus @ 0xf0000000-0xf0001000
uncached-minus @ 0xf0000000-0xf8000000
+uncached-minus @ 0xf0000000-0xf0001000
uncached-minus @ 0xfdc43000-0xfdc44000
uncached-minus @ 0xfe000000-0xfe001000
uncached-minus @ 0xfed00000-0xfed01000

Best regards,
Mariusz Ceier

On Sun, 1 Dec 2019 at 14:49, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>
> * Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> > * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > > But the final difference is a real difference where it used to be WC,
> > > and is now UC-:
> > >
> > > -write-combining @ 0x2000000000-0x2100000000
> > > -write-combining @ 0x2000000000-0x2100000000
> > > +uncached-minus @ 0x2000000000-0x2100000000
> > > +uncached-minus @ 0x2000000000-0x2100000000
> > >
> > > which certainly could easily explain the huge performance degradation.
>
> > It's not an unconditional regression: both Boris and I tried to
> > reproduce it on different systems that do ioremap_wc() as well and didn't
> > measure a slowdown - but something about the memory layout probably
> > triggers the tree management bug.
>
> Ok, I think I found at least one bug in the new PAT code: the conversion
> of memtype_check_conflict() looks buggy:
>
> 8d04a5f97a5f: ("x86/mm/pat: Convert the PAT tree to a generic interval tree")
>
> dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
> found_type = match->type;
>
> - node = rb_next(&match->rb);
> - while (node) {
> - match = rb_entry(node, struct memtype, rb);
> -
> - if (match->start >= end) /* Checked all possible matches */
> - goto success;
> -
> - if (is_node_overlap(match, start, end) &&
> - match->type != found_type) {
> + match = memtype_interval_iter_next(match, start, end);
> + while (match) {
> + if (match->type != found_type)
> goto failure;
> - }
>
> - node = rb_next(&match->rb);
> + match = memtype_interval_iter_next(match, start, end);
> }
>
>
> Note how the '>= end' condition to end the interval check got converted
> into:
>
> + match = memtype_interval_iter_next(match, start, end);
>
> This is subtly off by one, because the interval tree interfaces require
> closed-interval parameters:
>
> include/linux/interval_tree_generic.h
>
> /* \
> * Iterate over intervals intersecting [start;last] \
> * \
> * Note that a node's interval intersects [start;last] iff: \
> * Cond1: ITSTART(node) <= last \
> * and \
> * Cond2: start <= ITLAST(node) \
> */ \
>
> ...
>
> if (ITSTART(node) <= last) { /* Cond1 */ \
> if (start <= ITLAST(node)) /* Cond2 */ \
> return node; /* node is leftmost match */ \
>
> [start;last] is a closed interval (note the '<= last' check) - while the
> PAT 'end' parameter is one byte beyond the end of the range: ioremap()
> and the other mapping APIs usually use the [start,end) half-open
> interval, derived from 'size'.
>
> This is what ioremap() does for example:
>
> /*
> * Mappings have to be page-aligned
> */
> offset = phys_addr & ~PAGE_MASK;
> phys_addr &= PHYSICAL_PAGE_MASK;
> size = PAGE_ALIGN(last_addr+1) - phys_addr;
>
> retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
> pcm, &new_pcm);
>
>
> phys_addr+size will be on a page boundary, after the last byte of the
> mapped interval.
>
> So the correct parameter to use in the interval tree searches is not
> 'end' but 'end-1'.
>
> This matters if conflicting PAT ranges are exactly adjacent - for
> example, if a future WC region immediately follows an already mapped
> UC- region, memtype_check_conflict() would incorrectly report a conflict
> and downgrade the new region's memtype to UC-.
>
> BTW., rather annoyingly this downgrading is done silently in
> memtype_check_insert():
>
> int memtype_check_insert(struct memtype *new,
> enum page_cache_mode *ret_type)
> {
> int err = 0;
>
> err = memtype_check_conflict(new->start, new->end, new->type, ret_type);
> if (err)
> return err;
>
> if (ret_type)
> new->type = *ret_type;
>
> memtype_interval_insert(new, &memtype_rbroot);
> return 0;
> }
>
>
> So on such a conflict we'd just silently get UC- in *ret_type, and write
> it into the new region, never the wiser ...
>
> So assuming that the patch below fixes the primary bug, the diagnostics
> side of ioremap() cache attribute downgrades would be another thing to
> fix.
>
> Anyway, I checked all the interval-tree iterations, and most of them are
> off by one - but I think the one related to memtype_check_conflict() is
> the one causing this particular performance regression.
>
> The only correct interval-tree searches were these two:
>
> arch/x86/mm/pat_interval.c: match = memtype_interval_iter_first(&memtype_rbroot, 0, ULONG_MAX);
> arch/x86/mm/pat_interval.c: match = memtype_interval_iter_next(match, 0, ULONG_MAX);
>
> The ULONG_MAX was hiding the off-by-one in plain sight. :-)
>
> So it would be nice if everyone who is seeing this bug could test the
> patch below against Linus's latest tree - does it fix the regression?
>
> If not, then please send the before/after dump of
> /sys/kernel/debug/x86/pat_memtype_list - and even if it works, please
> send the dumps so we can double-check it all.
>
> Note that the bug was benign in the sense of implementing a too strict
> cache attribute conflict policy and downgrading cache attributes - so
> AFAICS the worst outcome of this bug would be a performance regression.
>
> Patch is only lightly tested, so take care. (Patch is emphatically not
> signed off yet, because I spent most of the day on this and I don't yet
> trust my fix - all of the affected sites need to be reviewed more
> carefully.)
>
> Thanks,
>
> Ingo
>
>
> ====================>
> From: Ingo Molnar <mingo@xxxxxxxxxx>
> Date: Sun, 1 Dec 2019 15:25:50 +0100
> Subject: [PATCH] x86/pat: Fix off-by-one bugs in interval tree search
>
> NOT-Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
> ---
> arch/x86/mm/pat_interval.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/mm/pat_interval.c b/arch/x86/mm/pat_interval.c
> index 47a1bf30748f..6855362eaf21 100644
> --- a/arch/x86/mm/pat_interval.c
> +++ b/arch/x86/mm/pat_interval.c
> @@ -56,7 +56,7 @@ static struct memtype *memtype_match(u64 start, u64 end, int match_type)
> {
> struct memtype *match;
>
> - match = memtype_interval_iter_first(&memtype_rbroot, start, end);
> + match = memtype_interval_iter_first(&memtype_rbroot, start, end-1);
> while (match != NULL && match->start < end) {
> if ((match_type == MEMTYPE_EXACT_MATCH) &&
> (match->start == start) && (match->end == end))
> @@ -66,7 +66,7 @@ static struct memtype *memtype_match(u64 start, u64 end, int match_type)
> (match->start < start) && (match->end == end))
> return match;
>
> - match = memtype_interval_iter_next(match, start, end);
> + match = memtype_interval_iter_next(match, start, end-1);
> }
>
> return NULL; /* Returns NULL if there is no match */
> @@ -79,7 +79,7 @@ static int memtype_check_conflict(u64 start, u64 end,
> struct memtype *match;
> enum page_cache_mode found_type = reqtype;
>
> - match = memtype_interval_iter_first(&memtype_rbroot, start, end);
> + match = memtype_interval_iter_first(&memtype_rbroot, start, end-1);
> if (match == NULL)
> goto success;
>
> @@ -89,12 +89,12 @@ static int memtype_check_conflict(u64 start, u64 end,
> dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
> found_type = match->type;
>
> - match = memtype_interval_iter_next(match, start, end);
> + match = memtype_interval_iter_next(match, start, end-1);
> while (match) {
> if (match->type != found_type)
> goto failure;
>
> - match = memtype_interval_iter_next(match, start, end);
> + match = memtype_interval_iter_next(match, start, end-1);
> }
> success:
> if (newtype)
> @@ -160,7 +160,7 @@ struct memtype *memtype_erase(u64 start, u64 end)
> struct memtype *memtype_lookup(u64 addr)
> {
> return memtype_interval_iter_first(&memtype_rbroot, addr,
> - addr + PAGE_SIZE);
> + addr + PAGE_SIZE-1);
> }
>
> #if defined(CONFIG_DEBUG_FS)