Re: [PATCH] mapletree-vs-khugepaged

From: Sven Schnelle
Date: Sun May 15 2022 - 16:02:58 EST


Liam Howlett <liam.howlett@xxxxxxxxxx> writes:

> * Sven Schnelle <svens@xxxxxxxxxxxxx> [220513 10:46]:
>> Starting today we're still seeing the same crash with linux-next from
>> (next-20220513):
>>
>> [ 211.937897] CPU: 7 PID: 535 Comm: pt_upgrade Not tainted 5.18.0-rc6-11648-g76535d42eb53-dirty #732
>> [ 211.937902] Unable to handle kernel pointer dereference in virtual kernel address space
>> [ 211.937903] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
>> [ 211.937906] Failing address: 0e00000000000000 TEID: 0e00000000000803
>> [ 211.937909] Krnl PSW : 0704c00180000000 0000001ca52f06d6
>> [ 211.937910] Fault in home space mode while using kernel ASCE.
>> [ 211.937917] AS:0000001ca6e24007 R3:0000001fffff0007 S:0000001ffffef800 P:000000000000003d
>> [ 211.937914] (mmap_region+0x19e/0x848)
>> [ 211.937929] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>> [ 211.937939] Krnl GPRS: 0000000000000000 0e00000000000000 0000000000000000 0000000000000000
>> [ 211.937942] ffffffff00000f0f ffffffffffffffff 0e00000000000000 0000040000001000
>> [ 211.937945] 0000000083551900 0000040000000000 00000000000000fb 000003800070fc58
>> [ 211.937947] 000000008f490000 0000000000000000 0000001ca52f06b6 000003800070fb48
>> [ 211.937959] Krnl Code: 0000001ca52f06c6: a7740021 brc 7,0000001ca52f0708
>> [ 211.937959] 0000001ca52f06ca: ec6801b3007c cgij %r6,0,8,0000001ca52f0a30
>> [ 211.937959] #0000001ca52f06d0: e310f0f80004 lg %r1,248(%r15)
>> [ 211.937959] >0000001ca52f06d6: e37010000020 cg %r7,0(%r1)
>> [ 211.937959] 0000001ca52f06dc: a78400ea brc 8,0000001ca52f08b0
>> [ 211.937959] 0000001ca52f06e0: e310f0f00004 lg %r1,240(%r15)
>> [ 211.937959] 0000001ca52f06e6: ec180008007c cgij %r1,0,8,0000001ca52f06f6
>> [ 211.937959] 0000001ca52f06ec: e39010080020 cg %r9,8(%r1)
>> [ 211.937973] Call Trace:
>> [ 211.937975] [<0000001ca52f06d6>] mmap_region+0x19e/0x848
>> [ 211.937978] ([<0000001ca52f06b6>] mmap_region+0x17e/0x848)
>> [ 211.937981] [<0000001ca52f116a>] do_mmap+0x3ea/0x4c8
>> [ 211.937983] [<0000001ca52bed12>] vm_mmap_pgoff+0xda/0x178
>> [ 211.937987] [<0000001ca52ed5ea>] ksys_mmap_pgoff+0x62/0x238
>> [ 211.937989] [<0000001ca52ed992>] __s390x_sys_old_mmap+0x7a/0xa0
>> [ 211.937993] [<0000001ca5c4ef5c>] __do_syscall+0x1d4/0x200
>> [ 211.937999] [<0000001ca5c5d572>] system_call+0x82/0xb0
>> [ 211.938002] Last Breaking-Event-Address:
>> [ 211.938003] [<0000001ca5888616>] mas_prev+0xb6/0xc0
>> [ 211.938010] Oops: 0038 ilc:3 [#2]
>> [ 211.938011] Kernel panic - not syncing: Fatal exception: panic_on_oops
>> [ 211.938012] SMP
>> [ 211.938014] Modules linked in:
>> 07: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 0000001C
>> A50679A6
>>
>> Is that issue supposed to be fixed? git bisect pointed me to
>>
>> # bad: [76535d42eb53485775a8c54ea85725812b75543f] Merge branch
>> 'mm-everything' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>>
>> which isn't really helpful.
>>
>> Anything we could help with debugging this?
>
> I tested the maple tree on top of the s390 as it was the same crash and
> it was okay. I haven't tested the mm-everything branch though. Can you
> test mm-unstable?

Yes, I tested mm-unstable but wasn't able to reproduce the issue.

> I'll continue setting up a sparc VM for testing here and test
> mm-everything on that and the s390

One thing that is different compared to x86 is that both sparc and s390
are big endian. Not sure whether and where that would make a difference.

The code to trigger the crash on s390 is rather simple: Just force a
paging level upgrade to 5 levels by calling mmap() with an address that
doesn't fit in 3 levels. I haven't tested whether an upgrade to 4 levels
would be sufficient. I've condensed our test case that triggers this, and
basically all that is required is:

--------------------------------8<---------------------------------------
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <stdio.h>

#define PAGE_SIZE 0x1000
#define _REGION1_SIZE (1UL << 54)

int main(int argc, char *argv[])
{
	int pid, status;
	void *addr;

	pid = fork();
	if (pid == 0) {
		/*
		 * Trigger page table level upgrade
		 */
		addr = mmap((void *)_REGION1_SIZE, PAGE_SIZE, PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (addr == MAP_FAILED)
			return 1;
		*(int *)addr = 1;
		return 0;
	}
	wait(&status);
	return 0;
}
--------------------------------8<---------------------------------------
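For trying the 4-level case, here is a rough sketch of which hint
address should ask for which page table depth. It assumes a 3-level
address space ends at 1UL << 42 and a 4-level one at 1UL << 53; those
limits and the toy_* names are assumptions for illustration, not taken
from the test case above.

```c
#include <stdint.h>

/* Assumed s390 address-space limits per page table depth (not verified). */
#define TOY_3LEVEL_LIMIT (UINT64_C(1) << 42)
#define TOY_4LEVEL_LIMIT (UINT64_C(1) << 53)

/* Return the number of paging levels needed to map the given address. */
static int toy_levels_needed(uint64_t addr)
{
	if (addr < TOY_3LEVEL_LIMIT)
		return 3;
	if (addr < TOY_4LEVEL_LIMIT)
		return 4;
	return 5;
}
```

Under those assumptions the 1UL << 54 used above lands in the 5-level
range, and something like 1UL << 45 would be a candidate address for
checking whether a 4-level upgrade already triggers the crash.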

I've added a few debug statements to the maple tree code:

[ 27.769641] mas_next_entry: offset=14
[ 27.769642] mas_next_nentry: entry = 0e00000000000000, slots=0000000090249f80, mas->offset=15 count=14

I see in mas_next_nentry() that there's a while loop that iterates over
the (used?) slots until count is reached. After that loop mas_next_entry()
just picks the next (unused?) entry, which is slot 15 in that case.
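To make the suspected overrun concrete, here is a toy model of the
stepping logic. This is not the maple tree code: toy_node,
toy_next_entry() and the assumption that "count" is the offset of the
last slot holding data are all mine.

```c
#include <stddef.h>

#define TOY_SLOTS 16	/* assumed slot count of a 64-bit range node */

struct toy_node {
	void *slot[TOY_SLOTS];
	unsigned char end;	/* assumed: offset of the last slot with data */
};

/*
 * Step to the entry after *offset. Returns NULL without advancing once
 * *offset has reached the data end, instead of reading the slot behind it.
 */
static void *toy_next_entry(const struct toy_node *node, unsigned char *offset)
{
	if (*offset >= node->end)
		return NULL;	/* a real walker would move to the next node */

	(*offset)++;
	return node->slot[*offset];
}
```

With end = 14 and *offset = 14 this returns NULL and never touches
slot 15, whereas the debug output above shows offset 15 being read.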

What I noticed while scanning over include/linux/maple_tree.h is:

struct maple_range_64 {
	struct maple_pnode *parent;
	unsigned long pivot[MAPLE_RANGE64_SLOTS - 1];
	union {
		void __rcu *slot[MAPLE_RANGE64_SLOTS];
		struct {
			void __rcu *pad[MAPLE_RANGE64_SLOTS - 1];
			struct maple_metadata meta;
		};
	};
};

and struct maple_metadata is:

struct maple_metadata {
	unsigned char end;
	unsigned char gap;
};

If I swap the gap and end members, 0x0e00000000000000 becomes
0x000e000000000000, and 0xe matches our mas->offset of 14 above.
So it looks like mas_next() in mmap_region() returns the metadata
for the node.

So from the lines above you have likely already guessed that I have no
clue how the maple tree works, and I didn't have enough time today to
read all the magic and understand it. But I thought I'd just drop my
observation here in case someone has an idea.

Thanks,
Sven