I reproduced on next-20230127 (did not try upstream yet).
Upstream's fine; on next-20230127 (with David's repro) it bisects to
5ddaec50023e ("mm/mmap: remove __vma_adjust()"). I think I'd better
hand on to Liam, rather than delay you by puzzling over it further myself.
Thanks for identifying the problematic commit! ...
I think two key things are that a) THP are set to "always" and b) we have a
NUMA setup [I assume].
The relevant bits:
[ 439.886738] page:00000000c4de9000 refcount:513 mapcount:2 mapping:0000000000000000 index:0x20003 pfn:0x14ee03
[ 439.893758] head:000000003d5b75a4 order:9 entire_mapcount:0 nr_pages_mapped:511 pincount:0
[ 439.899611] memcg:ffff986dc4689000
[ 439.902207] anon flags: 0x17ffffc009003f(locked|referenced|uptodate|dirty|lru|active|head|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
[ 439.910737] raw: 0017ffffc0020000 ffffe952c53b8001 ffffe952c53b80c8 dead000000000400
[ 439.916268] raw: 0000000000000000 0000000000000000 0000000000000001 0000000000000000
[ 439.921773] head: 0017ffffc009003f ffffe952c538b108 ffff986de35a0010 ffff98714338a001
[ 439.927360] head: 0000000000020000 0000000000000000 00000201ffffffff ffff986dc4689000
[ 439.932341] page dumped because: VM_BUG_ON_PAGE(!first && (flags & ((rmap_t)((((1UL))) << (0)))))
Indeed, the mapcount of the subpage is 2 instead of 1. The subpage is only
mapped into a single page table (no fork() or similar).
Yes, that mapcount:2 is weird; and what's also weird is the index:0x20003:
what is remove_migration_pte(), in an mbind(0x20002000,...), doing with
index:0x20003?
I was assuming the whole folio would get migrated. As you raise below,
it's all a bit unclear once THP get involved and dealing with mbind()
and page migration.
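
For reference, the arithmetic behind the index question can be sketched as
below. This is only an illustration, not kernel code: the helper names are
made up, 4 KiB base pages and a 2 MiB (order-9) THP are assumed, and it
relies on the assumption that a page in a never-moved private anonymous
mapping has index == virtual address >> PAGE_SHIFT.

```c
#include <stdint.h>

/* Assumptions: 4 KiB base pages, order-9 (512-page) THP. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_THP_NR     512

/*
 * Assumed relation for a private anonymous mapping that was never
 * mremap()ed: page index == virtual address >> PAGE_SHIFT.
 */
static inline uint64_t sketch_addr_to_index(uint64_t addr)
{
	return addr >> SKETCH_PAGE_SHIFT;
}

static inline uint64_t sketch_index_to_addr(uint64_t index)
{
	return index << SKETCH_PAGE_SHIFT;
}

/* Offset of a subpage inside its naturally aligned THP. */
static inline uint64_t sketch_subpage_nr(uint64_t pfn_or_index)
{
	return pfn_or_index & (SKETCH_THP_NR - 1);
}

/* Is addr inside the half-open range [start, start + len)? */
static inline int sketch_in_range(uint64_t addr, uint64_t start, uint64_t len)
{
	return addr >= start && addr - start < len;
}
```

Under those assumptions, the mbind(0x20002000, 0x1000) range covers only
index 0x20002, while index 0x20003 corresponds to address 0x20003000, the
first page past the range. Both pfn 0x14ee03 and index 0x20003 are subpage 3
of the THP, so the dump is at least self-consistent about which subpage it
describes.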
I created this reduced reproducer that triggers 100%:
Very helpful, thank you.
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>
int main(void)
{
	mmap((void*)0x20000000ul, 0x1000000ul, PROT_READ|PROT_WRITE|PROT_EXEC,
	     MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0ul);
	madvise((void*)0x20000000ul, 0x1000000ul, MADV_HUGEPAGE);
	*(uint32_t*)0x20000080 = 0x80000;
	mlock((void*)0x20001000ul, 0x2000ul);
	mlock((void*)0x20000000ul, 0x3000ul);
It's not an mlock() issue in particular: quickly established by
substituting madvise(,, MADV_NOHUGEPAGE) for those mlock() calls.
Looks like a vma splitting issue now.
Gah, should have tried something like that first before suspecting it's
mlock related. :)
	mbind((void*)0x20002000ul, 0x1000ul, MPOL_LOCAL, NULL, 0x7fful,
	      MPOL_MF_MOVE);
I guess it will turn out not to be relevant to this particular syzbug,
but what do we expect an mbind() of just 0x1000 of a THP to do?
It's a subject I've wrestled with unsuccessfully in the past: I found
myself arriving at one conclusion (split THP) in one place, and a contrary
conclusion (widen range) in another place, and never had time to work out
one unified answer.
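
To make the "widen range" alternative concrete, here is a minimal sketch of
one possible reading of it: round the requested range out to THP boundaries.
It is not what the kernel does today, the struct and helper names are
hypothetical, and a 2 MiB THP size is assumed.

```c
#include <stdint.h>

/* Assumption: 2 MiB THP (PMD-sized on x86-64 with 4 KiB pages). */
#define SKETCH_THP_SIZE (UINT64_C(1) << 21)

/* Hypothetical half-open address range [start, end). */
struct sketch_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Widen [start, start + len) so that it covers whole THP-sized,
 * THP-aligned chunks: round start down and end up.
 */
static inline struct sketch_range sketch_widen_to_thp(uint64_t start,
						       uint64_t len)
{
	struct sketch_range r;

	r.start = start & ~(SKETCH_THP_SIZE - 1);
	r.end = (start + len + SKETCH_THP_SIZE - 1) & ~(SKETCH_THP_SIZE - 1);
	return r;
}
```

For the mbind(0x20002000, 0x1000) call in the reproducer, this would widen
the range to [0x20000000, 0x20200000), i.e. the whole first THP of the
mapping. The "split THP" alternative would instead keep the range as
requested and break the huge page up around it.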
I'm aware of a similar issue with long-term page pinning: we might want
to pin a 4k portion of a THP, but will end up blocking the whole THP
from getting migrated/swapped/split/freed/ ... until we unpin (ever?). I
wrote a reproducer [1] a while ago to show how you can effectively steal
most THP in the system with a comparatively small memlock limit, using
io_uring ...