Re: [PATCH] mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind

From: Yang Shi
Date: Tue Jun 18 2019 - 17:18:38 EST




On 6/18/19 11:28 AM, Michal Hocko wrote:
On Tue 18-06-19 10:06:54, Yang Shi wrote:

On 6/18/19 6:02 AM, Michal Hocko wrote:
[Cc networking people - see a question about setsockopt below]

On Tue 18-06-19 02:48:10, Yang Shi wrote:
When running syzkaller internally, we ran into the below bug on 4.9.x
kernel:

kernel BUG at mm/huge_memory.c:2124!
What is the BUG_ON because I do not see any BUG_ON neither in v4.9 nor
the latest stable/linux-4.9.y
The line number might be not exactly same with upstream 4.9 since there
might be some our internal patches.

It is line 2096 at mm/huge_memory.c in 4.9.182.
So it is
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
that is later mentioned that has been removed. Good. Thanks for the
clarification!

invalid opcode: 0000 [#1] SMP KASAN
[...]
Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
RIP [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
RSP <ffff88006899f980>

with the below test:

---8<---

uint64_t r[1] = {0xffffffffffffffff};

int main(void)
{
syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
intptr_t res = 0;
res = syscall(__NR_socket, 0x11, 3, 0x300);
if (res != -1)
r[0] = res;
*(uint32_t*)0x20000040 = 0x10000;
*(uint32_t*)0x20000044 = 1;
*(uint32_t*)0x20000048 = 0xc520;
*(uint32_t*)0x2000004c = 1;
syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
*(uint64_t*)0x20000340 = 2;
syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340,
0x45d4, 3);
return 0;
}

---8<---

Actually the test does:

mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
socket(AF_PACKET, SOCK_RAW, 768) = 3
setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0
Ughh. Do I get it right that that this setsockopt allows an arbitrary
contiguous memory allocation size to be requested by a unpriviledged
user? Or am I missing something that restricts there any restriction?
It needs CAP_NET_RAW to call socket() to set socket type to RAW. The test is
run by root user.
OK, good. That is much better. I just didn't see the capability check. I
can see one in packet_create but I do not see any in setsockopt. Maybe I
just got lost in indirection or implied security model.
[...]
Change migrate_page_add() to check if the page is movable or not, if it
is unmovable, just return -EIO. We don't have to check non-LRU movable
pages since just zsmalloc and virtio-baloon support this. And, they
should be not able to reach here.
You are not checking whether the page is movable, right? You only rely
on PageLRU check which is not really an equivalent thing. There are
movable pages which are not LRU and also pages might be off LRU
temporarily for many reasons so this could lead to false positives.
I'm supposed non-LRU movable pages could not reach here. Since most of them
are not mmapable, i.e. virtio-balloon, zsmalloc. zram device is mmapable,
but the page fault to that vma would end up allocating user space pages
which are on LRU. If I miss something please let me know.
That might be true right now but it is a very subtle assumption that
might break easily in the future. The point is still that even LRU pages
might be isolated from the LRU list temporarily and you do not want this
to cause the failure easily.

I used to have !__PageMovable(page), but it was removed since the aforementioned reason. I could add it back.

For the temporary off LRU page, I did a quick search, it looks the most paths have to acquire mmap_sem, so it can't race with us here. Page reclaim/compaction looks like the only race. But, since the mapping should be preserved even though the page is off LRU temporarily unless the page is reclaimed, so we should be able to exclude temporary off LRU pages by calling page_mapping() and page_anon_vma().

So, the fix may look like:

if (!PageLRU(head) && !__PageMovable(page)) {
ÂÂÂ if (!(page_mapping(page) || page_anon_vma(page)))
ÂÂÂ ÂÂÂ return -EIO;
}