Re: [PATCH] memory-hotplug: Fix bad area access on dissolve_free_huge_pages()

From: Rui Teng
Date: Tue Sep 20 2016 - 10:46:03 EST

On 9/17/16 12:25 AM, Dave Hansen wrote:

> That's an interesting data point, but it still doesn't quite explain
> what is going on.
>
> It seems like there might be parts of gigantic pages that have
> PageHuge() set on tail pages, while other parts don't. If that's true,
> we have another bug and your patch just papers over the issue.
>
> I think you really need to find the root cause before we apply this patch.

The root cause is that the test script
(tools/testing/selftests/memory-hotplug/) changes the online/offline
status of memory blocks other than the one holding the head page. It
*randomly* selects 10% of the memory blocks from
/sys/devices/system/memory/memory* and changes their online/offline
status.

On my system, the memory block size is 0x10000000 (256MB):
[root@elvis-n01-kvm memory]# cat block_size_bytes
10000000

But the huge page size (16G) is larger than this memory block size, so
one huge page is made up of several memory blocks, for example
memory704, memory705, memory706 and so on. memory704 then contains the
head page, but memory705 contains *only* tail pages. So the problem
happens on memory705 if we run:
# echo offline > memory705/state

That's why we need a PageHead() check now, and why this problem does
not happen on systems with smaller huge pages such as 16M.
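
To make the block arithmetic concrete, here is a rough userspace sketch
(not kernel code). The helper names are hypothetical, and it assumes
that memory block N starts at physical address N * block size and that
huge pages are aligned to their own size:

```c
#include <stdbool.h>
#include <stdint.h>

#define MEMORY_BLOCK_SIZE 0x10000000ULL  /* block_size_bytes above: 256MB */
#define HPAGE_16G         (16ULL << 30)  /* gigantic page size in this report */
#define HPAGE_16M         (16ULL << 20)  /* the smaller huge page size */

/* How many memory blocks one huge page of hpage_size covers (at least 1). */
static uint64_t blocks_per_huge_page(uint64_t hpage_size)
{
	return hpage_size <= MEMORY_BLOCK_SIZE ? 1 : hpage_size / MEMORY_BLOCK_SIZE;
}

/*
 * Hypothetical helper: can memory block `block` consist only of tail pages?
 * Assumption: block N starts at N * MEMORY_BLOCK_SIZE and huge pages are
 * aligned to hpage_size.
 */
static bool block_can_be_tail_only(uint64_t block, uint64_t hpage_size)
{
	if (hpage_size <= MEMORY_BLOCK_SIZE)
		return false;  /* every huge page starts inside its own block */
	return (block * MEMORY_BLOCK_SIZE) % hpage_size != 0;
}
```

Under these assumptions a 16G page covers 64 memory blocks, memory704
is the start of such a range, and memory705 can hold tail pages only.
With 16M huge pages no block can be tail-only.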

As for PageHuge() being set, I think PageHuge() returns true for all
tail pages, because it first gets the compound_head of the tail page
and then checks the head's huge page flag:
page = compound_head(page);
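
A simplified userspace model of that lookup, only to illustrate why a
tail page passes a PageHuge()-style test but fails a PageHead()-style
one. struct page_model and the model_* functions are hypothetical
stand-ins, not the kernel's struct page or its real implementations:

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct page; NOT the kernel's layout. */
struct page_model {
	struct page_model *head; /* set on tail pages, NULL otherwise */
	bool is_head;            /* first page of a compound page */
	bool hugetlb;            /* recorded on the head page only */
};

static struct page_model *model_compound_head(struct page_model *page)
{
	return page->head ? page->head : page;
}

static bool model_page_head(const struct page_model *page)
{
	return page->is_head;
}

/* Resolve the head first, then test the flag stored on the head. */
static bool model_page_huge(struct page_model *page)
{
	page = model_compound_head(page);
	return page->is_head && page->hugetlb;
}

/* Count the pages in [pages, pages + n) that are huge AND head pages. */
static size_t model_count_heads(struct page_model *pages, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (model_page_huge(&pages[i]) && model_page_head(&pages[i]))
			count++;
	return count;
}
```

In this model, a range that holds only tail pages (like memory705)
reports every page as huge, but a scan that also requires a head page
finds nothing to act on there.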

And as for the failure message: if a memory block is in use, an
attempt to offline it returns failure.