Re: [PATCH 3/4] Check whether pages are poisoned before copying

From: Jin Dongming
Date: Fri Mar 18 2011 - 01:26:41 EST


Hi, Andi

(2011/03/18 0:21), Andi Kleen wrote:
>> At least copy to the last page of the huge page is performed after
>> all preceding copies are finished. So I'm not sure it is really
>> "a few" or not.
>> Still I think making the window smaller than now is worthwhile,
>> no matter it is change from 0.1% to 0.01%, or from 0.01% to 0.001%.
>
> Note that hwpoison will never reach 100% coverage. That's impossible.
> But to get nearer to 100% it's better to concentrate on the paths
> that affect long time windows and significant amounts of memory.
> What those are is often non-obvious and needs measurements.
>
>>
>> Or did you find the downside of the check here?
>
> The usual problem is how to test it. That tends to be harder
> than just writing the code. If it's not tested it's probably
> not worth having.
>

We tested this with our own test method, and the problem did occur
as we expected.

The method needs a kernel part and a user part, as listed below.
1. Kernel part
   A. Debug interface
      - Check whether the THP-aligned page belongs to a THP.
      - Set the position of the page to be poisoned.
      - Set the flag that selects whether a 4K page or the THP in the
        khugepaged daemon will be poisoned.
      - Split the requested THP into 4K pages.

   B. A daemon, poison_sched
      The poison_sched daemon calls memory_failure().

   C. Changes in khugepaged for debugging
      - Check whether the requested page will be collapsed.
      - Set the poison information for the poison_sched daemon
        when the requested page will be collapsed.
      - Print the poison information to the kernel log
        when the page has been poisoned.

2. User part
   A test application (APL)
   - Request memory which may contain a THP.
   - Set the test conditions through the debug interface.
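
As a rough illustration of the "request memory which may contain a THP"
step (this is not the actual test APL), a user program could map a
2 MiB-aligned anonymous region and hint it with madvise(MADV_HUGEPAGE).
The helper name alloc_thp_candidate() and the 2 MiB size are assumptions
for the sketch; whether the region really becomes a THP still has to be
checked separately, e.g. with the debug interface described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages, as on x86_64. */
#define THP_SIZE (2UL * 1024 * 1024)

/*
 * Hypothetical helper: return a THP_SIZE-aligned, THP_SIZE-long anonymous
 * region that is a candidate for THP backing, or NULL on failure.
 * The unused head and tail of the mapping stay mapped, which is
 * acceptable for a short-lived test program.
 */
void *alloc_thp_candidate(void)
{
    /* Over-allocate so that an aligned 2 MiB window always fits inside. */
    size_t len = 2 * THP_SIZE;
    char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    /* Round up to the next THP_SIZE boundary. */
    uintptr_t aligned = ((uintptr_t)raw + THP_SIZE - 1) & ~(THP_SIZE - 1);
    assert((aligned & (THP_SIZE - 1)) == 0);
    char *p = (char *)aligned;

#ifdef MADV_HUGEPAGE
    /* Only a hint; a failure (e.g. THP not configured) is ignored here. */
    (void)madvise(p, THP_SIZE, MADV_HUGEPAGE);
#endif

    /* Touch the whole range so the pages are actually faulted in. */
    memset(p, 0x5a, THP_SIZE);
    return p;
}
```

The hint does not guarantee a huge page, which is why step 1 of the test
below restarts the APL until a THP is really mapped.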

The steps of our test are as follows:
1. The APL requests memory and checks through the debug interface whether
the THP-aligned page is a THP. If it is not, the APL is restarted
until a THP is mapped.

2. The APL sets, through the debug interface, the position of the page to
be poisoned and the flag that selects whether a 4K page or the THP in
the khugepaged daemon is poisoned.

3. The APL requests, through the debug interface, a split of the requested
THP. Here the kernel must remember the address and pfn of the split THP
for the later page collapse.

(Waiting for page collapse ...)

4. When the khugepaged daemon collapses the remembered split THP address
and pfn, it sets the poison information for the poison_sched daemon.

5. The khugepaged daemon continues its work, and at the same time the
poison_sched daemon calls memory_failure() to deal with the
poisoned page.

6. The khugepaged daemon prints the poison information to the kernel log.
We then check by ourselves whether the APL has been killed.
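
The race exercised by the steps above can be pictured with a small
user-space toy model. This is only a sketch and not kernel code: struct
toy_page, struct toy_inject and toy_collapse_copy() are made-up names, the
poisoned flag merely stands in for the kernel's per-page hwpoison state,
and returning -1 on detection is an assumption (what the real patch does
on detection is not described in this message). The injection models
poison_sched calling memory_failure() while the collapse copy is running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SUBPAGES   512    /* assumption: 4K pages per 2 MiB THP (x86_64) */
#define SUBPAGE_SZ 4096

/* Toy stand-in for a 4K page: a poison flag plus the page contents. */
struct toy_page {
    bool poisoned;
    unsigned char data[SUBPAGE_SZ];
};

/* Models the racing poison event during the copy loop. */
struct toy_inject {
    int before_copy;   /* fire just before copying this subpage index */
    int target;        /* index of the subpage that becomes poisoned */
};

/*
 * Copy SUBPAGES small pages into dst, one after another.
 * With check_poison set, the poison flag is tested right before each
 * copy and the whole copy is aborted (-1) when it is set.
 * Without it, the return value counts how many poisoned pages were read.
 */
int toy_collapse_copy(struct toy_page *src, unsigned char *dst,
                      const struct toy_inject *inj, bool check_poison)
{
    int consumed = 0;

    assert(src != NULL && dst != NULL);
    assert(inj == NULL || (inj->target >= 0 && inj->target < SUBPAGES));

    for (int i = 0; i < SUBPAGES; i++) {
        /* The "other daemon" poisons a page while the copy is in flight. */
        if (inj != NULL && inj->before_copy == i)
            src[inj->target].poisoned = true;

        if (src[i].poisoned) {
            if (check_poison)
                return -1;      /* detected before touching the bad page */
            consumed++;         /* unchecked copy reads poisoned memory */
        }
        memcpy(dst + (size_t)i * SUBPAGE_SZ, src[i].data, SUBPAGE_SZ);
    }
    return consumed;
}
```

In this model, a page poisoned after it has already been copied still
goes unnoticed, so the check narrows the window but does not close it.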

After confirming the problem above, we also ran the same test with the
patch set applied. We confirmed that the patch set resolves the problem.

Thanks.

Best Regards,
Jin Dongming

> -Andi

