Re: [PATCH v4 4/4] selftests/mm: add tests for HWPOISON hugetlbfs read

From: Muhammad Usama Anjum
Date: Wed Jan 10 2024 - 05:15:47 EST


On 1/10/24 11:49 AM, Muhammad Usama Anjum wrote:
> On 1/6/24 2:13 AM, Jiaqi Yan wrote:
>> On Thu, Jan 4, 2024 at 10:27 PM Muhammad Usama Anjum
>> <usama.anjum@xxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to convert this test to TAP as I think the failures sometimes go
>>> unnoticed on CI systems if we only depend on the return value of the
>>> application. I've enabled the following configurations which aren't already
>>> present in tools/testing/selftests/mm/config:
>>> CONFIG_MEMORY_FAILURE=y
>>> CONFIG_HWPOISON_INJECT=m
>>>
>>> I'll send a patch to add these configs later. Right now I'm trying to
>>> investigate the failure when the test injects a poisoned page with
>>> madvise(MADV_HWPOISON). It fails with EBUSY ("Device or resource busy")
>>> every single time. The test fails because it doesn't expect the hugetlb
>>> memory to be busy. I'm not sure whether the poison handling code has
>>> issues or the test isn't robust enough.
>>>
>>> ./hugetlb-read-hwpoison
>>> Write/read chunk size=0x800
>>> ... HugeTLB read regression test...
>>> ... ... expect to read 0x200000 bytes of data in total
>>> ... ... actually read 0x200000 bytes of data in total
>>> ... HugeTLB read regression test...TEST_PASSED
>>> ... HugeTLB read HWPOISON test...
>>> [ 9.280854] Injecting memory failure for pfn 0x102f01 at process virtual address 0x7f28ec101000
>>> [ 9.282029] Memory failure: 0x102f01: huge page still referenced by 511 users
>>> [ 9.282987] Memory failure: 0x102f01: recovery action for huge page: Failed
>>> ... !!! MADV_HWPOISON failed: Device or resource busy
>>> ... HugeTLB read HWPOISON test...TEST_FAILED
>>>
>>> I'm testing on v6.7-rc8. Not sure if this was working previously or not.
>>
>> Thanks for reporting this, Usama!
>>
>> I am also able to repro MADV_HWPOISON failure at "501a06fe8e4c
>> (akpm/mm-stable, mm-stable) zswap: memcontrol: implement zswap
>> writeback disabling."
>>
>> Then I checked out the earliest commit "ba91e7e5d15a (HEAD -> Base)
>> selftests/mm: add tests for HWPOISON hugetlbfs read". The
>> MADV_HWPOISON injection works and the test passes:
>>
>> ... HugeTLB read HWPOISON test...
>> ... ... expect to read 0x101000 bytes of data in total
>> ... !!! read failed: Input/output error
>> ... ... actually read 0x101000 bytes of data in total
>> ... HugeTLB read HWPOISON test...TEST_PASSED
>> ... HugeTLB seek then read HWPOISON test...
>> ... ... init val=4 with offset=0x102000
>> ... ... expect to read 0xfe000 bytes of data in total
>> ... ... actually read 0xfe000 bytes of data in total
>> ... HugeTLB seek then read HWPOISON test...TEST_PASSED
>> ...
>>
>> [ 2109.209225] Injecting memory failure for pfn 0x3190d01 at process virtual address 0x7f75e3101000
>> [ 2109.209438] Memory failure: 0x3190d01: recovery action for huge page: Recovered
>> ...
>>
>> I think some commit in between broke MADV_HWPOISON on hugetlbfs, and we
>> should be able to find it via bisection (and of course by reading the
>> commits between the two, probably something related to page
>> refcount).
> Thank you for this information.
>
>>
>> That being said, I will be on vacation from tomorrow until the end of
>> next week. So I will get back to this after next weekend. Meanwhile if
>> you want to go ahead and bisect the problematic commit, that will be
>> very much appreciated.
> I'll try to bisect and post here if I find something.
Found the culprit commit by bisection:

a08c7193e4f18dc8508f2d07d0de2c5b94cb39a3
mm/filemap: remove hugetlb special casing in filemap.c

hugetlb-read-hwpoison started failing with this patch. I've added the
author of the patch to this bug report.
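
For the TAP conversion mentioned at the top of the thread, per-step
reporting could look roughly like the sketch below. tap_report() and its
parameters are hypothetical names made up for illustration. It is not the
kselftest API and not code from the test. It only shows why a "not ok" line
is harder to miss on CI than a bare exit status.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption, not the kselftest API): format the
 * outcome of one test step as TAP. 'err' is 0 on success, or the errno
 * value saved right after the failing call, e.g. EBUSY from
 * madvise(MADV_HWPOISON).
 * Returns what snprintf() returns: the length the full text needs.
 */
static int tap_report(char *buf, size_t len, unsigned int num,
		      const char *name, int err)
{
	if (err == 0)
		return snprintf(buf, len, "ok %u %s\n", num, name);

	/* Diagnostic line first, then the result line a TAP parser keys on. */
	return snprintf(buf, len, "# %s failed: %s\nnot ok %u %s\n",
			name, strerror(err), num, name);
}
```

A caller would save errno immediately after the failing call and pass it in,
then print the buffer. In the actual selftest the existing helpers in
tools/testing/selftests/kselftest.h, such as ksft_test_result_pass() and
ksft_test_result_fail(), would be used instead of a hand-rolled formatter.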

>
>>
>> Thanks,
>> Jiaqi
>>
>>
>>>
>>> Regards,
>>> Usama
>>>