Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()

From: Jane Chu
Date: Thu Aug 19 2021 - 16:51:09 EST



On 8/19/2021 2:10 AM, ruansy.fnst@xxxxxxxxxxx wrote:
From: Jane Chu <jane.chu@xxxxxxxxxx>

Sorry, correction inline.

On 8/19/2021 12:18 AM, Jane Chu wrote:
Hi, Shiyang,

> > 1) What does it take and cost to make
> > xfs_sb_version_hasrmapbt(&mp->m_sb) return true?
>
> Enable the rmapbt feature when making the xfs filesystem:
>     `mkfs.xfs -m rmapbt=1 /path/to/device`
> BTW, reflink is enabled by default.

Thanks! I tried
    mkfs.xfs -d agcount=2,extszinherit=512,su=2m,sw=1 -m reflink=0 -m rmapbt=1 -f /dev/pmem0

Again, I injected HW poison into the first page of a DAX file, had the
poison consumed, and received a SIGBUS. The result is better:

** SIGBUS(7): canjmp=1, whichstep=0, **
** si_addr(0x0x7ff2d8800000), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **

The SIGBUS payload looks correct.
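For reference, si_addr_lsb is the log2 of the extent of the reported corruption, so 0x15 (21) means a 2MiB, PMD-sized blast radius, and si_addr 0x7ff2d8800000 is already 2MiB-aligned. A minimal userspace sketch of that decoding follows; poison_extent() and poison_base() are hypothetical helper names, not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: size in bytes of the region covered by a
 * BUS_MCEERR_* report, given siginfo_t.si_addr_lsb (log2 of the extent). */
static uint64_t poison_extent(unsigned int addr_lsb)
{
    assert(addr_lsb < 64);              /* keep the shift in range */
    return (uint64_t)1 << addr_lsb;
}

/* Hypothetical helper: start of the poisoned region containing addr,
 * i.e. addr rounded down to a multiple of the extent. */
static uint64_t poison_base(uint64_t addr, unsigned int addr_lsb)
{
    return addr & ~(poison_extent(addr_lsb) - 1);
}
```

In a SIGBUS handler installed with sigaction() and SA_SIGINFO, these would be fed (uintptr_t)info->si_addr and info->si_addr_lsb when info->si_code is BUS_MCEERR_AR; for the payload above that gives a 0x200000-byte region starting at 0x7ff2d8800000.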

However, dmesg shows 2048 lines about sending SIGBUS, one per 512 bytes:

Actually, that's one per 2MB, even though the poison is located only in
pfn 0x1850600.


[ 7003.482326] Memory failure: 0x1850600: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.507956] Memory failure: 0x1850800: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.531681] Memory failure: 0x1850a00: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.554190] Memory failure: 0x1850c00: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.575831] Memory failure: 0x1850e00: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.596796] Memory failure: 0x1851000: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
....
[ 7045.738270] Memory failure: 0x194fe00: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.758885] Memory failure: 0x1950000: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.779495] Memory failure: 0x1950200: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.800106] Memory failure: 0x1950400: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
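The pfns in the log advance by 0x200 per message. Assuming 4KiB base pages (as on x86-64), that stride is 0x200 x 4KiB = 2MiB, i.e. one message per PMD-sized range, and the poisoned pfn 0x1850600 is itself a multiple of 0x200 pages, so it starts a PMD. A small C check of that arithmetic; the page size constant and both helpers are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12   /* assumption: 4KiB base pages, as on x86-64 */

/* Hypothetical helper: bytes spanned between two consecutive pfns in the log. */
static uint64_t pfn_stride_bytes(uint64_t pfn_a, uint64_t pfn_b)
{
    assert(pfn_b > pfn_a);
    return (pfn_b - pfn_a) << PAGE_SHIFT_4K;
}

/* Hypothetical helper: nonzero if pfn is the first page of a 2MiB (PMD) range. */
static int pfn_is_pmd_aligned(uint64_t pfn)
{
    const uint64_t pages_per_pmd = (2ULL << 20) >> PAGE_SHIFT_4K; /* 512 */
    return (pfn % pages_per_pmd) == 0;
}
```

For example, pfn_stride_bytes(0x1850600, 0x1850800) is 0x200000 (2MiB), and pfn_is_pmd_aligned(0x1850600) is true.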

That's too much for a single process dealing with a single poison in a
PMD page. If nothing else, given an .si_addr_lsb of 0x15, it doesn't
make sense to send a SIGBUS per 512B block.

Could you determine the user process' mapping size from the filesystem
and use it as a hint to determine how many times to call
mf_dax_kill_procs()?

Sorry, scratch the 512-byte stuff... the filesystem has been notified of
the length of the poison blast radius; could it take a clue from that?
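Taking the notified length as the bound, the number of per-mapping iterations is just the number of mapping-size chunks the poisoned range touches. A minimal sketch of that count follows; chunks_covering() is a hypothetical helper, not code from this series, it does not call mf_dax_kill_procs(), and it assumes the stride (mapping size) is a power of two such as 4KiB or 2MiB.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of stride-aligned chunks overlapped by the
 * byte range [off, off + len). stride must be a nonzero power of two. */
static uint64_t chunks_covering(uint64_t off, uint64_t len, uint64_t stride)
{
    assert(stride != 0 && (stride & (stride - 1)) == 0);
    if (len == 0)
        return 0;
    uint64_t first = off & ~(stride - 1);              /* chunk holding first byte */
    uint64_t last  = (off + len - 1) & ~(stride - 1);  /* chunk holding last byte */
    return (last - first) / stride + 1;
}
```

For a single 4KiB poison at the start of a PMD mapping, chunks_covering(0, 4096, 2ULL << 20) is 1. Walking a full 2MiB PMD in 4KiB block-size steps instead, chunks_covering(0, 2ULL << 20, 4096), gives 512.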

I think this is caused by a mistake I made in the 6th patch: the xfs handler iterates over the file range in block-size steps (4k here) even though it is a PMD page. That's why so many messages show up when the poison is on a PMD page. I'll fix it in the next version.


Sorry, just to clarify: it looks like XFS has iterated throughout the
entire file in 2MiB strides. The test file size is 4GiB, which explains
dmesg showing 2048 lines about sending SIGBUS.
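The count is consistent with a full-file walk at PMD granularity. Assuming the first and last dmesg lines quoted above bracket the whole run, the log's own pfn range gives the same number as the file size divided by the stride:

```latex
\[
\frac{4\,\mathrm{GiB}}{2\,\mathrm{MiB}} = \frac{2^{32}}{2^{21}} = 2^{11} = 2048
\]
\[
\frac{\texttt{0x1950400} - \texttt{0x1850600}}{\texttt{0x200}} + 1
  = \frac{\texttt{0xFFE00}}{\texttt{0x200}} + 1
  = \texttt{0x7FF} + 1 = 2048
\]
```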

thanks,
-jane



--
Thanks,
Ruan.

