Re: [PATCH v1] mm, pagemap: expose hwpoison entry

From: David Hildenbrand
Date: Wed Oct 27 2021 - 03:15:43 EST


On 27.10.21 09:02, Peter Xu wrote:
> On Wed, Oct 27, 2021 at 03:45:13PM +0900, Naoya Horiguchi wrote:
>> On Wed, Oct 27, 2021 at 10:09:03AM +0800, Peter Xu wrote:
>>> On Wed, Oct 27, 2021 at 08:27:36AM +0900, Naoya Horiguchi wrote:
>>>> On Mon, Oct 04, 2021 at 11:32:28PM +0900, Naoya Horiguchi wrote:
>>>>> On Mon, Oct 04, 2021 at 01:55:30PM +0200, David Hildenbrand wrote:
>>>>>> On 04.10.21 13:50, Naoya Horiguchi wrote:
>>>> ...
>>>>>>>
>>>>>>> Hwpoison entry for hugepage is also exposed by this patch. The below
>>>>>>> example shows how pagemap is visible in the case where a memory error
>>>>>>> hit a hugepage mapped to a process.
>>>>>>>
>>>>>>> $ ./page-types --no-summary --pid $PID --raw --list --addr 0x700000000+0x400
>>>>>>> voffset offset len flags
>>>>>>> 700000000 12fa00 1 ___U_______Ma__H_G_________________f_______1
>>>>>>> 700000001 12fa01 1ff ___________Ma___TG_________________f_______1
>>>>>>> 700000200 12f800 1 __________B________X_______________f______w_
>>>>>>> 700000201 12f801 1 ___________________X_______________f______w_ // memory failure hit this page
>>>>>>> 700000202 12f802 1fe __________B________X_______________f______w_
>>>>>>>
>>>>>>> The entries with both of "X" flag (hwpoison flag) and "w" flag (swap
>>>>>>> flag) are considered as hwpoison entries. So all pages in 2MB range
>>>>>>> are inaccessible from the process. We can get actual error location
>>>>>>> by page-types in physical address mode.
>>>>>>>
>>>>>>> $ ./page-types --no-summary --addr 0x12f800+0x200 --raw --list
>>>>>>> offset len flags
>>>>>>> 12f800 1 __________B_________________________________
>>>>>>> 12f801 1 ___________________X________________________
>>>>>>> 12f802 1fe __________B_________________________________
>>>>>>>
>>>>>>> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
>>>>>>> ---
>>>>>>> fs/proc/task_mmu.c | 41 ++++++++++++++++++++++++++++++++---------
>>>>>>> include/linux/swapops.h | 13 +++++++++++++
>>>>>>> tools/vm/page-types.c | 7 ++++++-
>>>>>>> 3 files changed, 51 insertions(+), 10 deletions(-)
>>>>>>
>>>>>>
>>>>>> Please also update the documentation located at
>>>>>>
>>>>>> Documentation/admin-guide/mm/pagemap.rst
>>>>>
>>>>> I will do this in the next post.
>>>>
>>>> Reading the document, I found that swap type is already exported so we
>>>> could identify hwpoison entry with it (without new PM_HWPOISON bit).
>>>> One problem is that the format of swap types (like SWP_HWPOISON) depends
>>>> on a few config macros like CONFIG_DEVICE_PRIVATE and CONFIG_MIGRATION,
>>>> so we also need to export how the swap type field is interpreted.
>>>
>>> I had similar question before.. though it was more on the generic swap entries
>>> not the special ones yet.
>>>
>>> The thing is I don't know how the userspace could interpret normal swap device
>>> indexes out of reading pagemap, say if we have two swap devices with "swapon
>>> -s" then I've no idea how do we know which device has which swap type index
>>> allocated. That seems to be a similar question asked above on special swap
>>> types - the interface seems to be incomplete, if not unused at all.
>>>
>>> AFAIU the information on "this page is swapped out to device X on offset Y" is
>>> not reliable too, because the pagein/pageout from kernel is transparent to the
>>> userspace and not under control of userspace at all. IOW, if the user reads
>>> that swap entry, then reads data upon the disk of that offset out and put it
>>> somewhere else, then it means the data read could already be old if kernel
>>> paged in the page after userspace reading the pagemap but before it reading the
>>> disk, and I don't see any way to make it right unless the userspace could stop
>>> the kernel from page-in a swap entry. That's why I really wonder whether we
>>> should expose normal swap entry at all, as I don't know how it could be helpful
>>> and used in the 100% right way.
>>
>> Thank you for the feedback.
>>
>> I think that a process interested in controlling swap-in/out behavior in its own
>> typically calls mincore() to get current status and madvise() to trigger swap-in/out.
>> That's not 100% solution for the same reason, but it mostly works well because
>> calling madvise(MADV_PAGEOUT) to already swapped out is not a big issue (although
>> some CPU/memory resource is wasted, but the amount of the waste is small if the
>> returned info is new enough).
>> So my point is that the concern around information newness might be more generic
>> issue rather than just for pagemap. If we need 100% accurate in-kernel info,
>> maybe it had better be done in kernel (or some cooler stuff like eBPF)?
>
> I fully agree the solution you mentioned with mincore() and madvise(), that is
> very sane and working approach. Though IMHO the major thing I wanted to point
> out is for generic swap devices we exposed (disk_index, disk_offset) tuple as
> the swap entry (besides "whether this page is swapped out or not"; that's
> PM_SWAP, and as you mentioned people'll need to rely on mincore() to make it
> right for shmem), though to use it we need to either record the index/offset or
> read/write data from it. However none of them will make sense, IMHO.. So I
> think exposing PM_SWAP makes sense, not the swap entries on swap devices.
>
>>
>>>
>>> Special swap entries seem a bit different - at least for is_pfn_swap_entry()
>>> typed swap entries we can still expose the PFN which might be helpful, which I
>>> can't tell.
>>
>> I'm one who think it helpful for testing, although I know testing might not be
>> considered as a real usecase.
>
> I think testing is valid use case too.
>
>>
>>>
>>> I used to send an email to Matt Mackall <mpm@xxxxxxxxxxx> and Dave Hansen
>>> <dave.hansen@xxxxxxxxxxxxxxx> asking about above but didn't get a reply. Ccing
>>> again this time with the list copied.
>>>
>>>>
>>>> I thought of adding new interfaces for example under /sys/kernel/mm/swap/type_format/,
>>>> which shows info like below (assuming that all CONFIG_{DEVICE_PRIVATE,MIGRATION,MEMORY_FAILURE}
>>>> is enabled):
>>>>
>>>> $ ls /sys/kernel/mm/swap/type_format/
>>>> hwpoison
>>>> migration_read
>>>> migration_write
>>>> device_write
>>>> device_read
>>>> device_exclusive_write
>>>> device_exclusive_read
>>>>
>>>> $ cat /sys/kernel/mm/swap/type_format/hwpoison
>>>> 25
>>>>
>>>> $ cat /sys/kernel/mm/swap/type_format/device_write
>>>> 28
>>>>
>>>> Does it make sense or any better approach?
>>>
>>> Then I'm wondering whether we care about the rest of the normal swap devices
>>> too with pagemap so do we need to expose some information there too (only if
>>> there's a real use case, though..)? Or... should we just don't expose swap
>>> entries at all, at least generic swap entries? We can still expose things like
>>> hwpoison via PM_* bits well defined in that case.
>>
>> I didn't think about normal swap devices for no reason. I'm OK to stop exposing
>> normal swap device part. I don't have strong option yet about which approach
>> (in swaptype or PM_HWPOISON) I'll suggest next (so wait a little more for feedback).
>
> No strong opinion here too. It's just that the new interface proposed reminded
> me that it's partially complete if considering we're also exposing swap entries
> on swap devices, so the types didn't cover those entries. However it's more
> like a pure question because I never figured out how those entries will work
> anyway. I'd be willing to know whether Dave Hanson would comment on this.
>
> While the PM_HWPOISON approach looks always sane to me.

I consider that somehow cleaner, because how HWPOISON entries are
implemented ("fake swap entries") is somewhat an internal implementation
detail.

(I also agree that PM_SWAP makes sense, but maybe really only when we're
actually dealing with something that has been/is currently being swapped
out. Maybe we should just not expose fake swap entries via PM_SWAP and
instead use proper PM_ types for that. PM_MIGRATION, PM_HWPOISON, ...)

--
Thanks,

David / dhildenb