Re: [PATCH RESEND v15 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

From: Peter Xu
Date: Wed May 24 2023 - 09:56:49 EST


On Wed, May 24, 2023 at 04:26:33PM +0500, Muhammad Usama Anjum wrote:
> On 5/24/23 12:43 AM, Peter Xu wrote:
> > Hi, Muhammad,
> >
> > On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote:
> >> On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote:
> >>> On 4/26/23 7:13 PM, Peter Xu wrote:
> >>>> Hi, Muhammad,
> >>>>
> >>>> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
> >>>>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
> >>>>>> +/* Supported flags */
> >>>>>> +#define PM_SCAN_OP_GET (1 << 0)
> >>>>>> +#define PM_SCAN_OP_WP (1 << 1)
> >>>>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
> >>>>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
> >>>>> be specified as need. But PM_SCAN_OP_WP cannot be specified without
> >>>>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
> >>>>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
> >>>>>
> >>>>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
> >>>>> vs
> >>>>> 2) UFFDIO_WRITEPROTECT
> >>>>>
> >>>>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
> >>>>> getting really good performance which is comparable just like we are
> >>>>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
> >>>>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
> >>>>> performance and behavior wise.
> >>>>>
> >>>>> I've got the results from someone else that UFFDIO_WRITEPROTECT block
> >>>>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this
> >>>>> as I don't have tests comparing them one-to-one.
> >>>>>
> >>>>> What are your thoughts about it? Have you thought about making
> >>>>> UFFDIO_WRITEPROTECT perform better?
> >>>>>
> >>>>> I'm sorry to mention the word "performance" here. Actually we want better
> >>>>> performance to emulate Windows syscall. That is why we are adding this
> >>>>> functionality. So either we need to see what can be improved in
> >>>>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
> >>>>> pagemap_ioctl?
> >>>>
> >>>> I'm fine if you want to add it back if it works for you. Though before
> >>>> that, could you remind me why there can be a difference on performance?
> >>> I've looked at the code again and I think I've found something. Lets look
> >>> at exact performance numbers:
> >>>
> >>> I've run 2 different tests. In first test UFFDIO_WRITEPROTECT is being used
> >>> for engaging WP. In second test PM_SCAN_OP_WP is being used. I've measured
> >>> the average write time to the same memory which is being WP-ed and total
> >>> time of execution of these APIs:
> >
> > What is the steps of the test? Is it as simple as "writeprotect",
> > "unprotect", then write all pages in a single thread?
> >
> > Is UFFDIO_WRITEPROTECT sent in one range covering all pages?
> >
> > Maybe you can attach the test program here too.
>
> I'd not attached the test earlier as I thought that you wouldn't be
> interested in running the test. I've attached it now. The test has multiple

Thanks. No plan to run it, just to make sure I understand why such a
difference.

> threads where one thread tries to get status of flags and reset them, while
> other threads write to that memory. In main(), we call the pagemap_scan
> ioctl to get status of flags and reset the memory area as well. While in N
> threads, the memory is written.
>
> I usually run the test by following where memory area is of 100000 * pages:
> ./win2_linux 8 100000 1 1 0
>
> I'm running tests on real hardware. The results are pretty consistent. I'm
> also testing only on x86_64. PM_SCAN_OP_WP wins every time as compared to
> UFFDIO_WRITEPROTECT.

If it's multi-threaded test especially when the ioctl runs together with
the writers, then I'd assume it's caused by writers frequently need to
flush tlb (when writes during UFFDIO_WRITEPROTECT), the flush target could
potentially also include the core running the main thread who is also
trying to reprotect because they run on the same mm.

This makes me think that your current test case probably is the worst case
of Nadav's patch 6ce64428d6 because (1) the UFFDIO_WRITEPROTECT covers a
super large range, and (2) there're a _lot_ of concurrent writers during
the ioctl, so all of them will need to trigger a tlb flush, and that tlb
flush will further slow down the ioctl sender.

While I think that's the optimal case sometimes, of having minimum tlb
flush on the ioctl(UFFDIO_WRITEPROTECT), so maybe it makes sense somewhere
else where concurrent writers are not that much. I'll need to rethink a bit
on all these to find out whether we can have a good way for both..

For now, if your workload is mostly exactly like your test case, maybe you
can have your pagemap version of WP-only op there, making sure tlb flush is
within the pgtable lock critical section (so you should be safe even
without Nadav's patch). If so, I'd appreciate you can add some comment
somewhere about such difference of using pagemap WP-only and
ioctl(UFFDIO_WRITEPROTECT), though. In short, functional-wise they should
be the same, but trivial detail difference on performance as TBD (maybe one
day we can have a good approach for all and make them aligned again, but
maybe that also doesn't need to block your work).

--
Peter Xu