Re: Very slow unlockall()
From: Michal Hocko
Date: Wed Feb 10 2021 - 11:59:03 EST
On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> On 2/1/21 8:19 PM, Milan Broz wrote:
> > On 01/02/2021 19:55, Vlastimil Babka wrote:
> >> On 2/1/21 7:00 PM, Milan Broz wrote:
> >>> On 01/02/2021 14:08, Vlastimil Babka wrote:
> >>>> On 1/8/21 3:39 PM, Milan Broz wrote:
> >>>>> On 08/01/2021 14:41, Michal Hocko wrote:
> >>>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
> >>>>>>> and someone tried to use it with a hardened memory allocator library.
> >>>>>>>
> >>>>>>> Execution time increased to an extreme (minutes) and, as we found, the problem
> >>>>>>> is in munlockall().
> >>>>>>>
> >>>>>>> Here is a plain reproducer for the core without any external code - unlocking
> >>>>>>> takes more than 30 seconds on a Fedora rawhide kernel!
> >>>>>>> I can reproduce it on 5.10 kernels and Linus' git.
> >>>>>>>
> >>>>>>> The reproducer below tries to mmap a large amount of memory with PROT_NONE (never used later).
> >>>>>>> The real code of course does something more useful but the problem is the same.
> >>>>>>>
> >>>>>>> #include <stdio.h>
> >>>>>>> #include <stdlib.h>
> >>>>>>> #include <fcntl.h>
> >>>>>>> #include <sys/mman.h>
> >>>>>>>
> >>>>>>> int main (int argc, char *argv[])
> >>>>>>> {
> >>>>>>> void *p = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >>
> >> So, this is 2TB memory area, but PROT_NONE means it's never actually populated,
> >> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
> >> PROT_WRITE there, the mlockall() starts taking ages.
> >>
> >> So does that reflect your use case? munlockall() with large PROT_NONE areas? If
> >> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
> >> expect such scenario to be uncommon, so better clarify first.
> >
> > It is just a simple reproducer of the underlying problem, as suggested here
> > https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
> >
> > We use mlockall() in cryptsetup and with hardened malloc it slows down unlock significantly.
> > (For the real case problem please read the whole issue report above.)
>
> OK, finally read through the bug report, and learned two things:
>
> 1) the PROT_NONE is indeed intentional part of the reproducer
> 2) Linux mailing lists still have a bad reputation and people avoid them. That's
> sad :( Well, thanks for overcoming that :)
>
> Daniel there says "I think the Linux kernel implementation of mlockall is quite
> broken and tries to lock all the reserved PROT_NONE regions in advance which
> doesn't make any sense."
>
> From my testing this doesn't seem to be the case, as the mlockall() part is very
> fast, so I don't think it faults in and mlocks PROT_NONE areas. It only starts
> to be slow when changed to PROT_READ|PROT_WRITE. But the munlockall() part is
> slow even with PROT_NONE as we don't skip the PROT_NONE areas there. We probably
> can't just skip them, as they might actually contain mlocked pages if those were
> faulted first with PROT_READ/PROT_WRITE and only then changed to PROT_NONE.

Mlock code is quite easy to misunderstand but IIRC the mlock part
should be rather straightforward. It will mark VMAs as locked, do some
merging/splitting where appropriate and finally populate the range by
gup. This should fail because the VMA allows neither read nor write,
right? And mlock should report that. mlockall will not bother because it
ignores errors on population. So there is no page table walk
happening.

> And the munlock (munlock_vma_pages_range()) is slow, because it uses
> follow_page_mask() in a loop incrementing addresses by PAGE_SIZE, so that's
> always traversing all levels of page tables from scratch. Funnily enough,
> speeding this up was my first linux-mm series years ago. But the speedup only
> works if pte's are present, which is not the case for unpopulated PROT_NONE
> areas. That use case was unexpected back then. We should probably convert this
> code to a proper page table walk. If there are large areas with unpopulated pmd
> entries (or even higher levels) we would traverse them very quickly.

Yes, this is a good idea. I suspect it will be a little bit tricky without
duplicating a large part of the gup page table walker.

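To put rough numbers on the difference, here is a small arithmetic sketch. It assumes the x86-64 4-level layout with 4 KiB pages (one PMD entry covers 2 MiB, one PUD entry 1 GiB, one top-level entry 512 GiB) and a range aligned to each span. The macro names and walk_steps() are invented for the example and are not kernel identifiers.

```c
#include <stdint.h>

/* Assumed x86-64 4-level layout: bytes covered by one entry per level. */
#define PTE_SPAN (UINT64_C(1) << 12)	/* one 4 KiB page */
#define PMD_SPAN (UINT64_C(1) << 21)	/* 2 MiB */
#define PUD_SPAN (UINT64_C(1) << 30)	/* 1 GiB */
#define PGD_SPAN (UINT64_C(1) << 39)	/* 512 GiB */

/*
 * Number of steps needed to cover len bytes when each step advances
 * by span bytes, rounded up. span must be nonzero.
 */
static uint64_t walk_steps(uint64_t len, uint64_t span)
{
	return (len + span - 1) / span;
}
```

For the 1UL << 41 mapping from the reproducer, walk_steps() gives 536870912 steps when advancing by PAGE_SIZE, 1048576 if empty pmd entries can be skipped, 2048 at the pud level and 4 at the top level. Each PAGE_SIZE step in the current loop also repeats the full descent from the top, so the real gap is larger than the step counts alone suggest.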
--
Michal Hocko
SUSE Labs