Re: [RFC PATCH 0/7] support for mm-local memory allocations and use it

From: Fares Mehanna
Date: Tue Oct 08 2024 - 16:11:08 EST


> > Hi,
> >
> > Thanks for taking a look and apologies for my delayed response.
> >
> > > Having a VMA in user mappings for kernel memory seems weird to say the
> > > least.
> >
> > I see your point and agree with you. Let me explain the motivation, pros and
> > cons of the approach after answering your questions.
> >
> > > Core MM does not expect to have VMAs for kernel memory. What will happen if
> > > userspace ftruncates that VMA? Or registers it with userfaultfd?
> >
> > In the patch, I make sure the pages are faulted in, locked and sealed to make
> > sure the VMA is practically off-limits from the owner process. Only after that
> > I change the permissions to be used by the kernel.
>
> And what about VMA accesses from the kernel? How do you verify that
> everything that works with VMAs in the kernel can deal with that being a
> kernel mapping rather than userspace?

I set `VM_MIXEDMAP` when the secret allocation is intended for kernel use; this
marks the VMA as special and prevents a number of operations, such as VMA merging.
Using `VM_MIXEDMAP` for this may not be ideal, and we could introduce a new kernel
flag instead. But I'm not aware of a destructive VMA operation from the kernel side
while the VMA is marked special, mixed-map and sealed.
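
Concretely, what I have in mind is something along these lines (a rough sketch
only; the flag names are real, but the helper name and the exact call sequence
are my own illustration rather than the code in this series):

#include <linux/mm.h>

/*
 * Sketch: pin the VMA down before handing the pages over to the kernel.
 * The caller is assumed to hold the mmap write lock.
 */
static void secretmem_make_vma_kernel_only(struct vm_area_struct *vma)
{
	/* Special (non-mergeable), mlocked, never expanded, never dumped. */
	vm_flags_set(vma, VM_MIXEDMAP | VM_LOCKED | VM_DONTEXPAND | VM_DONTDUMP);

	/* Seal the VMA (as mseal() does) so a later munmap()/mprotect()/
	 * mremap() from userspace is rejected. */
	vm_flags_set(vma, VM_SEALED);
}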

> > > This approach seems much more reasonable and it's not that it was entirely
> > > arch-specific. There is some plumbing at arch level, but the allocator is
> > > anyway arch-independent.
> >
> > So I wanted to explore a simple solution to implement mm-local kernel secret
> > memory without much arch dependent code. I also wanted to reuse as much of
> > memfd_secret() as possible to benefit from what is done already and possible
> > future improvements to it.
>
> Adding functionality that normally belongs to userspace into mm/secretmem.c
> does not feel like a reuse, sorry.

Right, since the mapping lives in user virtual space, most of the operations do
belong to userspace. I thought this would be the easiest way to demonstrate the
approach for the RFC.

> The only thing you actually share is removal of the allocated pages from
> the direct map. And hijacking userspace mapping instead of properly
> implementing a kernel mapping does not seem like proper solution.

We also get:
1. The PGD is private when a new process is created.
2. Existing kernel-secret mappings of a given process are cloned on fork(), so we
don't need to track them ourselves in order to clone them on fork().
3. No special handling is needed for context switching.

> > Keeping the secret pages in user virtual addresses is easier as the page table
> > entries are not global by default so no special handling for spawn(). keeping
> > them tracked in VMA shouldn't require special handling for fork().
> >
> > The challenge was to keep the virtual addresses / VMA away from user control as
> > long as the kernel is using it, and signal the mm core that this VMA is special
> > so it is not merged with other VMAs.
> >
> > I believe locking the pages, sealing the VMA and prefaulting the pages should
> > put it practically out of userspace's influence.
> >
> > But the current approach has these downsides (that I can think of):
> > 1. Kernel secret user virtual addresses can still be used in functions accepting
> > user virtual addresses like copy_from_user / copy_to_user.
> > 2. Even if we are sure the VMA is off-limits to userspace, adding VMA with
> > kernel addresses will increase attack surface between userspace and the
> > kernel.
> > 3. Since kernel secret memory is mapped in user virtual addresses, it is very
> > easy to guess the exact virtual address (using binary search), and since
> > this functionality is designed to keep user data, it is fair to assume the
> > userspace will always be able to influence what is written there.
> > So it kind of breaks KASLR for those specific pages.
>
> There is even no need to guess, it will appear on /proc/pid/maps

Yeah, but that part is easily fixable; the other issue remains, though, unless I
allocate a bigger chunk of user virtual memory and move away from VMA tracking.

> > 4. It locks user virtual memory away, this may break some software if they
> > assumed they can mmap() into specific places.
> >
> > One way to address most of those concerns while keeping the solution almost arch
> > agnostic is to allocate a reasonable chunk of user virtual memory to be used only
> > for kernel secret memory, and not track it in VMAs.
> > This is similar to the old approach but instead of creating non-global kernel
> > PGD per arch it will use chunk of user virtual memory. This chunk can be defined
> > per arch, and this solution won't use memfd_secret().
> > We can then easily enlighten the kernel about this range so the kernel can test
> > for this range in functions like access_ok(). This approach however will make
> > downside #4 even worse, as it will reserve bigger chunk of user virtual memory
> > if this feature is enabled.
> >
> > I'm also very okay switching back to the old approach with the expense of:
> > 1. Supporting fewer architectures that can afford to give away single PGD.
>
> Only few architectures can modify their direct map, and all these can spare
> a PGD entry.
>
> > 2. More complicated arch specific code.
>
> On x86 similar code already exists for LDT, you may want to look at Andy's
> comments on old proclocal posting:
>
> https://lore.kernel.org/lkml/CALCETrXHbS9VXfZ80kOjiTrreM2EbapYeGp68mvJPbosUtorYA@xxxxxxxxxxxxxx/

Ah I see, so no need to worry about architectures that can't spare a PGD entry,
thanks! I read the discussion; LDT is x86-specific, and I wanted to start with
aarch64.

I'm still thinking about the best approach for aarch64 for my next PoC. aarch64
tracks two tables in TTBR0/TTBR1, so what I'm thinking of is:
1. Have a kernel page table per process, with all of its PGD entries shared except
a single PGD entry reserved for kernel secret allocations.
2. On fork(), traverse the private PGD part and clone the existing page table for
the new process.
3. On context switch, write the per-process table to TTBR1, so the kernel has
access to all the secret allocations of that process.

This moves away from user virtual addresses and VMA tracking, at the expense of
each architecture having to support it in its own way.
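
Very roughly, the fork() and context-switch parts could look like the sketch
below. This is only to illustrate the idea, not working code: mm_kernel_pgd()
and clone_secret_pgd_range() are made-up helpers, while write_sysreg(),
phys_to_ttbr(), virt_to_phys() and isb() are existing arm64 primitives.

#include <linux/mm.h>
#include <linux/string.h>
#include <asm/pgtable.h>
#include <asm/sysreg.h>

/*
 * fork(): copy the shared kernel PGD entries by value (they point to the
 * same shared lower-level tables) and deep-copy only the single private
 * entry that holds this process's secret mappings.
 */
static int copy_kernel_secret_tables(struct mm_struct *new, struct mm_struct *old)
{
	memcpy(mm_kernel_pgd(new), mm_kernel_pgd(old),
	       PTRS_PER_PGD * sizeof(pgd_t));		/* shared entries */
	return clone_secret_pgd_range(new, old);	/* private PGD slot */
}

/*
 * Context switch: point TTBR1_EL1 at the per-process kernel table so the
 * kernel sees this process's secret allocations.
 */
static void switch_kernel_table(struct mm_struct *next)
{
	write_sysreg(phys_to_ttbr(virt_to_phys(mm_kernel_pgd(next))), ttbr1_el1);
	isb();
}

The real thing would of course also need TLB/ASID maintenance around the TTBR1
update; the sketch only shows where the switch would sit.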

Does that sound more reasonable?

Thank you!
Fares.

> > Also @graf mentioned how aarch64 uses TTBR0/TTBR1 for user and kernel page
> > tables, I haven't looked at this yet but it probably means that kernel page
> > table will be tracked per process and TTBR1 will be switched during context
> > switching.
> >
> > What do you think? I would appreciate your opinion before working on the next
> > RFC patch set.
> >
> > Thanks!
> > Fares.


