Re: [PATCH v8 0/4] Introduce mseal

From: Jeff Xu
Date: Thu Feb 01 2024 - 22:15:33 EST


On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Jeff Xu <jeffxu@xxxxxxxxxxxx> [240131 20:27]:
> > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > <Liam.Howlett@xxxxxxxxxx> wrote:
> > >
>
> Having to opt-in to allowing mseal will probably not work well.
I'm leaving the opt-in discussion in Linus's thread.

> Initial library mappings happen in one huge chunk then it's cut up into
> smaller VMAs, at least that's what I see with my maple tree tracing. If
> you opt-in, then the entire library will have to opt-in and so the
> 'discourage inadvertent sealing' argument is not very strong.
>
Regarding "The initial library mappings happen in one huge chunk then
it is cut up into smaller VMAS", this is not a problem.

As example of elf loading (fs/binfmt_elf.c), there is just a few
places to pass in what type of memory to be allocated, e.g.
MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can add MAP_SEALABLE at those
places.
If glic does additional splitting on the memory range, by using
mprotect(), then the MAP_SEALABLE is automatically applied after
splitting.
If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).

> It also makes a somewhat messy tracking of inheritance of the attribute
> across splitting, MAP_FIXED replacement, vma_move, vma_copy. I think
> most of this is forced on the user?
>
The inheritance is the same as other VMA flags.

> It makes your call less flexible, it means you have to hope that the VMA
> origin was blessed before you decide you want to mseal it.
>
> What if you want to ensure the library mapped by a parent or on launch
> is mseal'ed?
>
> What about the initial relocated VMA (expand/shrink of VMA)?
>
> Creating something as "non-sealable" is pointless. If you don't want it
> sealed, then don't mseal() that region.
>
> If your use case doesn't need it, then can we please drop the opt-in
> behaviour and just have all VMAs treated the same?
>
> If it does need it, can you explain why?
>
> The glibc relocation/fixup will then work. glibc could mseal once it is
> complete - or an application could bypass glibc support and use the
> feature itself.

Yes. That is the idea.

>
> If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the
> heap/stack concerns. We can either let people shoot their own feet off
> or try to protect them.
>
> Right now, you seem to be trying to protect them. Keeping with that, I
> guess we could either get the kernel to mark those VMAs or tell some
> other way? I'd suggest a range, but people do very strange things with
> these special VMAs [1]. I don't think you can predict enough crazy
> actions to make a difference in trying to protect people.
>
> There are far fewer VMAs that should not be allowed to be mseal'ed than
> should be, and the kernel creates those so it seems logical to only let
> the kernel opt-out on those ones.
>
> I'd rather just let people shoot themselves and return an error.
>
> I also hope it reduces the complexity of this code while increasing the
> flexibility of the feature. As stated before, we remove the dependency
> of needing support from the initial loader.
>
> Merging VMAs
> I can see this going Very Bad with brk + mseal. But, again, if someone
> decides to mseal these VMAs then they should expect Bad Things to
> happen (or maybe they know what they are doing even in some complex
> situation?)
>
> vma_merge() can also expand a VMA. I think this is okay as it checks
> for the same flags, so you will allow VMA expansion of two (or three)
> vma areas to become one. Is this okay in your model?
>
> >
> > > I mean, you specifically state that this is a 'very specific
> > > requirement' in your cover letter. Does this mean even other browsers
> > > have no use for it?
> > >
> > No, I don’t mean “other browsers have no use for it”.
> >
> > About specific requirements from Chrome, that refers to "The lifetime
> > of those mappings are not tied to the lifetime of the process, which
> > is not the case of libc" as in the cover letter. This addition to the
> > cover letter was made in V3, thus, it might be beneficial to provide
> > additional context to help answer the question.
> >
> > This patch series begins with multiple-bit approaches (v1,v2,v3), the
> > rationale for this is that I am uncertain if Chrome's specific needs
> > are common enough for other use cases. Consequently, I am unable to
> > make this decision myself without input from the community. To
> > accommodate this, multiple bits are selected initially due to their
> > adaptability.
> >
> > Since V1, after hearing from the community, Chrome has changed its
> > design (no longer relying on separating out mprotect), and Linus
> > acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
> > today mseal() has a simple design that:
> > - meet Chrome's specific needs.
>
> How many VMAs will chrome have that are mseal'ed? Is this a common
> operation?
>
> PROT_SEAL seems like an extra flag we could drop. I don't expect we'll
> be sealing enough VMAs that a hand full of extra syscalls would make a
> difference?
>
> > - meet Libc's needs.
>
> What needs of libc are you referring to? I'm looking through the
> version changelog and I guess you mean return EPERM?
>
I meant libc's sealing RO part of the elf binary, those memory's
lifetime are associated with the lifetime of the process.

> > - Chrome's specific need doesn't interfere with Libc's.
> >
> > [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@xxxxxxxxxxxxxx/
>
> Linus said he'd be happier if we made the change in general.
>
> >
> > > I am very concerned this feature will land and have to be maintained by
> > > the core mm people for the one user it was specifically targeting.
> > >
> > See above. This feature is not specifically targeting Chrome.
> >
> > > Can we also get some benchmarking on the impact of this feature? I
> > > believe my answer in v7 removed the worst offender, but since there is
> > > no benchmarking we really are guessing (educated or not, hard data would
> > > help). We still have an extra loop in madvise, mprotect_pkey, mremap_to
> > > (and mreamp syscall?).
> > >
> > Yes. There is an extra loop in mmap(FIXED), munmap(),
> > madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
> > address range. I suspect the impact would be low, but having some hard
> > data would be good. I will see what I can find to assist the perf
> > testing. If you have a specific test suite in mind, I can also try it.
>
> You should look at mmtests [2]. But since you are adding loops across
> VMA ranges, you need to test loops across several ranges of VMAs. That
> is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or
> some subset of small and large numbers to get an idea of complexity we
> are adding. My hope is that the looping will be cache-hot in the maple
> tree and have minimum effect.
>
> In my personal testing, I've seen munmap often do a single VMA, or 3, or
> more rare 7 on x86_64. There should be some good starting points in
> mmtests for the common operations.
>
Thanks. Will do.


> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mmapstress/mmapstress03.c
> [2] https://github.com/gormanm/mmtests
>
> Thanks,
> Liam