Re: [PATCH v4 1/1] exec: seal system mappings

From: enh
Date: Thu Feb 06 2025 - 10:52:16 EST


On Thu, Feb 6, 2025 at 10:28 AM Thomas Weißschuh
<thomas.weissschuh@xxxxxxxxxxxxx> wrote:
>
> On Thu, Feb 06, 2025 at 09:38:59AM -0500, enh wrote:
> > On Thu, Feb 6, 2025 at 8:20 AM Thomas Weißschuh
> > <thomas.weissschuh@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 02:35:18PM -0500, enh wrote:
> > > > On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@xxxxxxxxxxxx> wrote:
> > >
> > > <snip>
> > >
> > > > > There are technical difficulties to seal vdso/vvar from the glibc
> > > > > side. The dynamic linker lacks vdso/vvar mapping size information, and
> > > > > architectural variations for vdso/vvar also mean that sealing from the
> > > > > kernel side is a simpler solution. Adhemerval has more details in case
> > > > > clarification is needed from the glibc side.
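> > > > >
> > > > > As a rough, untested sketch of the gap (not what glibc does today):
> > > > > ld.so can learn the vdso start address from the aux vector, but
> > > > > mseal() also needs a length, and that is the missing piece:
> > > > >
> > > > >   #include <sys/auxv.h>
> > > > >   #include <sys/syscall.h>
> > > > >   #include <unistd.h>
> > > > >
> > > > >   static int seal_vdso(size_t guessed_len)
> > > > >   {
> > > > >           /* The start is easy: the kernel passes it in the aux vector. */
> > > > >           void *vdso = (void *)getauxval(AT_SYSINFO_EHDR);
> > > > >
> > > > >           if (!vdso)
> > > > >                   return -1;
> > > > >           /* The length is not; guessed_len is an assumption here.
> > > > >            * __NR_mseal needs Linux 6.10+ headers. */
> > > > >           return syscall(__NR_mseal, vdso, guessed_len, 0UL);
> > > > >   }
> > > > >
> > > > > The vvar mapping is even harder, as it has no aux vector entry at all.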
> > > >
> > > > as a maintainer of a different linux libc, i've long wanted a "tell me
> > > > everything there is to know about this vma" syscall rather than having
> > > > to parse /proc/maps...
> > > >
> > > > ...but in this special case, is the vdso/vvar size ever anything other
> > > > than "one page" in practice?
> > >
> > > x86 has two additional vvar pages for virtual clocks.
> > > (Since v6.13 even split into their own mapping)
> > > LoongArch has per-CPU vvar data which is larger than one page.
> > > The vdso mapping is as many pages as the code ends up compiling to;
> > > for example, on my current x86_64 distro kernel it's two pages.
> > > In the near future, probably v6.14, vvars will be split over multiple
> > > pages in general [0].
> >
> > /me checks the nearest arm64 phone ... yeah, vdso is still only one
> > page there but vvars is already more than one.
>
> Probably due to CONFIG_TIME_NS, see below.
>
> > is there a TL;DR (or RTFM link) for why this is so big? a quick look
> > at the x86 suggests there should only be 640 bytes of various things
> > plus a handful of bytes for the rng, and while arm64 looks very
> > different, that looks like it's explicitly asking for a page (with the
> > vdso_data_store stuff)? (i've never had any reason to look at vvars
> > before, only vdso.)
>
> I don't think there is any real manual.
>
> The vvar data is *shared* between the kernel and userspace.
> This is done by mapping the *same* physical memory into the kernel
> ("vdso_data_store") and (read-only) into all userspace processes.
> As PTEs always cover a full page and the kernel cannot expose random
> other internal kernel data to userspace, the vvars need to be in their
> own dedicated page.
> (The same is true for the vDSO code, the uprobe trampoline and similar
> mappings.)
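>
> As a toy illustration of the page-granularity point (nothing
> vdso-specific, numbers are made up):
>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   int main(void)
>   {
>           long page = sysconf(_SC_PAGESIZE);
>           long data = 640;   /* hypothetical amount of shared vvar data */
>           long mapped = (data + page - 1) / page * page;
>
>           printf("%ld bytes of data occupy %ld bytes of mapping\n",
>                  data, mapped);
>           return 0;
>   }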
>
> The vDSO functions also need to be aware of time namespaces. This is
> implemented by allocating one page per namespace and mapping this
> in place of the regular vvar page. But the vDSO still needs to access
> the regular vvar page for some information, so both are mapped.
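>
> For reference, a minimal untested sketch of a time namespace in action
> (needs CAP_SYS_ADMIN, see time_namespaces(7)); this per-namespace clock
> offset is what the extra vvar time page implements:
>
>   #define _GNU_SOURCE
>   #include <fcntl.h>
>   #include <sched.h>
>   #include <stdio.h>
>   #include <time.h>
>   #include <unistd.h>
>   #include <sys/wait.h>
>
>   #ifndef CLONE_NEWTIME
>   #define CLONE_NEWTIME 0x00000080
>   #endif
>
>   int main(void)
>   {
>           struct timespec ts;
>
>           if (unshare(CLONE_NEWTIME))
>                   return 1;
>           /* Offsets may only be written before a process joins the ns. */
>           int fd = open("/proc/self/timens_offsets", O_WRONLY);
>           if (fd < 0)
>                   return 1;
>           dprintf(fd, "monotonic 86400 0\n");
>           close(fd);
>
>           if (fork() == 0) {      /* the child joins the new namespace */
>                   clock_gettime(CLOCK_MONOTONIC, &ts);
>                   printf("child CLOCK_MONOTONIC: %ld\n", (long)ts.tv_sec);
>                   return 0;
>           }
>           wait(NULL);
>           return 0;
>   }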

ah, i see. yeah, that makes sense. (amusingly, i almost quipped "it's
not like there are _that_ many clocks to go in there" in my previous
mail, forgetting that there are effectively an unbounded number of
clocks thanks to this feature!)

> Then on top come the rng state and some architecture-specific data.
> These are currently part of the time page. So they also have to dance
> around the time namespace mapping shenanigans. In addition they have to
> coexist with the actual time data, which is currently done by manually
> calculating byte offsets for them in the time page and hardcoding those.
>
> The linked series cleans this up by moving things into dedicated pages,
> to make the code easier to understand and to make it possible to
> add new data to the time page without running out of space or
> introducing conflicts which need to be detected manually.
> While this needs to allocate more pages, these are shared between the
> whole system, so effectively it's cheap. It also requires more virtual
> memory space in each process, but that shouldn't matter.
>
>
> As for arm64 looking very different from x86: Hopefully not for long :-)

(even as someone who doesn't work on the kernel, things like this are
always helpful --- just having one thing to understand/your first grep
being relevant is much nicer than "oh, wait ... which architecture was
that?".)

> > > Figuring out the start and size from /proc/maps, or the new
> > > PROCMAP_QUERY ioctl, is not trivial, due to architectural variations.
> >
> > (obviously it's unsatisfying as a general interface, but in practice
> > the VMAs i see asked about directly -- rather than just rounded
> > up in a diagnostic dump -- are either stacks ["what are the bounds of
> > this stack, and does it have guard pages already?"] or code ["what
> > file was the code at this pc mapped in from?"]. so while the vdso
> > would come up, we'd never notice if vvars didn't work. if your sp/pc
> > point there, we were already just going to bail anyway :-) )
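> >
> > (the "which file is this pc from" case is basically just dladdr();
> > rough sketch, needs -ldl on older glibc:)
> >
> >   #define _GNU_SOURCE
> >   #include <dlfcn.h>
> >   #include <stdio.h>
> >
> >   /* Rough sketch: report which mapping a code address came from. */
> >   static void whereis(void *pc)
> >   {
> >           Dl_info info;
> >
> >           if (dladdr(pc, &info) && info.dli_fname)
> >                   printf("%p is from %s (loaded at %p)\n",
> >                          pc, info.dli_fname, info.dli_fbase);
> >           else
> >                   printf("%p: unknown mapping\n", pc);
> >   }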
>
> Fair enough.
>
> This information was also a response to Jeff's parent mail,
> as it would be relevant when sealing the mappings from ld.so.
>
> <snip>
>
> > > [0] https://lore.kernel.org/lkml/20250204-vdso-store-rng-v3-0-13a4669dfc8c@xxxxxxxxxxxxx/