Re: [PATCH v4 1/1] exec: seal system mappings
From: Thomas Weißschuh
Date: Thu Feb 06 2025 - 10:28:55 EST
On Thu, Feb 06, 2025 at 09:38:59AM -0500, enh wrote:
> On Thu, Feb 6, 2025 at 8:20 AM Thomas Weißschuh
> <thomas.weissschuh@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Jan 17, 2025 at 02:35:18PM -0500, enh wrote:
> > > On Fri, Jan 17, 2025 at 1:20 PM Jeff Xu <jeffxu@xxxxxxxxxxxx> wrote:
> >
> > <snip>
> >
> > > > There are technical difficulties to seal vdso/vvar from the glibc
> > > > side. The dynamic linker lacks vdso/vvar mapping size information, and
> > > architectural variations for vdso/vvar also mean that sealing from the
> > > > kernel side is a simpler solution. Adhemerval has more details in case
> > > > clarification is needed from the glibc side.
> > >
> > > as a maintainer of a different linux libc, i've long wanted a "tell me
> > > everything there is to know about this vma" syscall rather than having
> > > to parse /proc/maps...
> > >
> > > ...but in this special case, is the vdso/vvar size ever anything other
> > > than "one page" in practice?
> >
> > x86 has two additional vvar pages for virtual clocks.
> > (Since v6.13 even split into their own mapping)
> > LoongArch has per-cpu vvar data, which is larger than one page.
> > The vdso mapping is as many pages as the code ends up compiling to;
> > for example, on my current x86_64 distro kernel it's two pages.
> > In the near future, probably v6.14, vvars will be split over multiple
> > pages in general [0].
>
> /me checks the nearest arm64 phone ... yeah, vdso is still only one
> page there but vvars is already more than one.
Probably due to CONFIG_TIME_NS, see below.
> is there a TL;DR (or RTFM link) for why this is so big? a quick look
> at the x86 suggests there should only be 640 bytes of various things
> plus a handful of bytes for the rng, and while arm64 looks very
> different, that looks like it's explicitly asking for a page (with the
> vdso_data_store stuff)? (i've never had any reason to look at vvars
> before, only vdso.)
I don't think there is any real manual.
The vvar data is *shared* between the kernel and userspace.
This is done by mapping the *same* physical memory into the kernel
("vdso_data_store") and (read-only) into all userspace processes.
As PTEs always cover a full page and the kernel cannot expose unrelated
internal kernel data to userspace, the vvars need to be in their own
dedicated page.
(The same is true for the vDSO code, uprobe trampoline, and similar mappings.)
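(As an aside: userspace can at least find the *start* of the vDSO mapping
without touching /proc/self/maps, via the AT_SYSINFO_EHDR auxv entry; a
minimal sketch below. There is no equivalent entry for the vvar pages, and
auxv says nothing about sizes, which is part of why doing this from ld.so
is awkward.)

	#include <elf.h>
	#include <stdio.h>
	#include <sys/auxv.h>

	int main(void)
	{
		/* AT_SYSINFO_EHDR points at the ELF header of the [vdso] mapping */
		unsigned long vdso = getauxval(AT_SYSINFO_EHDR);

		if (vdso)
			printf("vDSO ELF header at %#lx\n", vdso);
		else
			fprintf(stderr, "no vDSO reported via auxv\n");
		return 0;
	}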
The vDSO functions also need to be aware of time namespaces. This is
implemented by allocating one page per namespace and mapping this
in place of the regular vvar page. But the vDSO still needs to access
the regular vvar page for some information, so both are mapped.
Then on top of that come the rng state and some architecture-specific
data. These are currently part of the time page, so they also have to
dance around the time namespace mapping shenanigans. In addition, they
have to coexist with the actual time data, which is currently done by
manually calculating byte offsets for them within the time page and
hardcoding those.
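Very roughly, and with made-up names and offsets (this is an illustration,
not the real layout), that "hardcoded offsets" approach amounts to:

	/* Illustration only: hypothetical names and offsets */
	#define VVAR_TIME_DATA_OFFSET	0x000	/* time data at the start of the page */
	#define VVAR_RNG_DATA_OFFSET	0x700	/* rng state squeezed in behind it */
	#define VVAR_ARCH_DATA_OFFSET	0x780	/* arch-specific data after the rng */

Every new consumer has to pick a free range by hand and take care not to
collide with the others or overflow the shared page.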
The linked series cleans this up by moving things into dedicated pages.
That makes the code easier to understand and makes it possible to add
new data without running out of space or introducing conflicts that
need to be detected manually.
While this needs to allocate more pages, they are shared across the
whole system, so effectively it's cheap. It also requires more virtual
memory space in each process, but that shouldn't matter.
As for arm64 looking very different from x86: Hopefully not for long :-)
> > Figuring out the start and size from /proc/maps, or the new
> > PROCMAP_QUERY ioctl, is not trivial, due to architectural variations.
>
> (obviously it's unsatisfying as a general interface, but in practice
> the VMAs i see asked about directly -- rather than just rounded
> up in a diagnostic dump -- are either stacks ["what are the bounds of
> this stack, and does it have guard pages already?"] or code ["what
> file was the code at this pc mapped in from?"]. so while the vdso
> would come up, we'd never notice if vvars didn't work. if your sp/pc
> point there, we were already just going to bail anyway :-) )
Fair enough.
This information was also a response to Jeff's parent mail,
as it would be relevant when sealing the mappings from ld.so.
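For reference, a rough sketch of what that would look like from userspace
today: scan /proc/self/maps for the vdso/vvar lines and seal whatever ranges
turn up. This assumes a kernel and headers that provide mseal(2)/__NR_mseal,
and it is only an illustration, not what glibc actually does:

	/* Rough sketch: find the [vdso]/[vvar] mappings and seal them. */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	static long seal_range(unsigned long start, unsigned long end)
	{
		/* mseal(addr, len, flags); flags must currently be 0 */
		return syscall(__NR_mseal, (void *)start, end - start, 0UL);
	}

	int main(void)
	{
		FILE *maps = fopen("/proc/self/maps", "r");
		char line[512];

		if (!maps)
			return 1;

		while (fgets(line, sizeof(line), maps)) {
			unsigned long start, end;

			/* "[vvar" also matches the split-out "[vvar_vclock]" mapping */
			if (!strstr(line, "[vdso]") && !strstr(line, "[vvar"))
				continue;
			if (sscanf(line, "%lx-%lx", &start, &end) != 2)
				continue;
			if (seal_range(start, end))
				perror("mseal");
		}
		fclose(maps);
		return 0;
	}

Sealing these mappings in the kernel, as the patch does, makes exactly this
kind of guesswork unnecessary.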
<snip>
> > [0] https://lore.kernel.org/lkml/20250204-vdso-store-rng-v3-0-13a4669dfc8c@xxxxxxxxxxxxx/