Re: extending ucontext (Re: [PATCH v26 25/30] x86/cet/shstk: Handle signals for shadow stack)

From: H. Peter Anvin
Date: Mon May 03 2021 - 02:05:10 EST


<vedvyas.shanbhogue@xxxxxxxxx>,Dave Martin <Dave.Martin@xxxxxxx>,Weijiang Yang <weijiang.yang@xxxxxxxxx>,Pengfei Xu <pengfei.xu@xxxxxxxxx>,Haitao Huang <haitao.huang@xxxxxxxxx>
From: "H. Peter Anvin" <hpa@xxxxxxxxx>
Message-ID: <2360408E-9967-469C-96AF-E8AC40044021@xxxxxxxxx>

There are a few words at the end of the FXSAVE area for software use for this reason: so we can extend the signal frame – originally to enable saving the XSAVE state but that doesn't use up the whole software available area.

On May 2, 2021 4:23:49 PM PDT, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>On Fri, Apr 30, 2021 at 10:47 AM Andy Lutomirski <luto@xxxxxxxxxx>
>wrote:
>>
>> On Fri, Apr 30, 2021 at 10:00 AM Yu, Yu-cheng <yu-cheng.yu@xxxxxxxxx>
>wrote:
>> >
>> > On 4/28/2021 4:03 PM, Andy Lutomirski wrote:
>> > > On Tue, Apr 27, 2021 at 1:44 PM Yu-cheng Yu
><yu-cheng.yu@xxxxxxxxx> wrote:
>> > >>
>> > >> When shadow stack is enabled, a task's shadow stack states must
>be saved
>> > >> along with the signal context and later restored in sigreturn.
>However,
>> > >> currently there is no systematic facility for extending a signal
>context.
>> > >> There is some space left in the ucontext, but changing ucontext
>is likely
>> > >> to create compatibility issues and there is not enough space for
>further
>> > >> extensions.
>> > >>
>> > >> Introduce a signal context extension struct 'sc_ext', which is
>used to save
>> > >> shadow stack restore token address. The extension is located
>above the fpu
>> > >> states, plus alignment. The struct can be extended (such as the
>ibt's
>> > >> wait_endbr status to be introduced later), and sc_ext.total_size
>field
>> > >> keeps track of total size.
>> > >
>> > > I still don't like this.
>> > >
>> > > Here's how the signal layout works, for better or for worse:
>> > >
>> > > The kernel has:
>> > >
>> > > struct rt_sigframe {
>> > > char __user *pretcode;
>> > > struct ucontext uc;
>> > > struct siginfo info;
>> > > /* fp state follows here */
>> > > };
>> > >
>> > > This is roughly the actual signal frame. But userspace does not
>have
>> > > this struct declared, and user code does not know the sizes of
>the
>> > > fields. So it's accessed in a nonsensical way. The signal
>handler
>> > > function is passed a pointer to the whole sigframe implicitly in
>RSP,
>> > > a pointer to &frame->info in RSI, anda pointer to &frame->uc in
>RDX.
>> > > User code can *find* the fp state by following a pointer from
>> > > mcontext, which is, in turn, found via uc:
>> > >
>> > > struct ucontext {
>> > > unsigned long uc_flags;
>> > > struct ucontext *uc_link;
>> > > stack_t uc_stack;
>> > > struct sigcontext uc_mcontext; <-- fp pointer is in here
>> > > sigset_t uc_sigmask; /* mask last for extensibility
>*/
>> > > };
>> > >
>> > > The kernel, in sigreturn, works a bit differently. The sigreturn
>> > > variants know the base address of the frame but don't have the
>benefit
>> > > of receiving pointers to the fields. So instead the kernel takes
>> > > advantage of the fact that it knows the offset to uc and parses
>uc
>> > > accordingly. And the kernel follows the pointer in mcontext to
>find
>> > > the fp state. The latter bit is quite important later. The
>kernel
>> > > does not parse info at all.
>> > >
>> > > The fp state is its own mess. When XSAVE happened, Intel kindly
>(?)
>> > > gave us a software defined area between the "legacy" x87 region
>and
>> > > the modern supposedly extensible part. Linux sticks the
>following
>> > > structure in that hole:
>> > >
>> > > struct _fpx_sw_bytes {
>> > > /*
>> > > * If set to FP_XSTATE_MAGIC1 then this is an xstate
>context.
>> > > * 0 if a legacy frame.
>> > > */
>> > > __u32 magic1;
>> > >
>> > > /*
>> > > * Total size of the fpstate area:
>> > > *
>> > > * - if magic1 == 0 then it's sizeof(struct _fpstate)
>> > > * - if magic1 == FP_XSTATE_MAGIC1 then it's sizeof(struct
>_xstate)
>> > > * plus extensions (if any)
>> > > */
>> > > __u32 extended_size;
>> > >
>> > > /*
>> > > * Feature bit mask (including FP/SSE/extended state) that
>is present
>> > > * in the memory layout:
>> > > */
>> > > __u64 xfeatures;
>> > >
>> > > /*
>> > > * Actual XSAVE state size, based on the xfeatures saved in
>the layout.
>> > > * 'extended_size' is greater than 'xstate_size':
>> > > */
>> > > __u32 xstate_size;
>> > >
>> > > /* For future use: */
>> > > __u32 padding[7];
>> > > };
>> > >
>> > >
>> > > That's where we are right now upstream. The kernel has a parser
>for
>> > > the FPU state that is bugs piled upon bugs and is going to have
>to be
>> > > rewritten sometime soon. On top of all this, we have two
>upcoming
>> > > features, both of which require different kinds of extensions:
>> > >
>> > > 1. AVX-512. (Yeah, you thought this story was over a few years
>ago,
>> > > but no. And AMX makes it worse.) To make a long story short, we
>> > > promised user code many years ago that a signal frame fit in 2048
>> > > bytes with some room to spare. With AVX-512 this is false. With
>AMX
>> > > it's so wrong it's not even funny. The only way out of the mess
>> > > anyone has come up with involves making the length of the FPU
>state
>> > > vary depending on which features are INIT, i.e. making it more
>compact
>> > > than "compact" mode is. This has a side effect: it's no longer
>> > > possible to modify the state in place, because enabling a feature
>with
>> > > no space allocated will make the structure bigger, and the stack
>won't
>> > > have room. Fortunately, one can relocate the entire FPU state,
>update
>> > > the pointer in mcontext, and the kernel will happily follow the
>> > > pointer. So new code on a new kernel using a super-compact state
>> > > could expand the state by allocating new memory (on the heap?
>very
>> > > awkwardly on the stack?) and changing the pointer. For all we
>know,
>> > > some code already fiddles with the pointer. This is great,
>except
>> > > that your patch sticks more data at the end of the FPU block that
>no
>> > > one is expecting, and your sigreturn code follows that pointer,
>and
>> > > will read off into lala land.
>> > >
>> >
>> > Then, what about we don't do that at all. Is it possible from now
>on we
>> > don't stick more data at the end, and take the relocating-fpu
>approach?
>> >
>> > > 2. CET. CET wants us to find a few more bytes somewhere, and
>those
>> > > bytes logically belong in ucontext, and here we are.
>> > >
>> >
>> > Fortunately, we can spare CET the need of ucontext extension. When
>the
>> > kernel handles sigreturn, the user-mode shadow stack pointer is
>right at
>> > the restore token. There is no need to put that in ucontext.
>>
>> That seems entirely reasonable. This might also avoid needing to
>> teach CRIU about CET at all.
>
>Wait, what's the actual shadow stack token format? And is the token
>on the new stack or the old stack when sigaltstack is in use? For
>that matter, is there any support for an alternate shadow stack for
>signals?
>
>--Andy

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.