Re: [RFC PATCH v1.9 12/14] livepatch: create per-task consistency model
From: Miroslav Benes
Date: Fri Apr 15 2016 - 05:18:16 EST
On Thu, 14 Apr 2016, Josh Poimboeuf wrote:
> On Thu, Apr 14, 2016 at 11:25:18AM +0200, Miroslav Benes wrote:
> > On Fri, 25 Mar 2016, Josh Poimboeuf wrote:
> >
> > > +/*
> > > + * klp_update_task_universe() - change the patched state of a task
> > > + * @task: The task to change
> > > + *
> > > + * Converts the patched state of the task so that it will switch to the set of
> > > + * functions in the goal universe.
> > > + */
> > > +static inline void klp_update_task_universe(struct task_struct *task)
> > > +{
> > > + /*
> > > + * The corresponding write barriers are in klp_init_transition() and
> > > + * klp_start_transition(). See the comments there for an explanation.
> > > + */
> > > + smp_rmb();
> > > +
> > > + task->klp_universe = klp_universe_goal;
> > > +}
> >
> > I wonder whether we should introduce a static_key for this function.
> > klp_update_task_universe() is called in some important code paths and it
> > is called everytime if CONFIG_LIVEPATCH is on. It does not matter much for
> > syscall paths, because we enter slowpath there only if TIF_KLP_NEED_UPDATE
> > is set and that is set iff patching is in progress. But we call the
> > function also elsewhere. So maybe it would be nice to introduce static_key
> > in the function which would be on if the patching is in progress and off
> > if not.
> >
> > Maybe it is just an overoptimization though.
>
> klp_update_task_universe() is called from:
>
> - exit_to_usermode_loop() (slow path for syscall/irq exit)
> - copy_process() (though I'll be changing this code so children inherit
> parents' flags)
> - init_idle()
> - cpu_idle_loop()
> - various places in livepatch code
>
> I think none of those are fast paths, so my feeling is that a static key
> would be overkill.
Agreed.
> > > +static bool klp_try_switch_task(struct task_struct *task)
> > > +{
> > > + struct rq *rq;
> > > + unsigned long flags;
> > > + int ret;
> > > + bool success = false;
> > > +
> > > + /* check if this task has already switched over */
> > > + if (task->klp_universe == klp_universe_goal)
> > > + return true;
> >
> > [...]
> >
> > > +/*
> > > + * This function can be called in the middle of an existing transition to
> > > + * reverse the direction of the universe goal. This can be done to effectively
> > > + * cancel an existing enable or disable operation if there are any tasks which
> > > + * are stuck in the starting universe.
> > > + */
> > > +void klp_reverse_transition(void)
> > > +{
> > > + struct klp_patch *patch = klp_transition_patch;
> > > +
> > > + klp_start_transition(!klp_universe_goal);
> > > + klp_try_complete_transition();
> > > +
> > > + patch->enabled = !patch->enabled;
> > > +}
> >
> > This function could be called iff the patching is in progress, there are
> > some not-yet-migrated tasks and we wait for scheduled workqueue to run
> > klp_try_complete_transition() again, right? I have two questions.
> >
> > 1. Is the call to klp_try_complete_transition() here necessary? It would
> > be called via workqueue, wouldn't it? I suspect we don't gain much with
> > this and we introduce possible future problems because of "double
> > success" of klp_try_complete_transition() (nothing serious there as of
> > now though). But I might be wrong because I got really confused by this
> > piece of code and its context :)
>
> Yeah, the call to klp_try_complete_transition() isn't necessary because
> of the workqueue. It's still kind of nice to have though, because it
> tries to transition immediately rather than waiting for the work (and we
> might end up deciding to change the delayed work frequency to be less
> often than once per second).
Yes, there are potential benefits with lower frequency.
> I can't really envision that future bugs related to "double success"
> would be much to worry about. The work function does check for
> klp_transition_patch beforehand, and if it didn't,
> klp_complete_transition() would die loudly with a NULL pointer
> dereference. Of course, it's always hard to anticipate future code
> changes and I could be completely wrong.
Ah, right. There's a check and we clear klp_transition_patch in
klp_try_complete_transition() (I always forget there is a call to
klp_complete_transition "hidden" there :)).
> So I would vote to keep this call, but I don't feel too strongly about
> it since it isn't strictly needed.
Yeah, let's leave it as it is because there is no real harm.
> > 2. klp_reverse_transition() works well because not-yet-migrated tasks are
> > still in KLP_UNIVERSE_OLD (without loss of generality) and since
> > klp_universe_goal is swapped they are in fact in their target universe.
> > klp_try_switch_task() returns immediately in such case. The rest is
> > migrated in an ordinary way.
> >
> > What about TIF_KLP_NEED_UPDATE (I know it is introduced in later patch but
> > it is easier to discuss it here)? klp_start_transition() resets the flag
> > for all the task. Shouldn't the flag be cleared for the task in
> > klp_try_switch_task() if task->klp_universe == klp_universe_goal? In fact
> > it does not matter if it is set. The patching success is determined by
> > klp_func->klp_universe booleans. But those tasks would needlessly go
> > through syscall slowpaths (see note above about static_key) which could
> > hurt performance.
> >
> > Does this make sense?
>
> I agree with you, but for a different reason :-)
>
> And actually there's another way this could happen: if a task is forked
> during klp_start_transition() after setting the universe goal but before
> setting the thread flags, it would already be transitioned but its flag
> would get unnecessarily set. (Though as I mentioned above, the
> copy_process() code will be changed anyway, so that should no longer be
> an issue).
>
> But I don't think it's really a performance issue, because:
>
> 1) Assuming a reliable stack, in most cases the majority of tasks are
> transitioned immediately. So when reversing the transition, there
> would be only a few tasks that would get the flag set unnecessarily.
Correct. Majority of tasks would get the flag cleared before they even get
a change to go through the syscall gate.
> 2) For those tasks which do have the flag set unnecessarily, only the
> next syscall will do the slow path. After that the flag gets
> cleared. So it only slows down a single syscall for each task. At a
> macro level, that's just a drop in the bucket for performance and,
> IMO, not worth worrying about and adding (potentially buggy) code
> for.
Makes sense.
> However, I can think of another reason setting the flag unnecessarily
> might be bad: it's just conceptually confusing and could maybe lead to
> bugs. (ok, here we go trying to predict the future again!)
>
> Specifically, after klp_complete_transition(), a reasonable livepatch
> developer (is that an oxymoron? ;-)) might assume that all tasks'
> TIF_KLP_NEED_UPDATE flags are cleared and might write code which makes
> that assumption.
>
> And, even with today's code it's kind of hard to convince myself that
> having a spurious TIF_KLP_NEED_UPDATE flag set from a previous patching
> operation won't cause issues on the next patch.
>
> So maybe we should do as you suggest and clear the flag in
> klp_try_switch_task() if the task is already in the goal universe.
>
> Or if we wanted to be paranoid we could just clear all the flags in
> klp_complete_transition(), but maybe that's a little too sloppy/lazy?
I'll leave it up to you, because I am not able to decide that. Both
options seem good to me. The former is somewhat more appealing, the latter
is a safe bet.
Thanks,
Miroslav