Re: [PATCH v3 00/13] "Task_isolation" mode

From: Alex Belits
Date: Thu Apr 09 2020 - 11:10:52 EST



On Sat, 2020-03-07 at 19:42 -0800, Alex Belits wrote:
> This is the updated version of task isolation patchset.
>
> 1. Commit messages updated to match changes.
> 2. Sign-off lines restored from original patches, changes listed
> wherever applicable.
> 3. arm platform -- added missing calls to syscall check and cleanup
> procedure after leaving isolation.
> 4. x86 platform -- added missing calls to cleanup procedure after
> leaving isolation.
>

Another update, addressing CPU state / race conditions.

I believe I have a usable solution for both problems: missed events
and race conditions on isolation entry and exit.

The idea is to make sure that a CPU core remains in userspace and runs
userspace code regardless of what is happening in kernel and userspace
in the rest of the system. However, any event that results in running
anything other than userspace code will cause that CPU core to
re-synchronize with the rest of the system. Then any kernel code,
with the exception of a small and clearly defined set of routines that
only perform kernel entry / exit themselves, will run on that CPU only
after all synchronization is done.

This does require an answer to possible races between isolation entry
/ exit (including isolation breaking on interrupts) and updates that
are normally carried by IPIs. So the solution should involve some
mechanism that limits what runs on a CPU in its "stale" state, and
causes inevitable synchronization before the rest of the kernel is
called. This should also cover preemption -- if preemption happens in
that "stale" state after entering the kernel but before
synchronization is completed, it should still go through
synchronization before running the rest of the kernel.

Then, as long as it can be demonstrated that routines running in
"stale" state can safely run in it, and that any event that would
normally require an IPI results in entering the rest of the kernel
only after synchronization, races would cease to be a problem. Any
sequence of events would result in exactly the same CPU state when
hitting the rest of the kernel as if the CPU had processed the update
event through an IPI.
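
To make that invariant concrete, here is a small userspace model of
it. This is only a sketch and not code from the patchset: every name
in it (struct cpu_model, publish_update(), kernel_entry() and so on)
is made up for illustration, and a single generation counter stands
in for whatever state an IPI would normally update.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one global "generation" stands in for any state
 * that is normally propagated to CPUs through IPIs. */
static unsigned long global_gen;

struct cpu_model {
	bool isolated;            /* running an isolated userspace task */
	unsigned long local_gen;  /* what this CPU last synchronized to */
};

/* An update that would normally be delivered by IPI.  Isolated CPUs
 * are skipped, which leaves them in the "stale" state. */
void publish_update(struct cpu_model *cpus, int n)
{
	global_gen++;
	for (int i = 0; i < n; i++)
		if (!cpus[i].isolated)
			cpus[i].local_gen = global_gen; /* "IPI handler" */
}

/* The small, clearly defined part of kernel entry that may run stale.
 * It must not depend on local_gen being current, so it ignores it. */
void entry_stub(struct cpu_model *cpu)
{
	(void)cpu;
}

/* The inevitable synchronization: break isolation and catch up. */
void sync_on_entry(struct cpu_model *cpu)
{
	cpu->isolated = false;
	cpu->local_gen = global_gen;
}

/* Everything else in the kernel: must never observe stale state. */
unsigned long rest_of_kernel(const struct cpu_model *cpu)
{
	assert(cpu->local_gen == global_gen);
	return cpu->local_gen;
}

/* Any kernel entry (syscall, interrupt, preemption) takes this path. */
unsigned long kernel_entry(struct cpu_model *cpu)
{
	entry_stub(cpu);
	sync_on_entry(cpu);
	return rest_of_kernel(cpu);
}
```

In this model a CPU that skipped any number of updates while isolated
reaches rest_of_kernel() with the same local_gen as a CPU that
received every one of them by "IPI". The model is single-threaded, so
it says nothing about ordering between CPUs.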

I was under the impression that this is already the case. However, on
closer look it appears that some barriers must be in place to make
sure that the sequence of events is enforced.
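
As an illustration of the kind of ordering I mean, here is a userspace
sketch using C11 atomics. It is not the code from the patches, and the
names (struct iso_cpu, remote_post_update(), local_leave_isolation())
are hypothetical. It assumes a scheme where a remote CPU records a
pending update instead of sending an IPI to an isolated CPU, and the
isolated CPU checks for pending updates when it leaves isolation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU flags; these are not names from the patchset. */
struct iso_cpu {
	atomic_bool isolated;        /* set while the task runs isolated */
	atomic_bool update_pending;  /* posted by a remote CPU, no IPI   */
};

/* Remote side: post the update first, then look at the isolation
 * flag.  Returns true if the caller still has to send an IPI. */
bool remote_post_update(struct iso_cpu *cpu)
{
	atomic_store_explicit(&cpu->update_pending, true,
			      memory_order_relaxed);
	/* Order the store above before the load below. */
	atomic_thread_fence(memory_order_seq_cst);
	return !atomic_load_explicit(&cpu->isolated,
				     memory_order_relaxed);
}

/* Local side, early on kernel entry: leave isolation first, then look
 * for updates posted while IPIs were skipped.  Returns true if
 * resynchronization is needed before the rest of the kernel runs. */
bool local_leave_isolation(struct iso_cpu *cpu)
{
	atomic_store_explicit(&cpu->isolated, false,
			      memory_order_relaxed);
	/* Order the store above before the read below. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_exchange_explicit(&cpu->update_pending, false,
					memory_order_relaxed);
}
```

Each side stores its own flag, issues a full fence, and only then
reads the other side's flag. With both fences in place, at least one
side observes the other's store: either the remote CPU sees that
isolation has ended and sends the IPI, or the local CPU sees the
pending update and synchronizes. Without the fences both could read
the old values, and the update would be lost.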

There is obviously a question of performance -- we don't want to cause
any additional flushes or add locking in anything
time-critical. Fortunately, entering and exiting isolation (as opposed
to events that _potentially_ can call isolation-breaking routines) is
never performance-critical: it's what starts and ends a task that has
no performance-critical communication with the kernel. So if a CPU
that has an isolated task on it is running kernel code, it means that
either the task is not isolated yet (we are exiting to userspace), or
it is no longer running anything performance-critical (intentionally
on exit from isolation, or unintentionally on an isolation-breaking
event).

Isolation state is read-mostly, and we would prefer RCU for that if we
can guarantee that "stale" state remains safe in all code that runs
until synchronization happens. I am not sure of that, so I tried to
make something more straightforward. However, I might be wrong, and
RCU-ifying exit from isolation may be a better way to do it.

For now I want to make sure that there is some clearly defined small
amount of kernel code that runs before the inevitable synchronization,
and that code is unaffected by "stale" state.

I have tried to track down all call paths from kernel entry points
to the call of fast_task_isolation_cpu_cleanup(), and will post those
separately. It's possible that all architecture-specific code already
follows some clearly defined rules about this for other reasons.
However, I am not that familiar with all of it, so I tried to check
whether the existing implementation is always safe for running in
"stale" state before everything that makes task isolation call its
cleanup. For now, this is the implementation that assumes that "stale"
state is safe for kernel entry.