Re: [patch 0/4] extensible prctl task isolation interface and vmstat sync (v2)

From: Marcelo Tosatti
Date: Tue Aug 10 2021 - 15:10:58 EST


On Tue, Aug 10, 2021 at 06:40:48PM +0200, Thomas Gleixner wrote:
> Marcelo,
>
> On Fri, Jul 30 2021 at 17:18, Marcelo Tosatti wrote:
>
> can you pretty please:
>
> 1) Add a version number to your patch series right where it belongs:
>
> [patch V2 N/M]
>
> Just having a (v2) at the end of the subject line of 0/M is sloppy
> at best.
>
> 2) Provide a lore link to the previous version
>
> Thanks,
>
> tglx

Thomas,

Sure, will resend -v3 once done with the following:

1) Adding support for KVM.

2) Adding a tool called "chisol" to util-linux, similar
to chrt/taskset, to prctl+exec (for unmodified applications).

This raises the question whether or not to add an option to preserve
the task parameters across fork (i think the answer is yes).

--

But the following points are unclear to me (in quotes are earlier
comments you made):

1) "It's about silencing different and largely independent parts of the OS
on a particular CPU. Just defining upfront that there is only the choice
of all or nothing _is_ policy.

There is a very wide range of use case scenarios out there and just
because the ones which you care about needs X does not mean that X is
the right thing for everybody else. You still can have X and let other
people define their own set of things they want to be protected
against.

Aside of that having it selectively is a plus for debugability, testing
etc."

So for the ability to individually select what parts of the OS
on a particular CPU are quiesced, there is:

+ defmask = defmask | ISOL_F_QUIESCE_VMSTATS;
+
+ ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
+ 0, 0);
+ if (ret == -1) {
+ perror("prctl PR_ISOL_SET");
+ return EXIT_FAILURE;
+ }

However there is a feeling that implementation details are being exposed
to userspace... However that seems to be alright: what could happen is that
the feature ceases to exist (say vmstat sync), in kernel, and the bit
is kept for compability (but the kernel does nothing about it).

That of course means whatever "vmstat sync" replacement comes up, it should
avoid IPIs as well.

Any thoughts on this?

2) "Again: I fundamentaly disagree with the proposed task isolation patches
approach as they leave no choice at all.

There is a reasonable middle ground where an application is willing to
pay the price (delay) until the reqested quiescing has taken place in
order to run undisturbed (hint: cache ...) and also is willing to take
the addtional overhead of an occacional syscall in the slow path without
tripping some OS imposed isolation safe guard.

Aside of that such a granular approach does not necessarily require the
application to be aware of it. If the admin knows the computational
pattern of the application, e.g.

1 read_data_set() <- involving syscalls/OS obviously
2 compute_set() <- let me alone
3 save_data_set() <- involving syscalls/OS obviously

repeat the above...

then it's at his discretion to decide to inflict a particular isolation
set on the task which is obviously ineffective while doing #1 and #3 but
might provide the so desired 0.9% boost for compute_set() which
dominates the judgement.

That's what we need to think about and once we figured out how to do
that it gives Marcelo the mechanism to solve his 'run virt undisturbed
by vmstat or whatever' problem and it allows Alex to build his stuff on
it.

Summary: The problem to be solved cannot be restricted to

self_defined_important_task(OWN_WORLD);

Policy is not a binary on/off problem. It's manifold across all levels
of the stack and only a kernel problem when it comes down to the last
line of defence.

Up to the point where the kernel puts the line of last defence, policy
is defined by the user/admin via mechanims provided by the kernel.

Emphasis on "mechanims provided by the kernel", aka. user API.

Just in case, I hope that I don't have to explain what level of scrunity
and thought this requires."

OK, so perhaps a handful of use-cases can clarify whether the proposed
interface requires changes?

The example on samples/task_isolation/ is focused on "enter task isolation
and very rarely exit".

There are two other cases i am aware of:

A) Christoph's use-case:

1) Enter task-isolation.
2) Latency sensitive loop begins.
3) Some event interrupts latency sensitive section.

4) Handling of the event requires N syscalls, which the programmer
would be interested in happening without quiescing at
every return to system call. The current scheme would be:


/*
* Application can either set the value from ISOL_F_QUIESCE_DEFMASK,
* which is configurable through
* /sys/kernel/task_isolation/default_quiesce_activities,
* or specific values.
*
* Using ISOL_F_QUIESCE_DEFMASK allows for the application to
* take advantage of future quiescing capabilities without
* modification (provided default_quiesce_activities is
* configured accordingly).
*/
defmask = defmask | ISOL_F_QUIESCE_VMSTATS;

ret = prctl(PR_ISOL_SET, ISOL_F_QUIESCE, defmask,
0, 0);
if (ret == -1) {
perror("prctl PR_ISOL_SET");
return EXIT_FAILURE;
}

lat_loop:
ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);
if (ret == -1) {
perror("prctl PR_ISOL_CTRL_SET (ISOL_F_QUIESCE)");
return EXIT_FAILURE;
}

latency sensitive loop

if (event == 1) {
/* disables quiescing of all features, while maintaining
* other features such as logging and avoidance of
* interruptions enabled.
*/
ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);
syscall1
syscall2
...
syscallN
/* reenter isolated mode with quiescing */
goto lat_loop;
}
...

Should it be possible to modify individual quiescing parts individually
while maintaining isolated mode? Yes, that seems to be desired.


The other use-case (from you) seems to be:

1 read_data_set() <- involving syscalls/OS obviously
2 compute_set() <- let me alone
3 save_data_set() <- involving syscalls/OS obviously

repeat the above...

Well, the implementation of Christoph's use above seems not
to be that bad as well:

1 read_data_set() <- involving syscalls/OS obviously
/* disables quiescing of all (or some, if desired)
* features, while maintaining other features such
* as logging and avoidance of interruptions enabled.
*/
ret = prctl(PR_ISOL_CTRL_SET, ISOL_F_QUIESCE, 0, 0, 0);

2 compute_set() <- let me alone

ret = prctl(PR_ISOL_CTRL_SET, 0, 0, 0, 0);

3 save_data_set() <- involving syscalls/OS obviously

repeat the above...

What kind of different behaviour, other than enabling/disabling
quiescing, would be desired in this use-case?