Re: [PATCH v8 00/13] fold per-CPU vmstats remotely

From: Michal Hocko
Date: Wed May 24 2023 - 08:52:11 EST


[Sorry for the late response, but I was at conferences for the last two
weeks and am now catching up]

On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
[...]
> v8
> - Add summary of discussion on -v7 to cover letter

Thanks, this is very useful! It helps to frame the further discussion.

I believe the most important question to answer is in fact this:
> I think what needs to be done is to avoid new queue_work_on()
> users from being introduced in the tree (the number of
> existing ones is finite and can therefore be fixed).
>
> Agree with the criticism here, however, i can't see other
> options than the following:
>
> 1) Given an activity, which contains a sequence of instructions
> to execute on a CPU, to change the algorithm
> to execute that code remotely (therefore avoid interrupting a CPU),
> or to avoid the interruption somehow (which must be dealt with
> on a case-by-case basis).
>
> 2) To block that activity from happening in the first place,
> for the sites where it can be blocked (that return errors to
> userspace, for example).
>
> 3) Completely isolate the CPU from the kernel (off-line it).

I agree that a reliable CPU isolation implementation needs to address
the queue_work_on problem. And it has to do that _reliably_. This cannot
be achieved by an endless whack-a-mole, chasing each new instance. There
must be a more systematic approach. One way would be to change the
semantics of schedule_work_on and fail the call for an isolated CPU. The
caller would then have a way to fall back and handle the operation by
other means. E.g. vmstat could simply skip folding the pcp data because
the imprecision shouldn't really matter. Other callers might choose to
do the operation remotely. This is a lot of work, no doubt about that,
but it is a long-term maintainable solution that doesn't give you new
surprises with every newly released kernel. There are likely other
remote interfaces that would need to follow that scheme.
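
To illustrate the idea, here is a rough userspace sketch, not kernel
code. Everything in it is hypothetical: try_schedule_work_on(),
cpu_isolated[], struct fold_work and fold_all_cpus() are made-up names,
the "work" runs synchronously instead of being queued, and the choice of
-EBUSY as the error is just an assumption.

```c
#include <errno.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical stand-in for the kernel's notion of isolated CPUs. */
bool cpu_isolated[NR_CPUS];

/* Simplified work item; the real struct work_struct looks different. */
struct work {
	void (*fn)(struct work *w);
};

/*
 * Hypothetical variant of schedule_work_on(): instead of queueing work
 * on an isolated CPU, it fails and lets the caller decide what to do.
 */
int try_schedule_work_on(int cpu, struct work *w)
{
	if (cpu < 0 || cpu >= NR_CPUS)
		return -EINVAL;
	if (cpu_isolated[cpu])
		return -EBUSY;	/* assumed error code */
	w->fn(w);		/* simulation only: run in place of queueing */
	return 0;
}

/* Counts how many times each CPU's (pretend) pcp data was folded. */
int folded[NR_CPUS];

struct fold_work {
	struct work work;	/* must stay the first member, see fold_fn() */
	int cpu;
};

void fold_fn(struct work *w)
{
	/* Valid because 'work' is the first member of struct fold_work. */
	struct fold_work *fw = (struct fold_work *)w;

	folded[fw->cpu]++;
}

/*
 * vmstat-like caller: an isolated CPU is simply skipped, accepting the
 * imprecision. Returns the number of CPUs that were skipped.
 */
int fold_all_cpus(void)
{
	int cpu, skipped = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		struct fold_work fw = { .work = { .fn = fold_fn }, .cpu = cpu };

		if (try_schedule_work_on(cpu, &fw.work) == -EBUSY)
			skipped++;
	}
	return skipped;
}
```

A caller that cannot tolerate skipping would handle the error by doing
the operation remotely instead; the point is only that the failure is
explicit at the call site rather than an interruption of the isolated
CPU.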

If CPU isolation is not considered worth that time investment then I do
not think it is worth degrading the highly optimized vmstat code
either. These stats are updated from many hot paths and the per-cpu
implementation has been optimized for that case. If your workload finds
the resulting disturbance unacceptable then you already have a precedent
in quiet_vmstat, so find a way to use it for your workload instead.

--
Michal Hocko
SUSE Labs