Re: [PATCH 2/6] selftests/resctrl: Ensure measurements skip initialization of default benchmark

From: Ilpo Järvinen
Date: Mon Sep 09 2024 - 08:53:18 EST


On Fri, 6 Sep 2024, Reinette Chatre wrote:
> On 9/6/24 3:00 AM, Ilpo Järvinen wrote:
> > On Thu, 5 Sep 2024, Reinette Chatre wrote:
> > > On 9/5/24 5:10 AM, Ilpo Järvinen wrote:
> > > > On Wed, 4 Sep 2024, Reinette Chatre wrote:
> > > > > On 9/4/24 4:57 AM, Ilpo Järvinen wrote:
> > > > > > On Fri, 30 Aug 2024, Reinette Chatre wrote:
> > > > > > > On 8/30/24 3:56 AM, Ilpo Järvinen wrote:
> > > > > > > > On Thu, 29 Aug 2024, Reinette Chatre wrote:
> > > >
> > > > > > > > > @@ -699,111 +639,80 @@ int resctrl_val(const struct resctrl_test *test,
> > > > > > > > >  		return ret;
> > > > > > > > >  	}
> > > > > > > > > -	/*
> > > > > > > > > -	 * If benchmark wasn't successfully started by child, then child should
> > > > > > > > > -	 * kill parent, so save parent's pid
> > > > > > > > > -	 */
> > > > > > > > >  	ppid = getpid();
> > > > > > > > > -	if (pipe(pipefd)) {
> > > > > > > > > -		ksft_perror("Unable to create pipe");
> > > > > > > > > +	/* Taskset test to specified CPU. */
> > > > > > > > > +	ret = taskset_benchmark(ppid, uparams->cpu, &old_affinity);
> > > > > > > >
> > > > > > > > Previously only CPU affinity for bm_pid was set but now it's set
> > > > > > > > before fork(). Quickly checking the Internet, it seems that CPU
> > > > > > > > affinity gets inherited on fork() so now both processes will have
> > > > > > > > the same affinity which might make the other process interfere
> > > > > > > > with the measurement.
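
(Side note: the inheritance is easy to confirm with a standalone sketch
like the one below; illustrative only, not part of the selftests:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          cpu_set_t set;

          CPU_ZERO(&set);
          CPU_SET(1, &set);               /* pin parent to CPU 1 */
          if (sched_setaffinity(0, sizeof(set), &set))
                  return 1;

          if (fork() == 0) {
                  CPU_ZERO(&set);
                  sched_getaffinity(0, sizeof(set), &set);
                  /* prints 1: the child inherited the parent's mask */
                  printf("child pinned to CPU 1: %d\n", CPU_ISSET(1, &set));
                  _exit(0);
          }
          wait(NULL);
          return 0;
  }

so unless the parent re-pins itself or sleeps over the measurement, the
two processes do compete for the same CPU.)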
> > > > > > >
> > > > > > > Setting the affinity is intended to ensure that the buffer
> > > > > > > preparation occurs in the same topology as where the runtime
> > > > > > > portion will run. This preparation is done before the work to be
> > > > > > > measured starts.
> > > > > > >
> > > > > > > This does tie in with the association with the resctrl group and I
> > > > > > > will elaborate more below ...
> > > > > >
> > > > > > Okay, that's useful to retain but thinking this further, now we're
> > > > > > also going to do a non-trivial amount of work in between the setup
> > > > > > and the test by
> > > > >
> > > > > Could you please elaborate how the amount of work during setup can
> > > > > be an issue? I have been focused on the measurements that are done
> > > > > afterwards that do have clear boundaries from what I can tell.
> > > >
> > > > Well, you used it as a justification: "Setting the affinity is
> > > > intended to ensure that the buffer preparation occurs in the same
> > > > topology as where the runtime portion will run." So I assumed you had
> > > > some expectations about "preparations" done outside of those "clear
> > > > boundaries" but now you seem to take the entirely opposite stance?
> > >
> > > I do not follow you here. With the "clear boundaries" I meant the
> > > measurements and associated variables that have a clear start/init and
> > > stop/read while the other task sleeps. Yes, preparations are done outside
> > > of this since that should not be measured.
> >
> > Of course the preparations are not measured (at least not after this
> > patch :-)).
> >
> > But that's not what I meant. You said the preparations are to be done
> > using the same topology as the test but if it's outside of the measurement
> > period, the topology during preparations only matters if you make some
> > hidden assumption that some state carries from preparations to the actual
> > test. If no state carry-over is assumed, the topology during preparations
> > is not important. So which way is it? Is the topology during
> > preparations important or not?
>
> Topology during preparations is important.
>
> In the CMT test this is more relevant with the transition to using
> memflush = false. The preparation phase and measure phase use the same
> cache alloc configuration and, just as importantly, the same monitoring
> configuration. During preparation the cache portion that will be used
> during measurement will be filled with the contents of the buffer.
> During measurement it will be the same cache portion into which any new
> reads will be allocated and measured. In fact, the preparation phase
> will thus form part of the measurement. If preparation was done with a
> different configuration, then I see a problem with memflush = true as
> well as with memflush = false. In the case of memflush = true it will
> have the familiar issue of the test needing to "guess" the workload
> settle time. In the case of memflush = false the buffer will remain
> allocated into the cache portion used during the preparation phase;
> when the workload runs it will read the data from a cache portion that
> does not belong to it, and since it does not need to allocate into its
> own cache portion its LLC occupancy counts will not be accurate (the
> LLC occupancy associated with the buffer will be attributed to the
> prepare portion).
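
(To check that I follow, the ordering being described is essentially the
below; a simplified sketch with all the resctrl plumbing left out, and
the buffer assumed to be sized to the allocated cache portion:

  #include <stdlib.h>
  #include <string.h>

  static volatile char sink;

  int main(void)
  {
          size_t span = 2 * 1024 * 1024;  /* assume: sized to the portion */
          char *buf = malloc(span);

          if (!buf)
                  return 1;

          /* Prepare: with memflush = false these lines stay resident in
           * the test's own cache portion, so they already count toward
           * its LLC occupancy. */
          memset(buf, 0x5a, span);

          /* <-- sample LLC occupancy here (measurement start) */

          /* Measure: reads hit lines already resident in the same
           * portion instead of allocating into a foreign one. */
          for (size_t i = 0; i < span; i += 64)
                  sink = buf[i];

          /* <-- sample LLC occupancy here (measurement end) */
          free(buf);
          return 0;
  }

If the prepare step ran under a different configuration, the measure
loop would hit lines charged to someone else, which matches the
inaccuracy you describe.)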
>
> I am not familiar with the details of memory allocation but as I
> understand it, Linux attempts to satisfy memory requests from the node
> to which the requesting CPU is assigned. For the MBM and MBA tests I
> thus believe it is important to allocate the memory from where it will
> be used. I have encountered systems where CPU0 and CPU1 are on
> different sockets and by default the workload is set to run on CPU1.
> If the preparation phase runs on CPU0 then memory could be allocated
> from a different NUMA node than where the workload will be running. By
> doing preparation within the same topology as where the workload will
> be running I believe that memory will be allocated appropriately for
> the workload and thus result in more reliable measurements.
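
(That matches my understanding of the first-touch policy: pages are
faulted in from the node backing the CPU that first touches them, so the
pinning has to happen before the buffer is populated. A hypothetical
helper, just to illustrate the ordering; alloc_local_buf() is not the
selftest API:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical helper: pin first, then allocate and touch, so the
   * pages land on the node local to the given CPU. */
  static char *alloc_local_buf(int cpu, size_t span)
  {
          cpu_set_t set;
          char *buf;

          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          if (sched_setaffinity(0, sizeof(set), &set))
                  return NULL;

          buf = malloc(span);
          if (buf)
                  memset(buf, 0, span);   /* first touch faults pages in
                                           * on the now-local node */
          return buf;
  }

  int main(void)
  {
          char *buf = alloc_local_buf(1, 1 << 20);

          return buf ? 0 : 1;
  }

Doing the memset() before the sched_setaffinity() call would defeat the
purpose even if the reads later run on the right CPU.)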
>
> >
> > You used the topology argument to justify why both parent and child are
> > now in the same resctrl group unlike before when only the actual test was.
> >
> > > You stated "now we're also going to do a non-trivial amount of work
> > > in between the setup and the test" ... could you clarify what the
> > > problem is with this? Before this work the "non-trivial amount of
> > > work" (for "fill_buf") was done as part of the test with (wrong)
> > > guesses about how long it takes. This work aims to improve this.
> >
> > I understand what you're trying to do with this change.
> >
> > However, I was trying to say that previously the preparations occurred
> > right before the test, which is no longer the case because there's a
> > fork() in between.
>
> If by "test" you mean the measurement phase then in the case of "fill_buf"
> preparations only now reliably occur before the measurement. Original behavior
> is maintained with user provided benchmark.
>
> >
> > Does that extra work impact the state carry-over from preparations to the
> > test?
>
> It is not clear to me what extra work or state you are referring to.
>
> >
> > > > fork(), quite a heavy operation as it has to copy various things
> > > > including the address space, which you just made contain a huge mem
> > > > blob. :-)
> > >
> > > As highlighted in a comment found in the patch, the kernel uses
> > > copy-on-write and all tests only read from the buffer after fork().
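
(Agreed for the read-only case; the copy-on-write behavior shows up in
the child's minor fault count. A rough sketch, with approximate numbers
since THP can merge pages:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/resource.h>
  #include <sys/wait.h>
  #include <unistd.h>

  #define SPAN (64 * 1024 * 1024UL)

  static long minflt(void)
  {
          struct rusage ru;

          getrusage(RUSAGE_SELF, &ru);
          return ru.ru_minflt;
  }

  int main(void)
  {
          char *buf = malloc(SPAN);
          volatile char sink;

          if (!buf)
                  return 1;
          memset(buf, 1, SPAN);           /* parent faults all pages in */
          if (fork() == 0) {
                  long base = minflt();

                  for (size_t i = 0; i < SPAN; i += 4096)
                          sink = buf[i];
                  /* reads stay on shared pages: ~0 new faults */
                  printf("after reads:  %ld\n", minflt() - base);
                  memset(buf, 2, SPAN);
                  /* each written page is copied: ~SPAN/4096 faults */
                  printf("after writes: %ld\n", minflt() - base);
                  _exit(0);
          }
          wait(NULL);
          free(buf);
          return 0;
  }

which is also why a write test through fill_buf would end up copying the
whole blob in the child.)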
> >
> > A write test is possible using "-b fill_buf ..." as mentioned in
> > comments to the other patch.
>
> Yes, it is theoretically possible, but looking closer it is not
> supported. Note how measure_mem_bw() is always called with hardcoded
> "reads". Reading through the history of commits I do not believe
> modifying fill_buf parameters was ever intended to be a use case for
> the "-b" parameter. It seems intended to provide an alternate
> benchmark to fill_buf.
>
> I am not interested in adding support for the write test because I do
> not see how it helps us to improve resctrl subsystem health.
> Considering that nobody has ever complained about the write test being
> broken, I think all that dead code should just be removed instead.
>
> I prefer the focus be on the health of the resctrl subsystem instead
> of additional hardware performance testing. I do not think it is
> energy well spent to further tune these performance tests unless it
> benefits resctrl subsystem health. These performance tests are
> difficult to maintain and I have not yet seen how they benefit the
> resctrl subsystem.
>
> > > > BTW, perhaps we could use some lighter-weight fork variant in the
> > > > resctrl selftests; the processes don't really need to be that
> > > > separated to justify using full fork() (and custom benchmarks will
> > > > do execvp()).
> > >
> > > Are you thinking about pthreads? It is not clear to me that this is
> > > necessary. It is also not clear to me what problem you are describing that
> > > needs this solution. When I understand that better it will be easier to
> > > discuss solutions.
> >
> > I was trying to say I see an advantage in retaining the address space,
> > which is not possible with fork(), so perhaps using clone() with
> > CLONE_VM would be useful, and maybe some other flags too. I think this
> > would solve the write test case.
>
> clone() brings with it the complexity of needing to manage the child's
> stack. This again aims for a performance improvement. What does this fix?
> What is the benefit to resctrl health? I would prefer that our energy
> instead be focused on improving resctrl subsystem health.
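
(For reference, this is the kind of stack management I had in mind; a
rough sketch only, and I agree it is extra complexity:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  #define STACK_SIZE (1024 * 1024)

  static int bench(void *arg)
  {
          /* the benchmark would run here, sharing the parent's address
           * space (and thus the prepared buffer) */
          return 0;
  }

  int main(void)
  {
          /* the caller must allocate and size the child stack itself;
           * real code might prefer mmap() with a guard page */
          char *stack = malloc(STACK_SIZE);
          pid_t pid;

          if (!stack)
                  return 1;
          /* the stack grows down on most architectures: pass the top */
          pid = clone(bench, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
          if (pid < 0)
                  return 1;
          waitpid(pid, NULL, 0);
          free(stack);
          return 0;
  }
)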

Fair enough.

--
i.