Re: [mm] 2d146aa3aa: vm-scalability.throughput -36.4% regression
From: Feng Tang
Date: Thu Sep 02 2021 - 09:39:34 EST
On Thu, Sep 02, 2021 at 12:53:06PM +0200, Michal Koutný wrote:
> Hi.
>
> On Thu, Sep 02, 2021 at 11:46:28AM +0800, Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > > Narrowing it down to a single prefetcher seems good enough to me. The
> > > behavior of the prefetchers is fairly complicated and hard to predict, so I
> > > doubt you'll ever get a 100% step by step explanation.
>
> My layman explanation with the available information is that the
> prefetcher somehow behaves as if it marked the offending cacheline as
> modified (even though reading only) therefore slowing down the remote reader.
But this can't explain the test result that adding 128 bytes before
css->cgroup restores/improves the performance.
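To make that experiment concrete, here is a minimal sketch of what the
extra 128 bytes do to the layout. The structs are hypothetical stand-ins
(not the real struct cgroup_subsys_state), and 64-byte cache lines are
assumed:

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64	/* assumption: 64-byte cache lines */

/* Hypothetical stand-ins, NOT the real struct cgroup_subsys_state. */
struct css_demo {
	void *cgroup;
	void *ss;
};

struct css_demo_padded {
	char pad[128];		/* the 128 bytes added before 'cgroup' */
	void *cgroup;
	void *ss;
};

/* Cache line index of a byte offset, counted from the start of the struct. */
size_t line_of(size_t offset)
{
	return offset / LINE_SIZE;
}

/* The padding moves 'cgroup' from line 0 to line 2 of the struct. */
void check_layout(void)
{
	assert(line_of(offsetof(struct css_demo, cgroup)) == 0);
	assert(line_of(offsetof(struct css_demo_padded, cgroup)) == 2);
}
```

So the field is shifted by exactly two cache lines, which changes which
data shares or neighbours its line. It says nothing about why the
prefetcher reacts to that.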
> On Thu, Sep 02, 2021 at 09:35:58AM +0800, Feng Tang <feng.tang@xxxxxxxxx> wrote:
> > @@ -139,10 +139,21 @@ struct cgroup_subsys_state {
> > /* PI: the cgroup that this css is attached to */
> > struct cgroup *cgroup;
> >
> > + struct cgroup_subsys_state *parent;
> > +
> > /* PI: the cgroup subsystem that this css is attached to */
> > struct cgroup_subsys *ss;
>
> Hm, an interesting move; be mindful of commit b8b1a2e5eca6 ("cgroup:
> move cgroup_subsys_state parent field for cache locality"). It might be
> a regression for systems with cpuacct root css present. (That is likely
> a big amount nowadays, that may be the reason why you don't see full
> recovery? For future, we may at least guard cpuacct_charge() with
> cgroup_subsys_enabled() static branch.)
Good catch!
Actually I also tested moving only 'destroy_work' and 'destroy_rwork'
('parent' is not touched, at the cost of 8 more bytes of padding), which
has a similar effect: it recovers to about a 15% regression.
> > [snip]
> > Yes, I'm afraid so, given that the policy/algorithm used by the prefetcher
> > keeps changing from generation to generation.
>
> Exactly. I'm afraid of relayouting the structure with each new
> generation. A robust solution is putting all frequently accessed members
> into individual cache-lines + separating them with one more cache line? :-/
Yes, this is hard. Even for my debug patch, we can only say that it partly
recovers the regression, without knowing the exact reason.
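For reference, the "one member per cache line plus a spare line" layout
mentioned above could look roughly like the sketch below. It is only an
illustration: the struct and its member names are hypothetical, 64-byte
cache lines are assumed, and C11 alignas stands in for whatever the
kernel would actually use.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define LINE_SIZE 64	/* assumption: 64-byte cache lines */

/*
 * Hypothetical struct, member names are only illustrative.
 * Aligning every hot member to 2 * LINE_SIZE puts each one at the start
 * of its own line and leaves the following line as padding.
 */
struct hot_demo {
	alignas(2 * LINE_SIZE) void *cgroup;
	alignas(2 * LINE_SIZE) void *parent;
	alignas(2 * LINE_SIZE) unsigned long flags;
};

/* Cache line index of a byte offset, counted from the start of the struct. */
size_t hot_line_of(size_t offset)
{
	return offset / LINE_SIZE;
}

/* Hot members land on lines 0, 2 and 4; lines 1, 3 and 5 are padding. */
void check_hot_layout(void)
{
	assert(hot_line_of(offsetof(struct hot_demo, cgroup)) == 0);
	assert(hot_line_of(offsetof(struct hot_demo, parent)) == 2);
	assert(hot_line_of(offsetof(struct hot_demo, flags)) == 4);
	assert(sizeof(struct hot_demo) == 6 * LINE_SIZE);
}
```

Three small members end up taking six cache lines, so the cost in memory
is obvious, and it still only holds for the line size assumed here.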
Thanks,
Feng
>
> Michal