Re: [PATCH] mm: swap: determine swap device by using page nid

From: Aaron Lu
Date: Fri Apr 29 2022 - 06:27:07 EST


On Fri, Apr 22, 2022 at 10:00:59AM -0700, Yang Shi wrote:
> On Thu, Apr 21, 2022 at 11:24 PM Aaron Lu <aaron.lu@xxxxxxxxx> wrote:
> >
> > On Thu, Apr 21, 2022 at 04:34:09PM +0800, ying.huang@xxxxxxxxx wrote:
> > > On Thu, 2022-04-21 at 16:17 +0800, Aaron Lu wrote:
> > > > On Thu, Apr 21, 2022 at 03:49:21PM +0800, ying.huang@xxxxxxxxx wrote:
> >
> > ... ...
> >
> > > > > For swap-in latency, we can use pmbench, which can output latency
> > > > > information.
> > > > >
> > > >
> > > > OK, I'll give pmbench a run, thanks for the suggestion.
> > >
> > > Better to construct a scenario with more swapin than swapout. For
> > > example, start a memory eater, then kill it later.
> >
> > What about vm-scalability/case-swapin?
> > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swapin
> >
> > I think you are pretty familiar with it but still:
> > 1) it starts $nr_task processes and each mmaps $size/$nr_task area and
> > then consumes the memory, after this, it waits for a signal;
> > 2) start another process to consume $size memory to push the memory in
> > step 1) to swap device;
> > 3) kick processes in step 1) to start accessing their memory, thus
> > trigger swapins. The metric of this testcase is the swapin throughput.
> >
> > I plan to restrict the cgroup's limit to $size.
> >
> > Considering there is only one NVMe drive attached to node 0, I will run
> > the test as described before:
> > 1) bind processes to run on node 0, allocate on node 1 to test the
> > performance when reclaimer's node id is the same as swap device's.
> > 2) bind processes to run on node 1, allocate on node 0 to test the
> > performance when page's node id is the same as swap device's.
> >
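
In case a concrete picture helps, the access pattern of the case
described above is roughly the following minimal userspace sketch
(this would be one of the $nr_task processes; the 1GB size, the
SIGUSR1 kick and the lack of error handling are only for
illustration, the real script is in the vm-scalability repo linked
above):

#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t kicked;

static void on_kick(int sig)
{
        kicked = 1;
}

int main(void)
{
        size_t len = 1UL << 30;        /* this process' share of $size */
        char *buf;

        signal(SIGUSR1, on_kick);

        buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* step 1: populate the anonymous memory, then wait */
        memset(buf, 1, len);
        while (!kicked)
                pause();

        /*
         * step 3: after the memory eater of step 2 has pushed these
         * pages out, every page touched here triggers a swapin.
         */
        for (size_t off = 0; off < len; off += 4096)
                buf[off] += 1;

        return 0;
}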

Thanks to Tim, who found me a server that has a single Optane disk
attached to node 0.

Let's use task0_mem0 to denote tasks bound to node 0 and memory bound
to node 0 through cgroup cpuset, and similarly for the other
combinations. For the above swapin case:
when nr_task=1:
task0_mem0 throughput: [571652, 587158, 594316], avg=584375 -> baseline
task0_mem1 throughput: [582944, 583752, 589026], avg=585240 +0.15%
task1_mem0 throughput: [569349, 577459, 581107], avg=575971 -1.4%
task1_mem1 throughput: [564482, 570664, 571466], avg=568870 -2.6%

task0_mem1 is slightly better than task1_mem0.

For nr_task=8 and nr_task=16, I also gave it a run and the results
are almost the same across all 4 cases.
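
FWIW, the taskX_memY binding above was done through cgroup cpuset
(cpuset.cpus and cpuset.mems); roughly the same per-process binding
could also be expressed with libnuma, something like the sketch
below (node numbers are just an example):

#include <numa.h>
#include <stdio.h>

/*
 * Bind the calling process the way task<cpu_node>_mem<mem_node> is
 * meant to behave: run on one node, allocate from another.
 * (Illustrative only; the actual test used cgroup cpuset.)
 */
static int bind_task_and_mem(int cpu_node, int mem_node)
{
        struct bitmask *mems;

        if (numa_available() < 0)
                return -1;

        /* run only on the CPUs of cpu_node */
        if (numa_run_on_node(cpu_node) < 0)
                return -1;

        /* allocate only from mem_node */
        mems = numa_allocate_nodemask();
        numa_bitmask_setbit(mems, mem_node);
        numa_set_membind(mems);
        numa_free_nodemask(mems);

        return 0;
}

int main(void)
{
        /* e.g. task1_mem0: run on node 1, allocate from node 0 */
        if (bind_task_and_mem(1, 0))
                fprintf(stderr, "numa binding failed\n");
        /* ... start the workload here ... */
        return 0;
}

(link with -lnuma)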

> > Ying and Yang,
> >
> > Let me know what you think about the case used and the way the test is
> > conducted.
>
> Looks fine to me. To measure the latency, you could also try the below
> bpftrace script:
>

Trying to install bpftrace on an old distro (Ubuntu 16.04) is a real
pain, so I gave up... But I managed to get an old bcc installed.
Using the funclatency script it provides to profile swap_readpage()
for 30 seconds, there is no obvious difference in the histograms.

So for now, the existing results don't show a big difference.
Theoretically, for an IO device, when swapping out a remote page,
using a remote swap device that sits on the same node as the page can
reduce interconnect traffic and improve performance. I think this is
the main motivation for this code change?
At swapin time, it's hard to say which node the task will be running
on anyway, so it's hard to say which swap device would be the
beneficial one.
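
To make that point concrete, the choice boils down to which node id
is used to pick the per-node swap device list when a swap entry is
allocated. An illustrative snippet (not the actual patch):

#include <linux/mm.h>
#include <linux/topology.h>

/*
 * Illustrative only, not the real patch: at swap entry allocation
 * time, the per-node swap device list could be selected either by
 * the node the reclaiming task is running on (current behaviour)
 * or by the node the page's memory lives on (what the patch does).
 */
static int swap_target_node(struct page *page, bool by_page_nid)
{
        return by_page_nid ? page_to_nid(page) : numa_node_id();
}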