Re: [PATCH -V3 0/3] memory tiering: hot page selection

From: Baolin Wang
Date: Sun Jun 19 2022 - 23:19:29 EST


On 6/14/2022 4:16 PM, Huang Ying wrote:

To optimize page placement in a memory tiering system with NUMA
balancing, the hot pages in the slow memory nodes need to be
identified. Essentially, the original NUMA balancing implementation
selects the most recently used (MRU) pages to promote. But this
isn't a perfect algorithm for identifying hot pages, because pages
with quite low access frequency may still be accessed eventually, given
that the NUMA balancing page table scanning period can be quite long
(e.g. 60 seconds). So in this patchset, we implement a new hot page
identification algorithm based on the latency between NUMA balancing
page table scanning and the hint page fault, which is a kind of most
frequently used (MFU) algorithm.
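
As a rough illustration of the latency-based idea described above (not
the patchset's actual code), the sketch below treats a page as hot when
the hint page fault arrives soon after the scan that marked it. The
names Page, hint_fault_latency_ms and is_hot, and the 1000 ms threshold,
are assumptions made up for this example.

```python
from dataclasses import dataclass

# Assumed threshold for illustration only; the real value is tunable.
HOT_THRESHOLD_MS = 1000


@dataclass
class Page:
    # Time at which the page table scan marked this page (hypothetical field).
    scan_time_ms: int


def hint_fault_latency_ms(page, fault_time_ms):
    """Time between the scan and the first hint page fault on the page."""
    return fault_time_ms - page.scan_time_ms


def is_hot(page, fault_time_ms, threshold_ms=HOT_THRESHOLD_MS):
    """A short scan-to-fault latency suggests the page is accessed frequently."""
    return hint_fault_latency_ms(page, fault_time_ms) < threshold_ms
```

A page faulted 200 ms after the scan would count as hot here, while one
first touched 30 seconds later would not, even though both were
"recently accessed" at fault time.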

In NUMA balancing memory tiering mode, if there are hot pages in the slow
memory node and cold pages in the fast memory node, we need to
promote/demote hot/cold pages between the fast and slow memory nodes.

One choice is to promote/demote as fast as possible. But the CPU cycles
and memory bandwidth consumed by high promoting/demoting
throughput will hurt the latency of some workloads because of inflated
access latency and slow memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput. It will take longer to finish the promoting/demoting.
But the workload latency will be better. This is implemented in this
patchset as the page promotion rate limit mechanism.
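
To make the rate limit idea concrete, here is a minimal sketch of a
fixed one-second window limiter. It is only a model of the concept: the
class PromotionRateLimiter and its methods are hypothetical and do not
reflect how the kernel patch accounts for promoted pages.

```python
class PromotionRateLimiter:
    """Fixed one-second window limiter (hypothetical, for illustration)."""

    def __init__(self, limit_mb_per_sec):
        self.limit_mb = limit_mb_per_sec
        self.window = None   # index of the current one-second window
        self.used_mb = 0.0   # amount promoted so far in this window

    def try_promote(self, now_sec, size_mb):
        """Return True if promoting size_mb now stays within the limit."""
        window = int(now_sec)
        if window != self.window:
            # A new window has started, so reset the budget.
            self.window = window
            self.used_mb = 0.0
        if self.used_mb + size_mb > self.limit_mb:
            return False     # over budget: leave the page where it is
        self.used_mb += size_mb
        return True
```

Pages that are refused simply stay in slow memory until a later window,
which is why promotion takes longer to finish under a limit.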

The promotion hot threshold is workload and system configuration
dependent. So in this patchset, a method to adjust the hot threshold
automatically is implemented. The basic idea is to control the number
of the candidate promotion pages to match the promotion rate limit.
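
The feedback loop described above could look roughly like the following
sketch: if more candidate pages pass the threshold than the rate limit
allows, tighten the threshold; if fewer, relax it. The function name,
the 10% dead band, the step size and the 1000 ms upper bound are all
assumptions chosen for this example, not values from the patch.

```python
def adjust_threshold(threshold_ms, candidates, rate_limit,
                     max_threshold_ms=1000):
    """Return a new hot threshold given last period's candidate count.

    candidates and rate_limit are in the same unit (e.g. pages per period).
    """
    step = max_threshold_ms // 10
    if candidates * 10 > rate_limit * 11:
        # Too many candidates: require a shorter scan-to-fault latency.
        return max(step, threshold_ms - step)
    if candidates * 10 < rate_limit * 9:
        # Too few candidates: accept pages with a longer latency.
        return min(max_threshold_ms, threshold_ms + step)
    return threshold_ms
```

Called once per adjustment period, this keeps the number of candidates
near the promotion budget without the user having to pick a threshold
per workload.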

We used the pmbench memory accessing benchmark to test the patchset on
a 2-socket server system with DRAM and PMEM installed. The test
results are as follows:

                    pmbench score    promote rate
                     (accesses/s)            MB/s
                    -------------    ------------
base                  146887704.1           725.6
hot selection         165695601.2           544.0
rate limit            162814569.8           165.2
auto adjustment       170495294.0           136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases
about 4.7%, and promote rate decreases about 17.1%, compared with rate
limit patch.
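
The percentages quoted above can be re-derived from the pmbench numbers
in the results. The short script below does only that arithmetic; the
helper pct_change is a name introduced for this example.

```python
# (pmbench score in accesses/s, promote rate in MB/s) as reported above.
results = {
    "base":            (146887704.1, 725.6),
    "hot selection":   (165695601.2, 544.0),
    "rate limit":      (162814569.8, 165.2),
    "auto adjustment": (170495294.0, 136.9),
}


def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


steps = ["base", "hot selection", "rate limit", "auto adjustment"]
changes = {}
for prev, cur in zip(steps, steps[1:]):
    score = pct_change(results[cur][0], results[prev][0])
    rate = pct_change(results[cur][1], results[prev][1])
    changes[cur] = (score, rate)
    print(f"{cur} vs {prev}: score {score:+.1f}%, promote rate {rate:+.1f}%")
```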

I did a simple test with MySQL on my machine, which contains 1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

From the data below, the TPS improves by about 5%, and I think this is a good start for optimizing the promotion. So for this series, please feel free to add:

Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
Tested-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>

Without this patchset:
transactions: 2080188 (3466.48 per sec.)

With this patchset:
transactions: 2174296 (3623.40 per sec.)
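
For reference, the improvement implied by the two sysbench runs above
can be computed directly from the reported per-second figures; the
variable names are only for this example.

```python
tps_without = 3466.48  # transactions per second without the patchset
tps_with = 3623.40     # transactions per second with the patchset

improvement_pct = (tps_with - tps_without) / tps_without * 100.0
print(f"TPS improvement: {improvement_pct:.1f}%")
```

This works out to roughly 4.5%, in line with the "about 5%" figure.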
