Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL
From: Gregory Price
Date: Tue Apr 09 2024 - 20:00:35 EST
On Mon, Apr 08, 2024 at 10:41:04PM +0900, Honggyu Kim wrote:
> Hi Gregory,
>
> On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price <gregory.price@xxxxxxxxxxxx> wrote:
> > Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> > but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> >
> > The question I have is exactly how often is MIGRATE_HOT actually being
> > utilized, and how much data is being moved. Testing MIGRATE_COLD only
> > would at least give a rough approximation of that.
>
> To explain this, I had better share more test results. In the
> "Evaluation Workload" section, the test sequence can be summarized as follows.
>
> *. "Turn on DAMON."
> 1. Allocate cold memory(mmap+memset) at DRAM node, then make the
> process sleep.
> 2. Launch redis-server and load prebaked snapshot image, dump.rdb.
> (85GB consumed: 52GB for anon and 33GB for file cache)
Aha! I see now: you are allocating memory to ensure the real workload
(redis-server) pressures the DRAM tier and causes "spillage" to the CXL
tier, and then measuring the overhead in different scenarios.
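For anyone following along, the cold-memory step (mmap + memset, then
sleep) could be sketched roughly as below. This is only an illustration,
not the harness used in the evaluation: the function names and sizes are
made up, and it assumes the process is pinned to the DRAM node externally
(for example by launching it under numactl --membind).

```python
import mmap
import time

PAGE = mmap.PAGESIZE


def allocate_cold(size_bytes):
    """Map anonymous memory and write to every page so it is really backed.

    Hypothetical helper: stands in for the "mmap+memset" step. NUMA
    placement is NOT handled here and is assumed to be set by the launcher.
    """
    buf = mmap.mmap(-1, size_bytes,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    chunk = b"\x01" * PAGE
    # memset equivalent: touch one page at a time to fault everything in.
    for off in range(0, size_bytes, PAGE):
        n = min(PAGE, size_bytes - off)
        buf[off:off + n] = chunk[:n]
    return buf


def hold_cold(buf, seconds):
    """Keep the mapping alive but never touch it again, so it stays cold."""
    time.sleep(seconds)
    return len(buf)


# Example (sizes are placeholders, far smaller than the 440G-500G runs):
#   cold = allocate_cold(64 * 1024 * 1024)
#   hold_cold(cold, 3600)
```

Because the pages are written once and then left alone, a monitor such as
DAMON should eventually see them as cold, which is what makes them
demotion candidates in the scenarios discussed here.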
I would still love to know what a demote-only system would produce,
mostly because it would very clearly demonstrate the value of
the demote+promote system when the system is memory-pressured.
The additional results below suggest a demote-only system would
likely trend toward CXL-only, so they provide affirmative support
for the promotion logic.
Just another datum that is useful and paints a more complete picture.
> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another scenario, which lazily enabled DAMON just
> before the YCSB run at step 4. I will call this test "DAMON lazy". This
> part is missing from the cover letter.
>
> 1. Allocate cold memory(mmap+memset) at DRAM node, then make the
> process sleep.
> 2. Launch redis-server and load prebaked snapshot image, dump.rdb.
> (85GB consumed: 52GB for anon and 33GB for file cache)
> *. "Turn on DAMON."
>
> In the "DAMON lazy" scenario, DAMON started monitoring late, so the
> initial redis-server placement is the same as "default", but it started
> to demote cold data and promote redis data just before the YCSB run.
>
This is excellent and definitely demonstrates part of the picture I was
alluding to, thank you for the additional data!
>
> I have included "DAMON lazy" result and also the migration size by new
> DAMOS migrate actions. Please note that demotion size is way higher
> than promotion because promotion target is only for redis data, but
> demotion target includes huge cold memory allocated by mmap + memset.
> (there could be some ping-pong issue though.)
>
> As you mentioned, "DAMON tiered" case gets more benefit because new
> redis allocations go to DRAM more than "default", but it also gets
> benefit from promotion when it is under higher memory pressure as shown
> in 490G and 500G cases. It promotes 22GB and 17GB of redis data to DRAM
> from CXL.
I think a better way of saying this is that "DAMON tiered" more
effectively mitigates the effect of memory pressure on the faster tier
before spillage occurs, while "DAMON lazy" demonstrates the expected
performance of the system after memory pressure outruns the demotion
logic, so you wind up with hot data stuck in the slow tier.
There are some out there that would simply say "just demote more
aggressively", so this is useful information for the discussion.
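To make the "just demote more aggressively" point concrete, here is a toy
model of the lazy starting state (DRAM full of cold pages, hot pages
already spilled to CXL). It is purely illustrative: the function, its
parameters, and every number are hypothetical, and it does not model
DAMON's actual region scoring or quotas.

```python
def simulate(dram_capacity, cold_in_dram, hot_in_cxl, steps,
             demote_per_step, promote_per_step):
    """Return how many hot pages are resident in DRAM after `steps` rounds.

    Toy model, all units are pages. Each round first demotes cold pages
    (DRAM -> CXL), then promotes hot pages (CXL -> DRAM) into free space.
    """
    hot_in_dram = 0
    for _ in range(steps):
        # Demotion frees DRAM by moving cold pages out.
        demoted = min(demote_per_step, cold_in_dram)
        cold_in_dram -= demoted
        # Promotion can only use DRAM that is currently free.
        free = dram_capacity - cold_in_dram - hot_in_dram
        promoted = min(promote_per_step, hot_in_cxl, free)
        hot_in_cxl -= promoted
        hot_in_dram += promoted
    return hot_in_dram


# Demote-only, even very aggressively: hot pages never come back.
print(simulate(100, 100, 40, 10, demote_per_step=100, promote_per_step=0))
# Demote + promote: the hot set migrates back as DRAM frees up.
print(simulate(100, 100, 40, 10, demote_per_step=10, promote_per_step=10))
```

In this model, pages that already landed in the slow tier stay there
regardless of the demotion rate unless something promotes them, which is
the situation the "DAMON lazy" case exercises.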
+/- ~2% despite greater memory migration is an excellent result.
> > Can you also provide the DRAM-only results for each test? Presumably,
> > as workload size increases from 440G to 500G, the system probably starts
> > using some amount of swap/zswap/whatever. It would be good to know how
> > this system compares to swap small amounts of overflow.
>
> It looks like my explanation was not clear. The sizes from 440GB to
> 500GB are for the pre-allocated cold data, which puts memory pressure
> on the system so that redis-server cannot be fully allocated in fast
> DRAM and is partially allocated in CXL memory as well.
>
Yes, sorry for the misunderstanding. This makes it much clearer.
>
> I hope my explanation is helpful for you to understand. Please let me
> know if you have more questions.
>
Excellent work, exciting results! Thank you for the additional answers
:]
~Gregory