Re: [PATCH 0/3] memory tiering: hot page selection
From: ying.huang@xxxxxxxxx
Date: Mon Apr 11 2022 - 04:17:09 EST
Hi, Andrew,
Thanks a lots for your comments!
On Fri, 2022-04-08 at 21:07 -0700, Andrew Morton wrote:
> On Fri, 8 Apr 2022 15:12:19 +0800 Huang Ying <ying.huang@xxxxxxxxx>
> wrote:
>
> > To optimize page placement in a memory tiering system with NUMA
> > balancing, the hot pages in the slow memory node need to be
> > identified. Essentially, the original NUMA balancing
> > implementation
> > selects and promote the mostly recently accessed (MRU) pages. But
> > the
> > recently accessed pages may be cold.
>
> Wait. What. How on earth could the most recently accessed page be
> considered "cold"?
>
> Changelog needs work, please.
Sorry for confusing. What I really want to say is,
The pages with low accessing frequency should be kept in slow memory
node. But these pages will still be accessed, so they may be selected
and migrated to the fast memory node by the original NUMA balancing
implementation upon accessing.
Are the new words better?
> > So in this patchset, we
> > implement a new hot page identification algorithm based on the
> > latency
> > between NUMA balancing page table scanning and hint page fault.
>
> That sounds reasonable.
>
> > And the hot page promotion can incur some overhead in the system.
> > To
> > control the overhead a simple promotion rate limit mechanism is
> > implemented.
>
> That sounds like a hack to fix a problem you just added? Maybe a
> reasonable one..
Sorry again for confusing. Try to reword it as follows,
If there are hot pages in slow memory node and cold pages in fast
memory node, we need to promote/demote hot/cold pages between the fast
and cold memory nodes.
One choice is to promote/demote as fast as possible. But the CPU
cycles and memory bandwidth consumed by the fast promoting/demoting may
hurt the performance of the workload, e.g., the workload latency.
One way to resolve this issue is to restrict the max promoting/demoting
bandwidth. It will take longer to finish the promoting/demoting. But
the workload latency may be better. This is implemented in this
patchset as the page promotion rate limit mechanism.
> > The hot threshold used to identify the hot pages is workload
> > dependent
> > usually. So we also implemented a hot threshold automatic
> > adjustment
> > algorithm. The basic idea is to increase/decrease the hot
> > threshold
> > to make the number of pages that pass the hot threshold (promote
> > candidate) near the rate limit.
>
> hot threshold is tweakable via debugfs. So it's an unstable
> interface,
> undocumented, may be withdrawn at any time, etc.
>
> Please justify this. Is it intended that this interface be removed?
> If yes, please describe the plan for this in the changelog. If no
> then
> this interface should be in sysfs, it should be fully documented in
> the
> kernel tree and it should be fully changelogged (preferably with
> usage
> examples) so that reviewers can best understand what is a permanent
> extension to the kernel API which we must maintain for all time.
Previously, I had added a hot threshold sysfs interface. But later, I
thought that it's hard for the users to determine right hot threshold
value. On the other hand, in patch [3/3], we implemented a hot
threshold automatic adjustment algorithm, which make it not so
important to set the hot threshold from the user space. So I moved the
hot threshold setting interface to debugfs just for users to
experiment. If it turns out that a formal interface is needed finally,
I will add one.
> Does this code build correctly if !LAST_CPUPID_NOT_IN_PAGE_FLAGS?
>
I have relied on 0-Day kernel build service to capture the build bug of
more configurations and architectures. But I just tried i386
architcture, and found that NUMA balancing isn't supported there, so I
cannot build test that for i386 too. Should I try more?
> > TODO: Add ABI document for new sysctl knob.
>
> um, yes.
>
Sure. Will do that in the next version.
> check_cpupid() is a poor function name. Check for what? Liver
> damage?
> A better identifier would be cpupid_valid(), perhaps. And perhaps it
> should be implemented in mm.h. And even documented.
Yes. cpupid_valid() is much better. I will add document for it in the
next version.
> Oddly, the proposed changes do a decent job of documenting
> intra-function changes but a poor job of documenting newly added
> functions. Please review every new function and decide whether it is
> so simple and obvious that it can be added to Linux without any
> inline
> description at all.
>
> pgdat_free_space_enough() and numa_hint_fault_latency().
Sure. Will check all new functions and add document if necessary.
> numa_migration_check_rate_limit() particularly. There's that "check"
> word again. Check for what?
>
The original idea is to check whether the migration is ratelimited.
Will change this to numa_migration_rate_limit(). It will return true
if migration rate exceed the limit.
>
> The numa_balancing_rate_limit_mbps sysctl. I assume "b" means
> "bytes".
> That's better than "pages", but still. Machines vary a lot in how
> many bytes they have and in the speed at which they process those
> bytes. Is there some metric which we can use here which at least
> partially insulates us from this variability? So that sysadmins
> don't
> need to set all their machines differently, based upon their size and
> speed?
Will change the sysctl to numb_balancing_promote_rate_limit_MBps to
avoid possible confusing.
The default rate limit is 64 GB/s, that is, no limitation. As Mel
Gorman suggested.
To reduce the possible disturbing to the workload, A rule of thumb is
to set it to be 10% of the max slow memory bandwidth. But we don't
know how to detect that automatically yet.
> numa_threshold_ts is apparently in units of jiffies. This was not
> documented at the definition site, and it should be. But jiffies can
> vary between machines and configs. Why add this inter-machine and
> inter-config uncertainty/variability when we have
> include/linux/ktime.h?
>
We need to call cmpxchg() on it, so ktime_t may be not the best choice
here. But I got your idea. I will use some more general time unit,
e.g. milli-second.
Best Regards,
Huang, Ying