Re: [PATCH -V8 1/6] NUMA balancing: optimize page placement for memory tiering system

From: Yang Shi
Date: Thu Sep 16 2021 - 20:48:11 EST


On Wed, Sep 15, 2021 at 6:45 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Yang Shi <shy828301@xxxxxxxxx> writes:
>
> > On Tue, Sep 14, 2021 at 8:58 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Yang Shi <shy828301@xxxxxxxxx> writes:
> >>
> >> > On Tue, Sep 14, 2021 at 6:45 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >>
> >> >> Yang Shi <shy828301@xxxxxxxxx> writes:
> >> >>
> >> >> > On Mon, Sep 13, 2021 at 6:37 PM Huang Ying <ying.huang@xxxxxxxxx> wrote:
> >> >> >>
> >> >> >> With the advent of various new memory types, some machines will have
> >> >> >> multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
> >> >> >> memory subsystem of these machines can be called a memory tiering
> >> >> >> system, because the performance of the different types of memory is
> >> >> >> usually different.
> >> >> >>
> >> >> >> In such a system, because memory access patterns change over time,
> >> >> >> some pages in the slow memory may become hot globally. So in this
> >> >> >> patch, the NUMA balancing mechanism is enhanced to dynamically
> >> >> >> optimize the page placement among the different memory types
> >> >> >> according to how hot or cold the pages are.
> >> >> >>
> >> >> >> In a typical memory tiering system, there are CPUs, fast memory and
> >> >> >> slow memory in each physical NUMA node. The CPUs and the fast memory
> >> >> >> will be put in one logical node (called fast memory node), while the
> >> >> >> slow memory will be put in another (faked) logical node (called slow
> >> >> >> memory node). That is, the fast memory is regarded as local while the
> >> >> >> slow memory is regarded as remote. So it's possible for the recently
> >> >> >> accessed pages in the slow memory node to be promoted to the fast
> >> >> >> memory node via the existing NUMA balancing mechanism.
> >> >> >>
> >> >> >> The original NUMA balancing mechanism stops migrating pages if the free
> >> >> >> memory of the target node would fall below the high watermark. This
> >> >> >> is a reasonable policy if there's only one memory type. But it
> >> >> >> makes the original NUMA balancing mechanism almost useless for optimizing page
> >> >> >> placement among different memory types. Details are as follows.
> >> >> >>
> >> >> >> In the common case, the working-set size of the workload is
> >> >> >> larger than the size of the fast memory nodes. Otherwise, it's
> >> >> >> unnecessary to use the slow memory at all. So in the common case,
> >> >> >> there are almost never enough free pages in the fast memory nodes,
> >> >> >> so the globally hot pages in the slow memory node cannot be
> >> >> >> promoted to the fast memory node. To solve the issue, we have 2
> >> >> >> choices, as follows:
> >> >> >>
> >> >> >> a. Ignore the free pages watermark checking when promoting hot pages
> >> >> >> from the slow memory node to the fast memory node. This will
> >> >> >> create some memory pressure in the fast memory node and thus trigger
> >> >> >> memory reclaim, so that the cold pages in the fast memory
> >> >> >> node will be demoted to the slow memory node.
> >> >> >>
> >> >> >> b. Make kswapd of the fast memory node reclaim pages until the free
> >> >> >> pages are a little more (about 10MB) than the high watermark. Then,
> >> >> >> if the free pages of the fast memory node reach the high watermark and
> >> >> >> some hot pages need to be promoted, kswapd of the fast memory node
> >> >> >> will be woken up to demote some cold pages in the fast memory node to
> >> >> >> the slow memory node. This will free some extra space in the fast
> >> >> >> memory node, so the hot pages in the slow memory node can be
> >> >> >> promoted to the fast memory node.
> >> >> >>
> >> >> >> Choice "a" creates memory pressure in the fast memory
> >> >> >> node. If the memory pressure of the workload is high, the memory
> >> >> >> pressure may become so high that the memory allocation latency of the
> >> >> >> workload is affected, e.g. direct reclaim may be triggered.
> >> >> >>
> >> >> >> Choice "b" works much better in this respect. If the memory
> >> >> >> pressure of the workload is high, hot page promotion will stop
> >> >> >> earlier because its allocation watermark is higher than that of
> >> >> >> normal memory allocation. So in this patch, choice "b" is
> >> >> >> implemented.
> >> >> >>
> >> >> >> In addition to the original page placement optimization among sockets,
> >> >> >> the NUMA balancing mechanism is extended to optimize page
> >> >> >> placement according to hot/cold among different memory types. So the
> >> >> >> sysctl user space interface (numa_balancing) is extended in a backward
> >> >> >> compatible way as follows, so that users can enable/disable these
> >> >> >> functionalities individually.
> >> >> >>
> >> >> >> The sysctl is converted from a Boolean value to a bit field. The
> >> >> >> flags are defined as follows:
> >> >> >>
> >> >> >> - 0x0: NUMA_BALANCING_DISABLED
> >> >> >> - 0x1: NUMA_BALANCING_NORMAL
> >> >> >> - 0x2: NUMA_BALANCING_MEMORY_TIERING
> >> >> >
> >> >> > Thanks for coming up with the patches. TBH the first question off the
> >> >> > top of my head is whether all the complexity is really worth it for
> >> >> > real-life workloads at the moment. And the interfaces (sysctl knob files
> >> >> > exported to users) look complicated for the users. I don't know if the
> >> >> > users know how to set an optimal value for their workloads.
> >> >> >
> >> >> > I don't disagree that NUMA balancing needs optimization and improvement
> >> >> > for tiered memory; the question we need to answer is how far we should
> >> >> > go for now and what the interfaces should look like. Does it make
> >> >> > sense to you?
> >> >> >
> >> >> > IMHO I'd prefer the simplest and most straightforward approach at the
> >> >> > moment. For example, we could just skip the high watermark check for PMEM
> >> >> > promotion.
> >> >>
> >> >> Hi, Yang,
> >> >>
> >> >> Thanks for comments.
> >> >>
> >> >> I understand your concerns about complexity. I have tried to organize
> >> >> the patchset so that the initial patch is as simple as possible and the
> >> >> complexity is introduced step by step. But it seems that your simplest
> >> >> version is even simpler than mine :-)
> >> >>
> >> >> In this patch ([1/6]), I introduced 2 things.
> >> >>
> >> >> Firstly, a sysctl knob is provided to disable the NUMA balancing based
> >> >> promotion. Per my understanding, you suggest removing this. If so,
> >> >> optimizing cross-socket access and promoting hot PMEM pages to DRAM must
> >> >> be enabled/disabled together. If a user wants to enable promoting the
> >> >> hot PMEM pages to DRAM but disable optimizing cross-socket access
> >> >> because they have already bound the CPU of the workload so that there's not
> >> >> much cross-socket access, how can they do that?
> >> >
> >> > I should make myself clearer. Here I mean the whole series, not this
> >> > specific patch. I'm concerned that the interfaces (hint fault latency
> >> > and ratelimit) are hard for users to understand and configure, and
> >> > about whether we are going too far at the moment. I deal with the end
> >> > users, and I admit I'm not even sure how to configure the knobs to
> >> > achieve optimal performance for different real-life workloads.
> >>
> >> Sorry, I misunderstood your original idea. I understand that the knob
> >> isn't user-friendly. But sometimes, we cannot avoid it completely :-(
> >> In this patchset, I try to introduce the complexity and knobs one by
> >> one, and show the performance benefit of each step for people to judge
> >> whether the newly added complexity and knob are justified by the
> >> performance improvement. If the benefit of some patches cannot justify
> >> their complexity, I am OK with merging just part of the patchset first.
> >
> > Understood. But I really hesitate to go that far at this moment since
> > the picture is not that clear yet IMHO. We have to support them (maybe
> > forever) once we merge them.
>
> OK. The [1-3/6] is the simplest implementation. Can we start with that
> first?

Sure.
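
For reference, the bit-field semantics of the numa_balancing sysctl
described in the patch description quoted above could be tested roughly
as below. This is only a sketch: the flag values are the ones listed in
the description, while the helper names numa_mode_normal() and
numa_mode_tiering() are hypothetical and are not taken from the patch.

```c
#include <stdbool.h>

/* Flag values as listed in the quoted patch description. */
#define NUMA_BALANCING_DISABLED        0x0
#define NUMA_BALANCING_NORMAL          0x1
#define NUMA_BALANCING_MEMORY_TIERING  0x2

/* Hypothetical helper: is cross-socket (normal) balancing enabled? */
static bool numa_mode_normal(unsigned int mode)
{
	return (mode & NUMA_BALANCING_NORMAL) != 0;
}

/* Hypothetical helper: is hot page promotion between tiers enabled? */
static bool numa_mode_tiering(unsigned int mode)
{
	return (mode & NUMA_BALANCING_MEMORY_TIERING) != 0;
}
```

With this encoding, a value of 2 would enable tiering promotion alone
and 3 would enable both, which covers the case of a workload that is
already bound to its CPUs.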

>
> > So I'd prefer to work on the simplest and most necessary stuff for
> > now. Just like how we dealt with demotion.
> >
> >>
> >> So how about being more specific? For example, if you are generally OK with
> >> the complexity and knob introduced by [1-3/6], but have concerns about
> >> [4/6], then we can discuss that specifically?
> >
> > Yeah, we could.
> >
> >>
> >> > For this specific patch I'm OK with a new promotion mode. There might be
> >> > use cases where users just want to do promotion between tiered memory but
> >> > don't care about NUMA locality.
> >>
> >> Yes.
> >>
> >> >> Secondly, we add a promote watermark to the DRAM node so that we can
> >> >> demote/promote pages between the high and promote watermarks. Per my
> >> >> understanding, you suggest just ignoring the high watermark check
> >> >> for promotion. The problem is that this may leave the DRAM node with
> >> >> too few free pages. If many pages are promoted in a short time, the free
> >> >> pages will be kept near the min watermark for a while, so that page
> >> >> allocation from the application will trigger direct reclaim. We have
> >> >> observed page allocation failure in a test before with a similar policy.
> >> >
> >> > The question, which applies to the hint fault latency and ratelimit
> >> > too, is: we already have some NUMA balancing knobs to control scan period
> >> > and scan size, and watermark knobs to tune how aggressively kswapd
> >> > works. Can they do the same job without introducing any new knobs?
> >>
> >> In this specific patch, we don't introduce a new knob for the page
> >> demotion. For other knobs, how about discussing them one by one, in the
> >> patches that introduce them?
> >
> > That comment is applicable to the watermark hack in this patch too.
> > Per your above description, the problem is that a significant amount of
> > promotion in a short period of time may deplete free memory. So I'm
> > wondering if the amount of promotion could be ratelimited by the NUMA
> > balancing scan period and scan size. I understand this may leave some
> > hot pages on PMEM for a longer time, but does it really matter?
> > In addition, the gap between min <--> low <--> high could be adjusted
> > by watermark_scale_factor, so kswapd could work more aggressively to
> > keep free memory.
>
> We can control the NUMA balancing scan speed, but we cannot control the
> speed of the hint page faults. For example, we scanned a large portion

Could adjusting scan size help out?


> of PMEM without many hint page faults because the pages are really cold,
> but suddenly a large amount of cold pages become hot, so they will be
> promoted to DRAM. This will create heavy memory pressure on the DRAM node,
> making it hard for the normal page allocation from the applications.
>
> And, for some workloads, we need to promote the hot pages to DRAM
> quickly, otherwise, the pages will become cold. We should make it
> possible to support these users too. Do you agree?

I agree there may be such workloads. But do we have to achieve very
good support for them right now? We don't even know how common such
workloads are.
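
To make the watermark_scale_factor point above concrete, here is a rough
model of how the gaps between the watermarks scale with that knob. It
assumes the documented semantics (the factor is in units of 1/10000 of
the managed memory, with a default of 10) and a floor of a quarter of
the min watermark. The struct and function names are hypothetical, and
this is a simplified sketch rather than the kernel code itself.

```c
#include <stdint.h>

/* Hypothetical container for the three per-zone watermarks, in pages. */
struct wmarks {
	uint64_t min;
	uint64_t low;
	uint64_t high;
};

/*
 * Simplified model (an assumption, not kernel code): the gap between
 * adjacent watermarks is the larger of min/4 and
 * managed_pages * scale_factor / 10000.
 */
static struct wmarks model_wmarks(uint64_t min_pages,
				  uint64_t managed_pages,
				  uint64_t scale_factor)
{
	uint64_t gap = managed_pages * scale_factor / 10000;
	struct wmarks w;

	if (gap < min_pages / 4)
		gap = min_pages / 4;

	w.min = min_pages;
	w.low = min_pages + gap;	/* kswapd is woken below this */
	w.high = min_pages + 2 * gap;	/* kswapd reclaims up to this */
	return w;
}
```

For example, with 1,000,000 managed pages and a min watermark of 1,000
pages, raising the factor from 10 to 100 widens each gap from 1,000 to
10,000 pages in this model, so kswapd would start earlier and reclaim
further.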

>
> Best Regards,
> Huang, Ying