On Tue, Nov 29, 2022 at 5:14 AM Yongqiang Liu <liuyongqiang13@xxxxxxxxxx> wrote:

Do you mean "workingset" used by some 3rd party k8s monitoring tools?

> On 2022/11/29 4:01, Yang Shi wrote:
>> On Sat, Nov 26, 2022 at 5:10 AM Yongqiang Liu <liuyongqiang13@xxxxxxxxxx> wrote:
>
> Thanks!
>
>> This should be caused by the deferred split of THP.
>>> Hi,
>>> We use mm_counter to count how much physical memory a process uses,
>>> while the page_counter of a memcg counts how much physical memory a
>>> cgroup uses. If a cgroup contains only one process, the two look
>>> almost the same. But with THP enabled, memory.usage_in_bytes of the
>>> memcg can sometimes be twice or more the Rss reported in
>>> /proc/[pid]/smaps_rollup, as follows:
>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
>>> 1080930304
>>> [root@localhost sda]# cat /sys/fs/cgroup/memory/test/cgroup.procs
>>> 1290
>>> [root@localhost sda]# cat /proc/1290/smaps_rollup
>>> 55ba80600000-ffffffffff601000 ---p 00000000 00:00 0          [rollup]
>>> Rss:              500648 kB
>>> Pss:              498337 kB
>>> Shared_Clean:       2732 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:       364 kB
>>> Private_Dirty:    497552 kB
>>> Referenced:       500648 kB
>>> Anonymous:        492016 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:    129024 kB
>>> ShmemPmdMapped:        0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>> THPeligible:           0
>>> I found the difference is because __split_huge_pmd decreases the
>>> mm_counter, but the page_counter in the memcg is not decreased while
>>> the refcount of the head page is non-zero. Here is the call path:
>>> do_madvise
>>>   madvise_dontneed_free
>>>     zap_page_range
>>>       unmap_single_vma
>>>         zap_pud_range
>>>           zap_pmd_range
>>>             __split_huge_pmd
>>>               __split_huge_pmd_locked
>>>                 __mod_lruvec_page_state
>>>             zap_pte_range
>>>               add_mm_rss_vec
>>>                 add_mm_counter            -> decreases the mm_counter
>>>       tlb_finish_mmu
>>>         arch_tlb_finish_mmu
>>>           tlb_flush_mmu_free
>>>             free_pages_and_swap_cache
>>>               release_pages
>>>                 folio_put_testzero(page)  -> not zero: skip, continue;
>>>                 __folio_put_large
>>>                   free_transhuge_page
>>>                     free_compound_page
>>>                       mem_cgroup_uncharge
>>>                         page_counter_uncharge -> decreases the page_counter
>>> node_page_stat, which shows in meminfo, is also decreased. So
>>> __split_huge_pmd seems to free no physical memory unless the whole
>>> THP is freed. I am confused about which one is the true physical
>>> memory usage of a process.
>> When MADV_DONTNEED is called on part of the mapping, the huge PMD is
>> split, but the THP itself will not be split until memory pressure is
>> hit (global or the memcg limit). So the unmapped subpages are not
>> actually freed until that point: the mm counter is decreased due to
>> the zapping, but the physical pages are not actually freed, and hence
>> not uncharged from the memcg.
> I don't know how much memory a real workload will cost, so I just
> measured max_usage_in_bytes of the memcg with THP disabled and added a
> little bit more for limit_in_bytes of the memcg with THP enabled,
> which triggered an OOM... (it actually cost 100M more with THP
> enabled). Another test case, whose memory consumption I knew, did not
> trigger an OOM with a suitable memcg limit, and I saw the THP split
> when the memory hit the limit.
> I have another concern: k8s usually uses (rss - files) to estimate the
> memory workload, but the anon THP on the deferred split list, still
> charged to the memcg, will make it look higher than actual. And it
> seems the container will be killed without OOM...
> Thanks.

I recall that depends on what monitoring tools you use; for example,
some monitoring uses active_anon + active_file.

Yes, but the deferred split shrinker should handle this quite
gracefully. If you have some userspace daemons which monitor the memory
usage by rss, and try to behave smarter and kill the container by
looking at rss solely, you may kill the container prematurely.
> Is it suitable to add meminfo of the deferred split list of THP?

We could, but I can't think of how it would be used to improve the use
case. Any more thoughts?

> Kind regards,
> Yongqiang Liu