Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

From: Kirill Tkhai
Date: Thu Mar 22 2018 - 12:52:51 EST


On 22.03.2018 08:01, Dave Chinner wrote:
> On Wed, Mar 21, 2018 at 07:15:14PM +0300, Kirill Tkhai wrote:
>> On 20.03.2018 17:34, Dave Chinner wrote:
>>> On Tue, Mar 20, 2018 at 04:15:16PM +0300, Kirill Tkhai wrote:
>>>> On 20.03.2018 03:18, Dave Chinner wrote:
>>>>> On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
>>>>>> On 17.03.2018 00:39, Dave Chinner wrote:
>>>>> Actually, it is fair, because:
>>>>>
>>>>> /* proportion the scan between the caches */
>>>>> dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
>>>>> inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
>>>>> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
>>>>>
>>>>> This means that if the number of objects in the memcg aware VFS
>>>>> caches are tiny compared to the global XFS inode cache, then they
>>>>> only get a tiny amount of the total scanning done by the shrinker.
>>>>
>>>> This is just wrong. If you have two memcgs, where the first is
>>>> charged for xfs dentries and inodes while the second is not, the
>>>> second will still be made to do xfs shrink work just like the
>>>> first.
>>>
>>> That makes no sense to me. Shrinkers are per-sb, so they only
>>> shrink per-filesystem dentries and inodes. If your memcgs are on
>>> different filesystems, then they'll behave according to the size of
>>> that filesystem's caches, not the cache of some other filesystem.
>>
>> But this breaks the isolation purpose of memcg. When someone places
>> different services in different pairs of memcg and cpu cgroup, he
>> does not want them to do foreign work.
>
> Too bad. Filesystems *break memcg isolation all the time*.
>
> Think about it. The filesystem journal is owned by the filesystem,
> not the memcg that is writing a transaction to it. If one memcg
> does lots of metadata modifications, it can use all the journal
> space and starve itself and all other memcgs of journal space while
> the filesystem does writeback to clear it out.
>
> Another example: if one memcg is manipulating free space in a
> particular allocation group, other memcgs that need to manipulate
> space in those allocation groups will block and be prevented from
> operating until the other memcg finishes its work.
>
> Another example: inode IO in XFS is done in clusters of 32. Read or
> write any inode in the cluster, and we read/write all the other
> inodes in that cluster, too. Hence if we need to read an inode and
> the cluster is busy because the filesystem journal needed flushing
> or another inode in the cluster is being unlinked (e.g. by a process
> in a different memcg), then the read in the first memcg will block
> until whatever is being done on the buffer is complete. IOWs, even
> at the physical on-disk inode level we violate memcg resource
> isolation principles.
>
> I can go on, but I think you get the picture: Filesystems are full
> of global structures whose access arbitration mechanisms
> fundamentally break the concept of operational memcg isolation.
>
> With that in mind, we really don't care that the shrinker does
> global work and violates memcg isolation principles, because we
> violate them everywhere. IOWs, if we try to enforce memcg
> isolation in the shrinker, then we can't guarantee forwards progress
> in memory reclaim because we end up with multi-dimensional memcg
> dependencies at the physical layers of the filesystem structure....

Here is the problem I'm solving: https://lkml.org/lkml/2018/3/21/365.
The current shrinker scheme is not scalable. When there are many memcgs
and mounts, iterating shrink_slab() during global reclaim can take a
long time; the shrink_slab() timings are in the message at the link
above. A node with 200 containers may waste 4 seconds on global reclaim
just to iterate over all shrinkers for all cgroups, call
shrinker::count_objects() and receive 0 objects back.
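
To make the scale of that iteration concrete, here is a back-of-the-envelope
sketch (the container and mount counts below are assumptions for illustration,
not the measurements from the message above):

/*
 * Illustration only: in global reclaim, shrink_node() walks every memcg
 * and, for each of them, shrink_slab() walks the shrinker list, so the
 * number of ->count_objects() calls is roughly memcgs * per-sb shrinkers.
 */
#include <stdio.h>

int main(void)
{
	unsigned long memcgs = 200;		/* containers on the node */
	unsigned long sb_shrinkers = 200 * 3;	/* assume ~3 mounts per container */

	printf("->count_objects() calls per shrink_node() pass: %lu\n",
	       memcgs * sb_shrinkers);		/* 120000, most returning 0 */
	return 0;
}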

Can't we call the shrinking of shared objects only for the top memcg?
Something like this:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fcd9f8d7390..13429366c276 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2569,6 +2569,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			}
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
+		/* Called only for top memcg */
+		shrink_shared_objects();
+
+
 		if (global_reclaim(sc))
 			shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
 				    sc->priority);

"Top memcg" means a memcg, which meets reclaim. It is root_mem_cgroup in case of
global reclaim; and it's task's current memcg if there is memcg reclaim.
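
To be clear, shrink_shared_objects() does not exist yet; the diff above is only
a skeleton. A minimal sketch of what I have in mind, assuming it lives in
mm/vmscan.c and gets gfp_mask/nid plumbed in (unlike the bare call above), and
leaving out the deferral and priority scaling that do_shrink_slab() does:

/*
 * Hypothetical sketch: shrink the objects that are shared between memcgs
 * (e.g. the XFS inode cache behind ->nr_cached_objects() and
 * ->free_cached_objects()) once per pass, for the memcg that meets the
 * reclaim, instead of on every child memcg's pass through
 * super_cache_scan().
 */
static void shrink_shared_sb_objects(struct super_block *sb, void *arg)
{
	struct shrink_control *sc = arg;
	long nr;

	if (!sb->s_op->nr_cached_objects)
		return;

	nr = sb->s_op->nr_cached_objects(sb, sc);
	if (nr <= 0)
		return;

	/* Scan a fixed fraction; a real patch would scale by priority. */
	sc->nr_to_scan = nr >> 3;
	sb->s_op->free_cached_objects(sb, sc);
}

static void shrink_shared_objects(gfp_t gfp_mask, int nid)
{
	struct shrink_control sc = {
		.gfp_mask = gfp_mask,
		.nid = nid,
	};

	/* iterate_supers() takes sb->s_umount around the callback for us. */
	iterate_supers(shrink_shared_sb_objects, &sc);
}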

> I don't expect people who know nothing about XFS or filesystems to
> understand the complexity of the interactions we are dealing with
> here. Everything is a compromise when it comes to the memory reclaim
> code as there are so many corner cases we have to handle. In this
> situation, perfect is the enemy of good...