Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

From: Dave Chinner
Date: Thu Mar 22 2018 - 01:02:37 EST


On Wed, Mar 21, 2018 at 07:15:14PM +0300, Kirill Tkhai wrote:
> On 20.03.2018 17:34, Dave Chinner wrote:
> > On Tue, Mar 20, 2018 at 04:15:16PM +0300, Kirill Tkhai wrote:
> >> On 20.03.2018 03:18, Dave Chinner wrote:
> >>> On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
> >>>> On 17.03.2018 00:39, Dave Chinner wrote:
> >>> Actually, it is fair, because:
> >>>
> >>> /* proportion the scan between the caches */
> >>> dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
> >>> inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
> >>> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
> >>>
> >>> This means that if the number of objects in the memcg aware VFS
> >>> caches are tiny compared to the global XFS inode cache, then they
> >>> only get a tiny amount of the total scanning done by the shrinker.
> >>
> >> This is just wrong. If you have two memcgs, where the first is
> >> charged for xfs dentries and inodes while the second is not, the
> >> second will respond to xfs shrink just the same as the first.
> >
> > That makes no sense to me. Shrinkers are per-sb, so they only
> > shrink per-filesystem dentries and inodes. If your memcgs are on
> > different filesystems, then they'll behave according to the size of
> > that filesystem's caches, not the cache of some other filesystem.
>
> But this breaks the isolation purpose of memcg. When someone places
> different services in different pairs of memcg and cpu cgroup, they
> do not want to do foreign work.

Too bad. Filesystems *break memcg isolation all the time*.

Think about it. The filesystem journal is owned by the filesystem,
not the memcg that is writing a transaction to it. If one memcg
does lots of metadata modifications, it can consume all the journal
space and starve itself and all other memcgs of journal space while
the filesystem does writeback to clear it out.

Another example: if one memcg is manipulating free space in a
particular allocation group, other memcgs that need to manipulate
space in that allocation group will block and be prevented from
operating until that memcg finishes its work.

Another example: inode IO in XFS is done in clusters of 32. Read or
write any inode in a cluster, and we read/write all the other
inodes in that cluster, too. Hence if we need to read an inode and
the cluster is busy because the filesystem journal needed flushing
or another inode in the cluster is being unlinked (e.g. by a process
in a different memcg), then the read in the first memcg will block
until whatever is being done on the buffer is complete. IOWs, even
at the physical on-disk inode level we violate memcg resource
isolation principles.

I can go on, but I think you get the picture: Filesystems are full
of global structures whose access arbitration mechanisms
fundamentally break the concept of operational memcg isolation.

With that in mind, we really don't care that the shrinker does
global work and violates memcg isolation principles, because we
violate them everywhere. IOWs, if we try to enforce memcg
isolation in the shrinker, then we can't guarantee forwards progress
in memory reclaim because we end up with multi-dimensional memcg
dependencies at the physical layers of the filesystem structure....

I don't expect people who know nothing about XFS or filesystems to
understand the complexity of the interactions we are dealing with
here. Everything is a compromise when it comes to the memory reclaim
code as there are so many corner cases we have to handle. In this
situation, perfect is the enemy of good....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx