On Sat, Aug 01, 2020 at 12:25:40PM +0200, Donald Buczek wrote:
> On 01.08.20 00:32, Dave Chinner wrote:
> > On Fri, Jul 31, 2020 at 01:27:31PM +0200, Donald Buczek wrote:
> > > Dear Linux people,
> > >
> > > we have a backup server with two xfs filesystems, each on a 101.9 TB
> > > md-raid6 device (16 * 7.3 TB disks). The current Linux version is
> > > 5.4.54.
> > >
> > > root:done:/home/buczek/linux_problems/shrinker_semaphore/# cat /proc/meminfo
> > > MemTotal:       263572332 kB
> >
> > 256GB of RAM.
> >
> > > MemFree:          2872368 kB
> > > MemAvailable:   204193824 kB
> >
> > 200GB "available".
> >
> > > Buffers:             2568 kB
> > > Cached:         164931356 kB
> >
> > 160GB in page cache.
> >
> > > KReclaimable:    40079660 kB
> > > Slab:            49988268 kB
> > > SReclaimable:    40079660 kB
> >
> > 40GB in reclaimable slab objects.
> >
> > IOWs, you have no free memory in the machine and so allocation
> > will frequently be dipping into memory reclaim to free up page cache
> > and slab caches to make memory available.
> >
> > > xfs_inode 30978282 31196832 960 4 1 : tunables 54 27 8 : slabdata 7799208 7799208 434
> >
> > Yes, 30 million cached inodes.
> >
> > > bio_integrity_payload 29644966 30203481 192 21 1 : tunables 120 60 8 : slabdata 1438261 1438261 480
> >
> > Either there is a memory leak in this slab, or it is shared with
> > something like the xfs_ili slab, which would indicate that most
> > of the cached inodes have been dirtied in memory at some point in
> > time.
> I think you are right here:
>
> crash> p $s->name
> $84 = 0xffffffff82259401 "bio_integrity_payload"
> crash> p $s->refcount
> $88 = 8
> crash> p $s
> $92 = (struct kmem_cache *) 0xffff88bff92d2bc0
> crash> p sizeof(xfs_inode_log_item_t)
> $93 = 192
> crash> p $s->object_size
> $94 = 192
>
> So if I understand you correctly, this is expected behavior with
> this kind of load, and conceptual changes are already scheduled for
> kernel 5.9. I don't understand most of it, but isn't it true that
> with the planned changes the impact might be better confined to
> the filesystem, so that the performance of other areas of the
> system might improve?

What the changes in 5.9 will do is remove the direct memory reclaim
latency that comes from waiting on IO in the shrinker. Hence you
will no longer see this problem from applications doing memory
allocation, i.e. they'll get some other memory reclaimed without
blocking (e.g. page cache or clean inodes), and so the specific
latency symptom you are seeing will go away.

Which means that dirty inodes in memory will continue to build up
until the next constraint is hit, and then it will go back to having
unpredictable large latencies while waiting for inodes to be written
back to free up whatever resource the filesystem has run out of.

That resource will, most likely, be filesystem journal space. Every
fs modification needs to reserve sufficient journal to complete
before the modification starts. Hence if the journal fills, any
modification to the fs will block waiting on dirty inode writeback
to release space in the journal....
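
To make the ordering concrete, the usual shape of an XFS metadata
update looks something like this (a simplified sketch, not verbatim
kernel code: the helpers and the tr_ichange reservation are real, the
function itself is made up):

STATIC int
xfs_example_ichange(
	struct xfs_mount	*mp,
	struct xfs_inode	*ip)
{
	struct xfs_trans	*tp;
	int			error;

	/*
	 * Reserve journal space *before* anything is modified. If the
	 * journal is full, this is where the caller sleeps, waiting
	 * for metadata writeback to free up space in the log.
	 */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
	if (error)
		return error;

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);

	/* ... modify the inode core ... */
	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);

	/*
	 * Commit goes to the in-memory CIL; the journal space itself
	 * is only freed once the logged changes are written back in
	 * place and the tail of the log can move forward.
	 */
	return xfs_trans_commit(tp);
}
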
You might be lucky and the backup process is slow enough that the
disk subsystem can keep up with the rate of ingest of new data and
so you never hit this limitation. However, the reported state of the
machine and the amount of RAM it has for caching say to me that the
underlying problem is that ingest is far faster than the filesystem
and disk subsystem can sink...

A solution to this problem might be to spread the backups out over a
wider timeframe, so that there isn't a sustained heavy load at 3am
when every machine is scheduled to be backed up at the same time...

> I'd love to test that with our load, but I don't want to risk our
> backup data, and it would be difficult to produce the same load on a
> toy system. The patch set is not yet ready to be tested on production
> data, is it?

Not unless you like testing -rc1 kernels in production :)

Cheers,

Dave.