It would be good to understand why the rcache doesn't stabilize. It could be
a bug, or it might just need some tuning.
In strict mode, if a driver does Alloc-Free-Alloc and the first alloc
misses the rcache, the second allocation hits it. The same sequence in
non-strict mode misses the cache twice, because the IOVA is added to the
flush queue on Free.
So rather than AFAFAF... we get AAA...FFF..., with the frees only landing
once the fq_timer triggers or the FQ is full.
Interestingly the FQ size is 2x IOVA_MAG_SIZE, so we
could allocate 2 magazines worth of fresh IOVAs before alloc starts
hitting the cache. If a job allocates more than that, some magazines are
going to the depot, and with multi-CPU jobs those will get used on other
CPUs during the next alloc bursts, causing the progressive increase in
rcache consumption. I wonder if setting IOVA_MAG_SIZE > IOVA_FQ_SIZE would
help IOVA reuse?
Then again I haven't worked out the details, might be entirely wrong. I'll
have another look next week.
I did start digging into the data (thanks for that!) before Christmas, but between being generally frazzled and trying to remember how to write Perl to massage the numbers out of the log dump, I never got round to responding - sorry.
The partial thoughts that I can recall right now:

- Firstly, the total numbers of IOVAs are actually pretty meaningless; it really needs to be broken down by size (that's where my Perl-hacking stalled...).

- Secondly, the pattern is far more than just a steady increase - the CPU rcache count looks to be heading asymptotically towards ~65K IOVAs all the time, representing (IIRC) two sizes being in heavy rotation, while the depot is happily ticking along in a steady state as expected, until it suddenly explodes out of nowhere.

- Thirdly, I'd really like to see instrumentation of the flush queues at the same time, since I think they're the real culprit.
My theory so far is that everyone is calling queue_iova() frequently enough to keep the timer at bay and their own queues drained. Then at the ~16H mark, *something* happens that pauses unmaps long enough for the timer to fire, and at that point all hell breaks loose.
The depot is suddenly flooded with IOVAs of *all* sizes, indicative of all the queues being flushed at once (note that the two most common sizes have been hovering perilously close to "full" the whole time), but then, crucially, *that keeps happening*. My guess is that the load of fq_flush_timeout() slows things down enough that the timer then keeps getting the chance to expire and repeat the situation.
The main conclusion I draw from this is the same one that was my initial gut feeling; that MAX_GLOBAL_MAGS = 32 is utter bollocks.
The CPU rcache capacity scales with the number of CPUs; the flush queue capacity scales with the number of CPUs; it is nonsensical that the depot size does not correspondingly scale with the number of CPUs (I note that the testing on the original patchset cites a 16-CPU system, where that depot capacity is conveniently equal to the total rcache capacity).
Now yes, purging the rcaches when the depot is full does indeed help mitigate this scenario. I assume it provides enough of a buffer that the regular free_iova_fast() calls don't hit queue_iova() for a while (and gives fq_ring_free() some reprieve on the CPU handling the timeout), leaving enough leeway for the flood to finish before anyone starts hitting queues/locks/etc. and stalling again, and thus breaking the self-perpetuating timeout cycle. But that's still only a damage limitation exercise! It's planning for failure to just lie down and assume that the depot is going to be full if fq_flush_timeout() ever fires, because it's something like an order of magnitude smaller than the flush queue capacity (even for a uniform distribution of IOVA sizes) on super-large systems.
I'm honestly tempted to move my position further towards a hard NAK on this approach, because all the evidence so far points to it being a bodge around a clear and easily-fixed scalability oversight. At the very least I'd now want to hear a reasoned justification for why you want to keep the depot at an arbitrary fixed size while the whole rest of the system scales (I'm assuming that's the intent, since my previous suggestion to try changes in that area seems to have been ignored).