Re: Regarding patch "block/blk-mq: Don't complete locally if capacities are different"
From: Qais Yousef
Date: Sun Sep 01 2024 - 13:25:46 EST
On 08/13/24 17:20, Christian Loehle wrote:
> On 8/9/24 01:23, Qais Yousef wrote:
> > On 08/05/24 11:18, Christian Loehle wrote:
> >
> >>> My understanding of rq_affinity=1 is to match the perf of the requester. Given
> >>> that the characteristic of HMP systems is that power has an equal importance to
> >>> perf (I think this has now become true for all systems by the way), saying that
> >>> the match in one direction is better than the other is sort of forcing a policy
> >>> of perf first, which I don't think is a good thing to enforce. We don't have
> >>> enough info to decide at this level. And our users care about both.
> >>
> >> I would argue rq_affinity=1 matches the perf, so that flag should already bias
> >> slightly in favor of perf over power?
> >
> > Not on this type of system. If perf was the only thing that mattered, you would
> > just use equally big CPUs. Balancing perf and power is important on these
> > systems, and I don't think we have enough info to decide which decision is best
> > when capacities are not the same. Matching the perf level the requester is
> > running at makes sense when rq_affinity=1.
>
> Well you could still want an "IO performance always beats power considerations"
> policy and still go HMP, because sometimes for non-IO you prefer power, but I
> agree that we don't have enough information about what the user wants from the
> system/kernel.
The flag was to keep requester and completion on the same LLC. Note the keyword
same. Assume L3 doesn't exist and L2 is the LLC, which means the capacity level
and the LLC both map to the cluster in the same way.

Then note that the capability of the LLC differs, since the big clusters tend
to have a larger L2. I think modern server SoCs can end up with a complex LLC
topology where not all parts have the same performance.

Keeping requester/completion on the same capacity keeps the old behavior. To
teach it to do more, we need sensible extensions based on sensible use cases,
not hacks that overload rq_affinity=1 with complex behavior.
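
To make that concrete, here is a rough sketch (from memory, not the verbatim
blk-mq code, and assuming the cpus_share_cache()/cpus_equal_capacity() helpers)
of the shape of the completion routing decision for rq_affinity=1 after the
patch:

#include <linux/smp.h>                  /* cpus_share_cache() */
#include <linux/sched/topology.h>       /* cpus_equal_capacity() */

/*
 * Sketch of the rq_affinity=1 decision: complete locally only if the
 * completing CPU shares the LLC with the submitting CPU *and* has the
 * same capacity; otherwise IPI back to the submitter. Not the verbatim
 * blk_mq_complete_need_ipi(), just the shape of the check.
 */
static bool completion_needs_ipi_sketch(int completing_cpu, int submitting_cpu)
{
        /* Same CPU: nothing to route, complete right here. */
        if (completing_cpu == submitting_cpu)
                return false;

        /*
         * "Same" used to mean same LLC only; with this patch it also
         * means same capacity, so a big<->little mismatch goes back to
         * the submitter instead of completing locally.
         */
        if (cpus_share_cache(completing_cpu, submitting_cpu) &&
            cpus_equal_capacity(completing_cpu, submitting_cpu))
                return false;

        return true;
}

IOW the patch only tightens the definition of "same"; it doesn't try to pick a
better CPU than the submitter's.
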
>
> >
> >> Although the actual effect on power probably isn't that significant, given
> >> that the (e.g. big) CPU has submitted the IO and will be woken up again soon,
> >> so you can almost ignore a potential idle wakeup, and the actual CPU time
> >> spent in the block completion is pretty short of course.
> >>
> >>> If no matching is required, it makes sense to set rq_affinity to 0. When
> >>> matching is enabled, we need to rely on per-task iowait boost to help the
> >>> requester run on a bigger CPU, and naturally the completion will follow when
> >>> rq_affinity=1. If the requester doesn't need the big perf, but the irq
> >>> triggered on a bigger core, I struggle to understand why it is good for the
> >>> completion to run on a bigger core without the requester also being on a
> >>> similarly big core to truly maximize perf.
> >>
> >> So first of all, per-task iowait boosting has nothing to do with it IMO.
> >
> > It has. If the perf is not good because the requester is running on a little
> > core, the requester needs to move up to ensure the overall IO perf is better.
>
> See below, but also:
> "the requester needs to move up to ensure the overall IO perf is better" is
> just not true; with asynchronous IO submission done right, the submission
I think it is true for rq_affinity=1, which is the context of the discussion
and the patch.
> runtime isn't critical to the IO throughput, and therefore it should run in
> the most power-efficient way.
> This can be observed e.g. with any io_uring fio workload with significant
> iodepth (and possibly multi-threading).
> Completion may be a different story, depending on the device stack: if we're
> dealing with !MCQ, then the completion path (irq + block layer completion)
> is absolutely critical.
> For any mmc / ufs<4.0 system the performance difference between
> fio --name=little --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$LITTLE_CPUS
> and
> fio --name=big --filename=/dev/sda --runtime=10 --rw=randread --bs=4k --ioengine=io_uring --numjobs=4 --iodepth=32 --group_reporting --cpus_allowed=$BIG_CPUS
> is (usually) only down to the completion path, and setting the irq affinity of
> /dev/sda to $BIG_CPUS will make the difference disappear (with rq_affinity=0,
> and assuming the LLC is the same).
> Running the submission on little CPUs will usually be the most power-efficient
> way then.
From the rq_affinity=1 context, how is it supposed to discern all of that? If
there's a good answer to this, then this is the direction that should be taken
to handle it transparently. AFAICT we don't have this info.
>
> >
> >> Plenty of IO workloads build up utilization perfectly fine.
> >
> > These ones have no problem, no? They should migrate to a big core and the
> > completion will follow them when they move.
>
> So if I understood Manish correctly, the only reason they want the completion
> to run on a bigger CPU than the submission is that the submission is already
> saturating the CPU, so the submission's utilization is not an issue at all.
> They don't want to run the submission on big, though, because of power
> considerations.
Is this what rq_affinity=1 means? This use case is rq_affinity=0. IOW, custom
affinity management.
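
To spell out what I mean by custom affinity management, here is roughly (from
memory, not the verbatim sysfs store code; the wrapper name is made up) how the
three rq_affinity values map onto the queue flags:

#include <linux/blkdev.h>

/*
 * Rough sketch of what echoing 0/1/2 into /sys/block/<dev>/queue/rq_affinity
 * ends up meaning (hypothetical helper, not the actual sysfs store code):
 *
 *   0 - no steering: the completion runs wherever the irq landed, so the
 *       admin controls placement entirely via the irq affinity
 *   1 - complete "near" the submitter (same LLC, and with this patch also
 *       same capacity)
 *   2 - force the completion onto the submitting CPU itself
 */
static void rq_affinity_sketch(struct request_queue *q, unsigned long val)
{
        if (val == 2) {
                blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
                blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
        } else if (val == 1) {
                blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
                blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
        } else {
                blk_queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
                blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
        }
}

With 0 plus the device's irq affinity pinned to the big CPUs, Manish gets
big-core completions without redefining what =1 means.
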
>
> >
> >> I wouldn't necessarily consider the setup "requester at little perf,
> >> irq+completion at big perf" invalid; it does decrease IO latency for the
> >> application.
> >
> > I didn't say invalid. But it is not something we can guess automatically when
> > rq_affinity=1. We don't have enough info to judge. The only info we have is
> > that the requester that originated the request is running at a different perf
> > level (whether higher or lower), so we follow it.
> >
> Anyway, Manish's problem should be solved by rq_affinity=0 in that case (with
> the irq affinities set to the big CPUs, the completion will then run on the
> irq CPU), and "rq_affinity=1 <=> equal capacity CPU" is the correct
> interpretation; is that more or less agreed upon now?
>
I think this is the sensible route. The sensible extension I foresee is
teaching it how to discern the different cases, rather than adding a hacky
flag that confuses what rq_affinity=1 means.