Re: [PATCH V1 0/3] Enable UFS MCQ support for SM8650 and SM8750

From: Neil Armstrong
Date: Thu Jul 31 2025 - 04:50:39 EST


Hi,

On 30/07/2025 10:22, Ram Kumar Dwivedi wrote:
> This patch series enables Multi-Circular Queue (MCQ) support for the UFS
> host controller on Qualcomm SM8650 and SM8750 platforms. MCQ is a modern
> queuing model that improves performance and scalability by allowing
> multiple hardware queues.
> 
> Although MCQ support has been present in the UFS driver for several years,
> this is the first time it is being enabled via Device Tree for these
> platforms.
> 
> Patch 1 updates the device tree bindings to allow the additional register
> regions and reg-names required for MCQ operation.
> 
> Patches 2 and 3 update the device trees for SM8650 and SM8750 respectively
> to enable MCQ by adding the necessary register mappings and MSI parent.
> 
> Tested on internal hardware for both platforms.
> 
> Palash Kambar (1):
>   arm64: dts: qcom: sm8750: Enable MCQ support for UFS controller
> 
> Ram Kumar Dwivedi (2):
>   dt-bindings: ufs: qcom: Add MCQ support to reg and reg-names
>   arm64: dts: qcom: sm8650: Enable MCQ support for UFS controller
> 
>  .../devicetree/bindings/ufs/qcom,ufs.yaml | 21 ++++++++++++-------
>  arch/arm64/boot/dts/qcom/sm8650.dtsi      |  9 +++++++-
>  arch/arm64/boot/dts/qcom/sm8750.dtsi      | 10 +++++++--
>  3 files changed, 29 insertions(+), 11 deletions(-)


I ran some tests on the SM8650-QRD and it works, so please add my:
Tested-by: Neil Armstrong <neil.armstrong@xxxxxxxxxx> # on SM8650-QRD

I ran some fio tests comparing v6.15, v6.16 (with threaded IRQs), and
next + MCQ support; here is my analysis of the results:

Significant performance gains in write operations with multiple jobs:
The MCQ kernel shows a substantial improvement in both IOPS and bandwidth
for write operations with 8 jobs: average IOPS climb from ~105.9k (v6.16)
to ~149.1k and bandwidth from 433.8 to 610.4 MB/s, roughly a 40% gain.

Moderate improvement in single-job operations (read and write):
Relative to v6.15, MCQ improves single-job reads (~4.5k -> ~5.0k avg IOPS)
and writes (~6.6k -> ~7.5k avg IOPS), although v6.16 with threaded IRQs
still leads in both single-job cases.

Slight decrease in read operations with multiple jobs:
Interestingly, 8-job reads dip slightly below v6.15 (64.3k -> 63.0k avg
IOPS, 263.6 -> 258.2 MB/s), though they stay marginally ahead of v6.16.

The raw results are:
Board: sm8650-qrd

read / 1 job
                 v6.15       v6.16    next+mcq
iops (min)    3,996.00    5,921.60    4,661.20
iops (max)    4,772.80    6,491.20    5,027.60
iops (avg)    4,526.25    6,295.31    4,979.81
cpu % usr         4.62        2.96        5.68
cpu % sys        21.45       17.88       25.58
bw (MB/s)        18.54       25.78       20.40

read / 8 jobs
                 v6.15       v6.16    next+mcq
iops (min)   51,867.60   51,575.40   56,818.40
iops (max)   67,513.60   64,456.40   65,379.60
iops (avg)   64,314.80   62,136.76   63,016.07
cpu % usr         3.98        3.72        3.85
cpu % sys        16.70       17.16       14.87
bw (MB/s)       263.60      254.40      258.20

write / 1 job
                 v6.15       v6.16    next+mcq
iops (min)    5,654.80    8,060.00    7,117.20
iops (max)    6,720.40    8,852.00    7,706.80
iops (avg)    6,576.91    8,579.81    7,459.97
cpu % usr         7.48        3.79        6.73
cpu % sys        41.09       23.27       30.66
bw (MB/s)        26.96       35.16       30.56

write / 8 jobs
                 v6.15       v6.16    next+mcq
iops (min)   84,687.80   95,043.40  114,054.00
iops (max)  107,620.80  113,572.00  164,526.00
iops (avg)   97,910.86  105,927.38  149,071.43
cpu % usr         5.43        4.38        2.88
cpu % sys        21.73       20.29       16.09
bw (MB/s)       400.80      433.80      610.40

The test script is:
for rw in read write ; do
    echo "rw: ${rw}"
    for jobs in 1 8 ; do
        echo "jobs: ${jobs}"
        for it in $(seq 1 5) ; do
            fio --name=rand${rw} --rw=rand${rw} \
                --ioengine=libaio --direct=1 \
                --bs=4k --numjobs=${jobs} --size=32m \
                --runtime=30 --time_based --end_fsync=1 \
                --group_reporting --filename=/dev/disk/by-partlabel/super \
                | grep -E '(iops|sys=|READ:|WRITE:)'
            sleep 5
        done
    done
done
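
Before benchmarking I'd also suggest double-checking that MCQ is really
active; a minimal sketch, assuming the UFS LU shows up as sda and relying
on the ufshcd-core use_mcq_mode parameter plus the blk-mq sysfs layout:

# MCQ is gated by the ufshcd-core module parameter (defaults to on)
cat /sys/module/ufshcd_core/parameters/use_mcq_mode
# With MCQ active the block device should expose several hardware
# queues; adjust "sda" to match your UFS LU
ls /sys/block/sda/mq | wc -l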

Thanks,
Neil