RE: Re: [PATCH v14 0/3] scsi: ufs: Add Host Performance Booster Support

From: Daejun Park
Date: Thu Dec 17 2020 - 20:06:11 EST


Hi, Greg

> > NAND flash memory-based storage devices use Flash Translation Layer (FTL)
> > to translate logical addresses of I/O requests to corresponding flash
> > memory addresses. Mobile storage devices typically have limited RAM and
> > thus lack the memory to keep the whole mapping table.
> > Therefore, mapping tables are partially retrieved from NAND flash on
> > demand, causing random-read performance degradation.
> >
> > To improve random read performance, JESD220-3 (HPB v1.0) proposes HPB
> > (Host Performance Booster) which uses host system memory as a cache for the
> > FTL mapping table. By using HPB, FTL data can be read from host memory
> > faster than from NAND flash memory.
> >
> > The current version only supports the DCM (device control mode).
> > This patch series consists of 3 parts to support the HPB feature.
> >
> > 1) HPB probe and initialization process
> > 2) READ -> HPB READ using cached map information
> > 3) L2P (logical to physical) map management
> >
> > In the HPB probe and init process, the device information of the UFS is
> > queried. After checking supported features, the data structure for the HPB
> > is initialized according to the device information.
> >
> > A read I/O in the active sub-region where the map is cached is changed to
> > HPB READ by the HPB.
> >
> > The HPB manages the L2P map using information received from the
> > device. For an active sub-region, the HPB caches the map through a
> > ufshpb_map request. For an inactive region, the HPB discards the L2P map.
> > When a write I/O occurs in an active sub-region, the associated dirty
> > bitmap is marked dirty to prevent stale reads.
> >
> > HPB is shown to have a performance improvement of 58 - 67% for random read
> > workload. [1]
> >
> > We measured the total start-up time of popular applications and observed
> > the difference made by enabling HPB.
> > The popular applications are 12 game apps and 24 non-game apps. Each
> > target application was launched in order. One cycle consists of running
> > all 36 applications in sequence. We repeated the cycle to observe the
> > performance improvement from L2P mapping cache hits in HPB.
> >
> > The following is the experiment environment:
> > - kernel version: 4.4.0
> > - UFS 2.1 (64GB)
> >
> > Result:
> > +-------+----------+----------+-------+
> > | cycle | baseline | with HPB | diff  |
> > +-------+----------+----------+-------+
> > | 1     | 272.4    | 264.9    | -7.5  |
> > | 2     | 250.4    | 248.2    | -2.2  |
> > | 3     | 226.2    | 215.6    | -10.6 |
> > | 4     | 230.6    | 214.8    | -15.8 |
> > | 5     | 232.0    | 218.1    | -13.9 |
> > | 6     | 231.9    | 212.6    | -19.3 |
> > +-------+----------+----------+-------+
>
> I feel this was buried in the 00 email, shouldn't it go into the 01
> commit changelog so that you can see this?

Sure, I will move this result to the commit log of patch 01.

> But why does the "cycle" matter here?

I think repeating the cycle reduces the influence of other factors that
affect application start-up time.

> Can you run a normal benchmark, like fio, on here so we can get some
> numbers we know how to compare to other systems with, and possibly
> reproduce it ourselves? I'm sure fio will easily show random read
> performance increases, right?

Here is my iozone script:
iozone -r 4k -+n -i2 -ecI -t 16 -l 16 -u 16 \
    -s $IO_RANGE/16 -F mnt/tmp_1 mnt/tmp_2 mnt/tmp_3 mnt/tmp_4 \
    mnt/tmp_5 mnt/tmp_6 mnt/tmp_7 mnt/tmp_8 mnt/tmp_9 mnt/tmp_10 mnt/tmp_11 \
    mnt/tmp_12 mnt/tmp_13 mnt/tmp_14 mnt/tmp_15 mnt/tmp_16

Result:
+----------+--------+---------+
| IO range | HPB on | HPB off |
+----------+--------+---------+
| 1 GB     | 294.8  | 300.87  |
| 4 GB     | 293.51 | 179.35  |
| 8 GB     | 294.85 | 162.52  |
| 16 GB    | 293.45 | 156.26  |
| 32 GB    | 277.4  | 153.25  |
+----------+--------+---------+
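
In case it helps with comparison, the relative difference between the two
columns can be derived from the table above. The following is only a minimal
sketch in Python; the function name relative_gain is illustrative, and the
figures are copied as-is from the table, which does not state its units.

```python
# Figures copied from the iozone result table above.
# Each entry is (IO range, HPB on, HPB off); units are whatever the table uses.
results = [
    ("1 GB", 294.8, 300.87),
    ("4 GB", 293.51, 179.35),
    ("8 GB", 294.85, 162.52),
    ("16 GB", 293.45, 156.26),
    ("32 GB", 277.4, 153.25),
]


def relative_gain(hpb_on, hpb_off):
    """Return the HPB-on figure relative to HPB-off, as a percentage change."""
    return (hpb_on - hpb_off) / hpb_off * 100.0


if __name__ == "__main__":
    for io_range, on, off in results:
        # A positive value means the HPB-on figure is higher than HPB-off.
        print(f"{io_range:>6}: {relative_gain(on, off):+.1f}%")
```

By this calculation the 1 GB range shows a small negative difference with HPB
on, while the 4 GB to 32 GB ranges all show a positive one.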

Thanks,
Daejun