Re: [PATCH 00/16 v3] f2fs: introduce flash-friendly file system

From: Arnd Bergmann
Date: Mon Nov 12 2012 - 11:57:03 EST


On Monday 12 November 2012, Martin Steigerwald wrote:
> On Saturday 10 November 2012, Arnd Bergmann wrote:

> > I would also recommend using flashbench to find out the optimum parameters
> > for your device. You can download it from
> > git://git.linaro.org/people/arnd/flashbench.git
> > In the long run, we should automate those tests and make them part of
> > mkfs.f2fs, but for now, try to find out the erase block size and the number
> > of concurrently used erase blocks on your device using a timing attack
> > in flashbench. The README file in there explains how to interpret the
> > results from "./flashbench -a /dev/sdb --blocksize=1024" to guess
> > the erase block size, although that sometimes doesn't work.
>
> Why do I use a blocksize of 1024 if the kernel reports 512-byte blocks?

The blocksize you pass here is the size of the individual reads that flashbench
sends to the kernel. Because of the algorithm used by flashbench, two hardware
blocks is the smallest size you can use here, and larger blocks tend to be less
reliable for this test case. I should probably change the default.

> [ 3112.144086] scsi9 : usb-storage 1-1.1:1.0
> [ 3113.145968] scsi 9:0:0:0: Direct-Access TinyDisk 2007-05-12 0.00 PQ: 0 ANSI: 2
> [ 3113.146476] sd 9:0:0:0: Attached scsi generic sg2 type 0
> [ 3113.147935] sd 9:0:0:0: [sdb] 4095999 512-byte logical blocks: (2.09 GB/1.95 GiB)
> [ 3113.148935] sd 9:0:0:0: [sdb] Write Protect is off
>
>
> And how do reads give information about the erase block size? Wouldn't writes
> be more conclusive for that? (Having to erase one versus two erase blocks?)

The --open-au tests can be more reliable, but they also take more time and are
harder to interpret. The -a test is faster and often gives an easy answer,
without destroying any data on the device.
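
To see the principle by hand (this is just an illustration of the idea, not
what flashbench literally does), you can time two small raw reads with dd,
one straddling a suspected 4 MB boundary and one entirely inside the block
before it:

$ # 2 KB read straddling the suspected boundary at 4 MB
$ time dd if=/dev/sdb of=/dev/null bs=1024 count=2 skip=$[4*1024-1] iflag=direct
$ # 2 KB read fully inside the preceding block, as a reference
$ time dd if=/dev/sdb of=/dev/null bs=1024 count=2 skip=$[4*1024-3] iflag=direct

If the straddling read is consistently slower, there is probably a boundary at
4 MB. The columns in the -a output are this same comparison done for each
power-of-two alignment: 'on' is the read crossing the boundary, 'pre' and
'post' are the reads next to it, and 'diff' is roughly 'on' minus the average
of the other two.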


> Hmmm, I get very varying results here with said USB stick:
>
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 536870912 pre 1.1ms on 1.1ms post 1.08ms diff 13µs
> align 268435456 pre 1.2ms on 1.19ms post 1.16ms diff 11.6µs
> align 134217728 pre 1.12ms on 1.14ms post 1.15ms diff 9.51µs
> align 67108864 pre 1.12ms on 1.15ms post 1.12ms diff 29.9µs
> align 33554432 pre 1.11ms on 1.17ms post 1.13ms diff 49µs
> align 16777216 pre 1.14ms on 1.16ms post 1.15ms diff 22.4µs
> align 8388608 pre 1.12ms on 1.09ms post 1.06ms diff -2053ns
> align 4194304 pre 1.13ms on 1.16ms post 1.14ms diff 21.7µs
> align 2097152 pre 1.11ms on 1.08ms post 1.1ms diff -18488ns
> align 1048576 pre 1.11ms on 1.11ms post 1.11ms diff -2461ns
> align 524288 pre 1.15ms on 1.17ms post 1.1ms diff 45.4µs
> align 262144 pre 1.11ms on 1.13ms post 1.13ms diff 12µs
> align 131072 pre 1.1ms on 1.09ms post 1.16ms diff -38025ns
> align 65536 pre 1.09ms on 1.08ms post 1.11ms diff -21353ns
> align 32768 pre 1.1ms on 1.08ms post 1.11ms diff -23854ns
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 536870912 pre 1.11ms on 1.13ms post 1.13ms diff 10.6µs
> align 268435456 pre 1.12ms on 1.2ms post 1.17ms diff 61.4µs
> align 134217728 pre 1.14ms on 1.19ms post 1.15ms diff 46.8µs
> align 67108864 pre 1.08ms on 1.15ms post 1.08ms diff 63.8µs
> align 33554432 pre 1.09ms on 1.08ms post 1.09ms diff -4761ns
> align 16777216 pre 1.12ms on 1.14ms post 1.07ms diff 41.4µs
> align 8388608 pre 1.1ms on 1.1ms post 1.09ms diff 7.48µs
> align 4194304 pre 1.08ms on 1.1ms post 1.1ms diff 10.1µs
> align 2097152 pre 1.1ms on 1.11ms post 1.1ms diff 16µs
> align 1048576 pre 1.09ms on 1.1ms post 1.07ms diff 15.5µs
> align 524288 pre 1.12ms on 1.12ms post 1.1ms diff 11µs
> align 262144 pre 1.13ms on 1.13ms post 1.1ms diff 21.6µs
> align 131072 pre 1.11ms on 1.13ms post 1.12ms diff 17.9µs
> align 65536 pre 1.07ms on 1.1ms post 1.1ms diff 11.6µs
> align 32768 pre 1.09ms on 1.11ms post 1.13ms diff -5131ns
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 536870912 pre 1.2ms on 1.18ms post 1.21ms diff -27496ns
> align 268435456 pre 1.22ms on 1.21ms post 1.24ms diff -18972ns
> align 134217728 pre 1.15ms on 1.19ms post 1.14ms diff 42.5µs
> align 67108864 pre 1.08ms on 1.09ms post 1.08ms diff 5.29µs
> align 33554432 pre 1.18ms on 1.19ms post 1.18ms diff 9.25µs
> align 16777216 pre 1.18ms on 1.22ms post 1.17ms diff 48.6µs
> align 8388608 pre 1.14ms on 1.17ms post 1.19ms diff 4.36µs
> align 4194304 pre 1.16ms on 1.2ms post 1.11ms diff 65.8µs
> align 2097152 pre 1.13ms on 1.09ms post 1.12ms diff -37718ns
> align 1048576 pre 1.15ms on 1.2ms post 1.18ms diff 34.9µs
> align 524288 pre 1.14ms on 1.19ms post 1.16ms diff 41.5µs
> align 262144 pre 1.19ms on 1.12ms post 1.15ms diff -52725ns
> align 131072 pre 1.21ms on 1.11ms post 1.14ms diff -68522ns
> align 65536 pre 1.21ms on 1.13ms post 1.18ms diff -64248ns
> align 32768 pre 1.14ms on 1.25ms post 1.12ms diff 116µs
>
> Even when I apply the explanation in the README I do not seem to get a
> clear picture of the stick's erase block size.
>
> The values above seem to indicate to me: I don't care about alignment at all.

I think it's more a case of a device where reading does not easily reveal
the erase block boundaries, because the variance between multiple reads
is much higher than between different positions. You can try again using
"--blocksize=1024 --count=100", which will increase the accuracy of the
test.

On the other hand, the device size of "4095999 512-byte logical blocks"
is quite suspicious: it's an odd number, while the total should be a
multiple of the erase block size. It is one sector less than 1000 2MB blocks
(or 500 4MB blocks, for that matter), but it's not clear whether that sector
is missing at the start or at the end of the drive.
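
You can check the arithmetic in the shell:

$ echo $[1000 * 2*1024*1024 / 512]
4096000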

> With another, likely slower, Intenso 4GB flash stick I get:
>
> [ 3672.512143] scsi 10:0:0:0: Direct-Access Ut165 USB2FlashStorage 0.00 PQ: 0 ANSI: 2
> [ 3672.514469] sd 10:0:0:0: Attached scsi generic sg2 type 0
> [ 3672.514991] sd 10:0:0:0: [sdb] 7897088 512-byte logical blocks: (4.04 GB/3.76 GiB)
> […]

$ factor 7897088
7897088: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 241

Slightly more helpful: this one has 241 16MB blocks, so at least we know that the
erase block size is not larger than 16MB (which would be very unlikely anyway)
and not a multiple of 3.
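
The block size falls directly out of the factorization (16777216 bytes = 16 MB):

$ echo $[7897088 * 512 / 241]
16777216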

> align 16777216 pre 939µs on 903µs post 880µs diff -5972ns
> align 8388608 pre 900µs on 914µs post 923µs diff 2.42µs
> align 4194304 pre 894µs on 886µs post 882µs diff -1563ns
>
> here?
>
> align 2097152 pre 829µs on 890µs post 874µs diff 37.8µs
> align 1048576 pre 899µs on 882µs post 843µs diff 11.1µs
> align 524288 pre 890µs on 887µs post 902µs diff -9005ns
> align 262144 pre 887µs on 887µs post 898µs diff -5474ns
> align 131072 pre 928µs on 895µs post 914µs diff -26028ns
> align 65536 pre 898µs on 898µs post 894µs diff 2.59µs
> align 32768 pre 884µs on 891µs post 901µs diff -1284ns
>
>
> Similar picture. The diffs seem to be mostly quite small, only a few
> microseconds. Or am I misreading something?

Same thing, try again with the options I listed above.

> Then with a quite fast one, a 16 GB Transcend:
>
> [ 4055.393399] sd 11:0:0:0: Attached scsi generic sg2 type 0
> [ 4055.394729] sd 11:0:0:0: [sdb] 31375360 512-byte logical blocks: (16.0 GB/14.9 GiB)
> [ 4055.395262] sd 11:0:0:0: [sdb] Write Protect is off

$ factor 31375360
31375360: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 383

That would be 5*383*8MB, so the erase block size will be 8MB or a power-of-two fraction of it.
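
Again, the block size falls out of the factorization (8388608 bytes = 8 MB):

$ echo $[31375360 * 512 / (5*383)]
8388608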

> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 4294967296 pre 1.28ms on 1.48ms post 1.33ms diff 179µs
> align 2147483648 pre 1.32ms on 1.51ms post 1.33ms diff 181µs
> align 1073741824 pre 1.31ms on 1.46ms post 1.35ms diff 132µs
> align 536870912 pre 1.27ms on 1.52ms post 1.33ms diff 228µs
> align 268435456 pre 1.28ms on 1.46ms post 1.31ms diff 161µs
> align 134217728 pre 1.28ms on 1.44ms post 1.37ms diff 120µs
> align 67108864 pre 1.27ms on 1.44ms post 1.34ms diff 133µs
> align 33554432 pre 1.24ms on 1.42ms post 1.31ms diff 150µs
> align 16777216 pre 1.23ms on 1.46ms post 1.26ms diff 218µs
> align 8388608 pre 1.31ms on 1.5ms post 1.33ms diff 180µs
> align 4194304 pre 1.27ms on 1.45ms post 1.36ms diff 135µs
> align 2097152 pre 1.29ms on 1.37ms post 1.39ms diff 33.7µs
>
> here?
>
> align 1048576 pre 1.31ms on 1.44ms post 1.35ms diff 115µs
> align 524288 pre 1.33ms on 1.39ms post 1.48ms diff -12297ns
> align 262144 pre 1.36ms on 1.42ms post 1.4ms diff 45.6µs
> align 131072 pre 1.37ms on 1.44ms post 1.4ms diff 57.7µs
> align 65536 pre 1.36ms on 1.35ms post 1.33ms diff 4.67µs
> align 32768 pre 1.32ms on 1.38ms post 1.34ms diff 44.1µs
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 4294967296 pre 1.36ms on 1.49ms post 1.34ms diff 139µs
> align 2147483648 pre 1.26ms on 1.48ms post 1.27ms diff 213µs
> align 1073741824 pre 1.26ms on 1.45ms post 1.33ms diff 164µs
> align 536870912 pre 1.22ms on 1.46ms post 1.35ms diff 173µs
> align 268435456 pre 1.34ms on 1.5ms post 1.31ms diff 172µs
> align 134217728 pre 1.34ms on 1.48ms post 1.31ms diff 157µs
> align 67108864 pre 1.29ms on 1.46ms post 1.34ms diff 142µs
> align 33554432 pre 1.28ms on 1.47ms post 1.31ms diff 173µs
> align 16777216 pre 1.26ms on 1.48ms post 1.37ms diff 168µs
> align 8388608 pre 1.31ms on 1.47ms post 1.36ms diff 139µs
> align 4194304 pre 1.26ms on 1.53ms post 1.33ms diff 237µs
> align 2097152 pre 1.34ms on 1.4ms post 1.36ms diff 56.4µs
> align 1048576 pre 1.32ms on 1.35ms post 1.37ms diff 638ns
>
> here?
>
> align 524288 pre 1.29ms on 1.47ms post 1.45ms diff 98.1µs
> align 262144 pre 1.35ms on 1.38ms post 1.42ms diff -11916ns
> align 131072 pre 1.32ms on 1.46ms post 1.4ms diff 100µs
> align 65536 pre 1.35ms on 1.42ms post 1.43ms diff 30.8µs
> align 32768 pre 1.31ms on 1.37ms post 1.33ms diff 51µs
> merkaba:~> /tmp/flashbench -a /dev/sdb
> align 4294967296 pre 1.26ms on 1.49ms post 1.27ms diff 222µs
> align 2147483648 pre 1.25ms on 1.41ms post 1.37ms diff 97.3µs
> align 1073741824 pre 1.26ms on 1.47ms post 1.31ms diff 186µs
> align 536870912 pre 1.25ms on 1.42ms post 1.32ms diff 132µs
> align 268435456 pre 1.2ms on 1.44ms post 1.29ms diff 195µs
> align 134217728 pre 1.27ms on 1.43ms post 1.34ms diff 118µs
> align 67108864 pre 1.25ms on 1.45ms post 1.31ms diff 165µs
> align 33554432 pre 1.22ms on 1.36ms post 1.25ms diff 124µs
> align 16777216 pre 1.24ms on 1.44ms post 1.26ms diff 191µs
> align 8388608 pre 1.22ms on 1.39ms post 1.23ms diff 164µs
> align 4194304 pre 1.23ms on 1.43ms post 1.3ms diff 171µs
> align 2097152 pre 1.26ms on 1.3ms post 1.32ms diff 16.7µs
> align 1048576 pre 1.26ms on 1.27ms post 1.26ms diff 7.91µs
>
> here?
>
> align 524288 pre 1.24ms on 1.3ms post 1.3ms diff 29.2µs
> align 262144 pre 1.25ms on 1.3ms post 1.28ms diff 28.2µs
> align 131072 pre 1.25ms on 1.29ms post 1.28ms diff 24.8µs
> align 65536 pre 1.15ms on 1.24ms post 1.26ms diff 34.5µs
> align 32768 pre 1.17ms on 1.3ms post 1.26ms diff 82.6µs

This one is fairly deterministic, and I would guess 4MB: the 4MB line
consistently shows a much higher diff in the last column than the 2MB line.
For a fast 16 GB stick, I also wouldn't expect erase blocks smaller than 4 MB.
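
To confirm that guess, you could run the baseline write test from above with a
4 MB erase size and check that it keeps full throughput (unlike -a, this test
overwrites data on the device):

$ ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[4*1024*1024]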

> Thing is that me here is not always at the same place :)

If you add a '--count=N' argument, you can have flashbench run the test more
often and average between the runs. The default is 8.
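
For example, to quadruple the number of runs:

$ /tmp/flashbench -a /dev/sdb --blocksize=1024 --count=32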

> > With the correct guess, compare the performance you get using
> >
> > $ ERASESIZE=$[2*1024*1024] # replace with guess from flashbench -a
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=${ERASESIZE}
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=3 --blocksize=4096 --erasesize=${ERASESIZE}
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=5 --blocksize=4096 --erasesize=${ERASESIZE}
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=7 --blocksize=4096 --erasesize=${ERASESIZE}
> > $ ./flashbench /dev/sdb --open-au --open-au-nr=13 --blocksize=4096 --erasesize=${ERASESIZE}
>
> I omit this for now, cause I am not yet sure about the correct guess.

You can also try this test to find out the erase block size if the -a test fails.
Start with the largest possible value you'd expect (16 MB for a modern and fast
USB stick, less if it's older or smaller), and use --open-au-nr=1 to get a baseline:

./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=4096 --erasesize=$[16*1024*1024]

Every device should be able to handle this nicely with maximum throughput. The default is
to start the test at 16 MB into the device to get out of the way of a potential FAT
optimized area. You can change that offset to find where an erase block boundary is.
Adding '--offset=$[24*1024*1024]' will still be fast if the erase block size is 8 MB,
but it will get slower and show more jitter if the size is actually 16 MB, because now
we write a 16 MB section of the drive with an 8 MB misalignment. The next offsets to
try after that would be 20, 18, 17, 16.5, etc. MB, which will be slow for an 8, 4, 2,
and 1 MB erase block size, respectively. You can also reduce the --erasesize argument
there and do

./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=$[16*1024*1024] --offset=$[24*1024*1024]
./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=$[8*1024*1024] --offset=$[20*1024*1024]
./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=$[4*1024*1024] --offset=$[18*1024*1024]
./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=$[2*1024*1024] --offset=$[17*1024*1024]
./flashbench /dev/sdb --open-au --open-au-nr=1 --blocksize=65536 --erasesize=$[1*1024*1024] --offset=$[33*512*1024]

If you have the result from the other test to figure out the maximum value for
'--open-au-nr=N', using that number here will make this test more reliable as well.
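
Putting it all together: once you have settled on an erase block size and the
number of open AUs the drive handles well, a final check could look like this
(the 4 MB and the 5 are placeholders for your own measured values):

$ ./flashbench /dev/sdb --open-au --open-au-nr=5 --blocksize=4096 --erasesize=$[4*1024*1024]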

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/