Re: [PATCH] bcache: consider the fragmentation when update the writeback rate

From: Dongdong Tao
Date: Fri Jan 08 2021 - 03:32:06 EST


Hi Coly,

They were captured over the same length of time, but the meaning of the
timestamp and the time unit on the x-axis are different.
(Sorry, I should have clarified this right after the charts.)

For the latency chart:
The timestamp is the time relative to the beginning of the benchmark,
so the start timestamp is 0 and the unit is milliseconds.

For the dirty data and cache available percent charts:
The timestamp is the UNIX timestamp and the unit is seconds.
I captured the stats every 5 seconds with the script below:
---
#!/bin/sh
# Append "timestamp, dirty_data, cache_available_percent, writeback_rate" to $1.
b=/sys/block/bcache0/bcache
while true; do
    printf '%s, %s, %s, %s\n' "$(date +%s)" "$(cat $b/dirty_data)" \
        "$(cat $b/cache/cache_available_percent)" \
        "$(cat $b/writeback_rate)" >> "$1"
    sleep 5
done
---

Unfortunately, I can't easily make them use the same timestamp, but
I guess I can try to convert the UNIX timestamp to relative time
like the first one.
But if we ignore the values on the x-axis, we can still roughly
compare the charts by the length of the x-axis, since they cover the
same length of time,
and we can see that master's writes start hitting the backing
device when cache_available_percent drops to around 30.
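
As a rough sketch of that conversion: assuming the stats script was
started at the same moment as the fio run, and that each line begins with
the UNIX timestamp followed by ", ", the first field can be rewritten as
milliseconds since the first sample so that it lines up with the latency
log. The helper name to_relative_ms is only illustrative.

```shell
#!/bin/sh
# to_relative_ms [FILE...]
# Rewrite the first field (UNIX seconds) of each line as milliseconds
# elapsed since the first sample; the other fields pass through unchanged.
# Reads standard input when no file is given.
to_relative_ms() {
    awk -F', ' 'BEGIN { OFS = ", " }
        NR == 1 { start = $1 }
        { $1 = ($1 - start) * 1000; print }' "$@"
}
```

Something like `to_relative_ms stats.csv > stats_relative.csv` would then
produce a file whose first column starts at 0 (both file names are
placeholders).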

Regards,
Dongdong


On Fri, Jan 8, 2021 at 12:06 PM Coly Li <colyli@xxxxxxx> wrote:
>
> On 1/7/21 10:55 PM, Dongdong Tao wrote:
> > Hi Coly,
> >
> >
> > Thanks for the reminder. I understand that the rate is only a hint of
> > the throughput: it’s a value used to calculate the sleep time between
> > each round of keys writeback. The higher the rate, the shorter the sleep
> > time; most of the time this means more dirty keys can be written back
> > in a certain amount of time, until the hard disk runs out of speed.
> >
> >
> > Here is the testing data from a run on a 400GB NVME + 1TB NVME HDD
> >
>
> Hi Dongdong,
>
> Nice charts :-)
>
> > Steps:
> >
> > 1. make-bcache -B <HDD> -C <NVME> --writeback
> >
> > 2. sudo fio --name=random-writers --filename=/dev/bcache0
> >    --ioengine=libaio --iodepth=1 --rw=randrw --blocksize=64k,8k
> >    --direct=1 --numjobs=1 --write_lat_log=mix --log_avg_msec=10
> >
> > The fio benchmark command ran for about 20 hours.
> >
>
> The time lengths of the first 3 charts are 7.000e+7; the rest are
> 1.60930e+9. I guess the time length of the I/O latency chart is 1/100 of
> the rest.
>
> Can you also post the latency charts for 1.60930e+9 seconds? Then I can
> compare the latency with dirty data and available cache charts.
>
>
> Thanks.
>
>
> Coly Li
>
>
>
>
>
> >
> > Let’s have a look at the write latency first:
> >
> > Master:
> >
> >
> >
> > Master+the patch:
> >
> > Combine them together:
> >
> > Again, the latency (y-axis) is in nanoseconds and the x-axis is the
> > timestamp in milliseconds. As we can see, the master latency is
> > obviously much higher than the one with my patch once the master bcache
> > hits the cutoff writeback sync. Master isn’t going to get out of this
> > cutoff writeback sync situation: the graph shows it had already been
> > stuck at the cutoff writeback sync for about 4 hours before I finished
> > the testing, and it may stay stuck for days before it can get out of
> > this situation by itself.
> >
> >
> > Note that there are 1 million points for each. Red represents master,
> > green represents master+my patch. Most of them overlap with each
> > other, so it may look like this graph has more red points than green
> > after it hits the cutoff, but that is simply because the latency has
> > scaled to a bigger range which represents the HDD latency.
> >
> >
> >
> > Let’s also have a look at the bcache’s cache available percent and dirty
> > data percent.
> >
> > Master:
> >
> > Master+this patch:
> >
> > As you can see, this patch can avoid it hitting the cutoff writeback sync.
> >
> >
> > As for the improvement of this patch over the first one, let’s
> > take a look at how the writeback rate changes during the run.
> >
> > Patch V1:
> >
> >
> >
> > Patch V2:
> >
> >
> > The y-axis is the value of the rate. V1 is very aggressive, as it jumps
> > instantly from a minimum of 8 to around 10 million. Patch V2 keeps
> > the rate under 5000 during the run, and after the first round of
> > writeback it even stays under 2500. This shows we don’t need to
> > be as aggressive as V1 to get out of the high-fragmentation situation
> > which eventually causes all writes to hit the backing device. This looks
> > very reasonable to me now.
> >
> > Note that the fio command I used consumes buckets quite
> > aggressively, so it had to hit the third stage, which has the highest
> > aggressiveness. I believe this is not the case in a real production env:
> > a real production env won’t consume buckets that aggressively, so I
> > expect stage 3 will not need to be hit very often.
> >
> >
> > As discussed, I'll run multiple block size testing on at least 1TB NVME
> > device later.
> > But it might take some time.
> >
> >
> > Regards,
> > Dongdong
> >
> > On Tue, Jan 5, 2021 at 12:33 PM Coly Li <colyli@xxxxxxx> wrote:
> >
> > On 1/5/21 11:44 AM, Dongdong Tao wrote:
> > > Hey Coly,
> > >
> > > This is the second version of the patch, please allow me to explain a
> > > bit for this patch:
> > >
> > > We accelerate the rate in 3 stages with different aggressiveness. The
> > > first stage starts when the dirty buckets percent rises above
> > > BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second above
> > > BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57) and the third above
> > > BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default, the first stage
> > > tries to write back the amount of dirty data in one bucket (on average)
> > > in (1 / (dirty_buckets_percent - 50)) second, the second stage tries to
> > > write back the amount of dirty data in one bucket in (1 /
> > > (dirty_buckets_percent - 57)) * 200 milliseconds, and the third stage
> > > tries to write back the amount of dirty data in one bucket in (1 /
> > > (dirty_buckets_percent - 64)) * 20 milliseconds.
> > >
> > > As we can see, there are two strategies for increasing the writeback
> > > aggressiveness. The first is tied to the stage: the first stage is
> > > the easy-going phase, whose initial rate tries to write back the
> > > dirty data of one bucket in 1 second; the second stage is a bit more
> > > aggressive, with an initial rate that tries to write back the dirty
> > > data of one bucket in 200 ms; the last stage is even more aggressive,
> > > with an initial rate that tries to write back the dirty data of one
> > > bucket in 20 ms. This makes sense: if the preceding stage couldn’t get
> > > the fragmentation to a fine state, then the next stage should increase
> > > the aggressiveness accordingly, and the later stage is also closer to
> > > bch_cutoff_writeback_sync. The second strategy is tied to the
> > > increase of the dirty bucket percent within each stage: the first
> > > strategy controls the initial writeback rate of each stage, while this
> > > one scales the rate up from the initial rate, giving initial_rate
> > > * (dirty bucket percent - BCH_WRITEBACK_FRAGMENT_THRESHOLD_X).
> > >
> > > The initial rate can be controlled by 3 parameters:
> > > writeback_rate_fp_term_low, writeback_rate_fp_term_mid and
> > > writeback_rate_fp_term_high. They default to 1, 5 and 50, and users can
> > > adjust them based on their needs.
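
To make the staging concrete, here is a minimal shell sketch of how
fp_term could be derived from the dirty bucket percent, going only by the
description above. It is not the patch code: the function name fp_term,
its argument order, and the handling of the exact boundary values (50, 57
and 64 are treated as still belonging to the lower stage) are assumptions
made for illustration.

```shell
#!/bin/sh
# fp_term PERCENT [LOW MID HIGH]
# Print the fp_term for a dirty bucket percentage, using the three
# stages (thresholds 50, 57, 64) and the default terms 1, 5, 50.
# Prints 0 when the percentage has not risen above the first threshold.
fp_term() {
    pct=$1
    low=${2:-1}     # stands in for writeback_rate_fp_term_low
    mid=${3:-5}     # stands in for writeback_rate_fp_term_mid
    high=${4:-50}   # stands in for writeback_rate_fp_term_high
    if [ "$pct" -le 50 ]; then
        echo 0
    elif [ "$pct" -le 57 ]; then
        echo $(( low * (pct - 50) ))
    elif [ "$pct" -le 64 ]; then
        echo $(( mid * (pct - 57) ))
    else
        echo $(( high * (pct - 64) ))
    fi
}
```

With the defaults, `fp_term 58` prints 5, i.e. the target is one bucket's
worth of dirty data per 1/5 second (200 ms), and `fp_term 66` prints 100,
i.e. one bucket's worth per 10 ms.
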
> > >
> > > The reason I chose 50, 57 and 64 as the threshold values is that
> > > GC must be triggered at least once during each stage, due to
> > > “sectors_to_gc” being set to 1/16 (6.25 %) of the total cache size. So
> > > the hope is that the first and second stages can get us back to good
> > > shape in most situations by smoothly writing back the dirty data without
> > > putting too much stress on the backing devices, but it might still enter
> > > the third stage if the bucket consumption is very aggressive.
> > >
> > > This patch uses (dirty / dirty_buckets) * fp_term to calculate the rate.
> > > This formula means that we want to write back (dirty / dirty_buckets) in
> > > 1/fp_term second. fp_term is calculated by the aggressiveness
> > > controller above, “dirty” is the current number of dirty sectors and
> > > “dirty_buckets” is the current number of dirty buckets, so (dirty /
> > > dirty_buckets) is the average number of dirty sectors in one bucket;
> > > the value is between 0 and 1024 for the default setting. So this formula
> > > basically gives a hint to reclaim one bucket in 1/fp_term second. With
> > > this semantic, we can have a lower writeback rate when the amount of
> > > dirty data is decreasing, and overcome the fact that the number of dirty
> > > buckets is always increasing unless GC happens.
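
The formula can be sketched in shell arithmetic. This is a hedged
illustration of the arithmetic only, not the kernel code: fragment_rate is
a hypothetical name, integer division is used throughout, and a
dirty_buckets count of zero is mapped to 0 to avoid dividing by zero (an
assumption of this sketch).

```shell
#!/bin/sh
# fragment_rate DIRTY_SECTORS DIRTY_BUCKETS FP_TERM
# Print (dirty / dirty_buckets) * fp_term, i.e. the average number of
# dirty sectors per bucket scaled by fp_term.
fragment_rate() {
    dirty=$1
    dirty_buckets=$2
    term=$3
    if [ "$dirty_buckets" -le 0 ]; then
        # Nothing to average over; treat as no rate hint.
        echo 0
        return
    fi
    echo $(( dirty / dirty_buckets * term ))
}
```

For example, `fragment_rate 512000 1000 5` prints 2560: an average of 512
dirty sectors per bucket times an fp_term of 5, which reads as a hint to
write back one bucket's worth of dirty data every 1/5 second.
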
> > >
> > > Compared to the first patch:
> > > The first patch tries to write back all the data in 40 seconds, which
> > > results in a very high writeback rate when the amount of dirty
> > > data is big; this is mostly true for large cache devices. The basic
> > > problem is that the semantic of that patch is not ideal, because we
> > > don’t really need to write back all dirty data in order to solve this
> > > issue, and the instant large increase of the rate is something I feel we
> > > had better avoid (I like things to change smoothly unless there is no
> > > choice :)).
> > >
> > > Before I got to this new patch (which I believe is optimal for me
> > > atm), there were many tuning/testing iterations. E.g. I’ve tried to
> > > tune the algorithm to write back ⅓ of the dirty data in a certain amount
> > > of seconds, to write back 1/fragment of the dirty data in a certain
> > > amount of seconds, and to write back all the dirty data only in the
> > > error_buckets (error buckets = dirty buckets - 50% of the total buckets)
> > > in a certain amount of time. However, those all turned out not to be
> > > ideal; only the semantic of this patch makes much sense to me and allows
> > > me to control the rate in a more precise way.
> > >
> > > Testing data:
> > > I'll provide the visualized testing data in the next couple of days,
> > > with a 1TB NVME device as cache but with an HDD as the backing device,
> > > since it's what we mostly use in production env.
> > > I have the data for 400GB NVME; let me prepare it and send it to you
> > > for review.
> > [snipped]
> >
> > Hi Dongdong,
> >
> > Thanks for the update and continuous effort on this idea.
> >
> > Please keep in mind the writeback rate is just an advisory rate for the
> > writeback throughput; in a real workload, changing the writeback rate
> > number does not obviously change the writeback throughput.
> >
> > Currently I feel this is an interesting and promising idea for your
> > patch, but I am not able to say whether it will take effect in a real
> > workload, so we do need convincing performance data on a real workload
> > and configuration.
> >
> > Of course I may also help with the benchmark, but my to-do list is long
> > enough that it may take a very long time.
> >
> > Thanks.
> >
> > Coly Li
> >
>