Re: [dm-devel] [PATCH v2] staging: writeboost: Add dm-writeboost
From: Akira Hayakawa
Date: Sat Dec 13 2014 - 21:12:54 EST
Hi,
> The major reason is that it needs to read the full 512KB segment to calculate the
> checksum and know whether the log is half written.
> (Reading a 500GB SSD that does 500MB/sec seqread takes 1000 secs.)
I've just measured how long resuming the cache takes.
I used a 2GB SSD as the cache device.
512KB seqread over the cache device:
8.252sec (242MB/sec)
Resume when all caches are dirty:
10.339sec
Typically, if you use a 128GB SSD, it will take 5-10 minutes.
As I predicted, the resume time is close to the seqread time.
In other words, it's fully IO-bound.
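
As a rough sketch of the arithmetic behind these numbers: if resume has to
read every segment once, the time is roughly the cache size divided by the
sequential read throughput. The function name and the use of decimal units
(1GB = 1000MB) are assumptions made for illustration only, not taken from
the driver.

```python
MB = 1000 ** 2  # decimal units (assumption); 2GB in 8.252 sec then gives ~242MB/sec
GB = 1000 ** 3


def estimated_resume_seconds(cache_bytes, seqread_bytes_per_sec):
    """Rough resume time if every segment of the cache is read once."""
    return cache_bytes / seqread_bytes_per_sec


if __name__ == "__main__":
    # Cache sizes and throughputs mentioned in this thread.
    for size_gb, mb_per_sec in [(2, 242), (128, 242), (500, 500)]:
        secs = estimated_resume_seconds(size_gb * GB, mb_per_sec * MB)
        print(f"{size_gb}GB at {mb_per_sec}MB/sec: {secs:.1f} sec ({secs / 60:.1f} min)")
```

This only covers the sequential read part. It ignores the header scan
described next, so the measured resume time is somewhat longer.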
(If you read the code, you will notice that it first searches for the
oldest log as the starting point. These are only 4KB metadata reads, but they take some time.
The remaining 2 seconds are probably spent on this.)
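
The two passes can be sketched roughly as follows. This is a toy model
under stated assumptions, not the code in replay_log_on_cache(): the
header layout (an 8-byte log id followed by a 4-byte checksum), the use
of zlib.crc32, the in-memory "device", and the policy of stopping at the
first half-written log are all hypothetical. Only the 512KB segment size
and the 4KB metadata header come from this thread.

```python
import struct
import zlib

SEGMENT_SIZE = 512 * 1024  # one log segment (from the thread)
HEADER_SIZE = 4 * 1024     # 4KB metadata section at the head (from the thread)
HEADER_FMT = "<QI"         # hypothetical layout: log id, CRC32 of the data area


def make_segment(log_id, data):
    """Build one segment: padded header followed by the padded data area."""
    data = data.ljust(SEGMENT_SIZE - HEADER_SIZE, b"\0")
    header = struct.pack(HEADER_FMT, log_id, zlib.crc32(data))
    return header.ljust(HEADER_SIZE, b"\0") + data


def read_header(device, index):
    """Header-only read: returns (log id, stored checksum)."""
    return struct.unpack_from(HEADER_FMT, device, index * SEGMENT_SIZE)


def segment_is_complete(device, index):
    """Full read: the whole data area is needed to verify the checksum."""
    off = index * SEGMENT_SIZE
    _, stored = read_header(device, index)
    data = device[off + HEADER_SIZE: off + SEGMENT_SIZE]
    return zlib.crc32(data) == stored


def replay(device):
    """Return the ids of the logs that would be replayed, oldest first."""
    n = len(device) // SEGMENT_SIZE
    # Pass 1: read only the headers to find the oldest log.
    ids = [read_header(device, i)[0] for i in range(n)]
    start = min(range(n), key=lambda i: ids[i])
    # Pass 2: walk the circular buffer from the oldest log, reading each
    # segment in full, and stop at the first half-written one (assumption).
    replayed = []
    for k in range(n):
        i = (start + k) % n
        if not segment_is_complete(device, i):
            break
        replayed.append(ids[i])
    return replayed
```

Pass 1 touches only a small header per segment, while pass 2 has to read
every byte, which matches the observation that resume is IO-bound.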
- Akira
On 12/13/14 11:07 PM, Akira Hayakawa wrote:
> Hi,
>
> Jianjian, you really make a point about the fundamental design.
>
>> If I understand it correctly, the whole idea indeed is very simple,
>> the consumer/provider and circular buffer model. use SSD as a circular
>> write buffer, write flush thread stores incoming writes to this buffer
>> sequentially as provider, and writeback thread write those data logs
>> sequentially into backing device as consumer.
>>
>> If writeboost can do that without any random writes, then probably it
>> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
>> sequential read/write performance from SSD. That'll be awesome.
>> However, I saw every data log segment in its design has meta data
>> header, like dirty_bits, so I guess writeboost has to randomly write
>> those data headers of stored data logs in SSD; also, splitting all bio
>> into 4KB will hurt its ability to get max raw SSD throughput, modern
>> NAND Flash has pages much bigger than 4KB; so overall I think the
>> actual benefits writeboost gets from this design will be discounted.
> You understand *almost* correctly.
>
> Writeboost has two circular buffers, not one: the RAM buffers and the SSD.
> The incoming bio is split into 4KB chunks at the virtual make_request
> and these are NOT directly remapped to the SSD.
> As you mentioned, if I had designed it that way, many metadata updates would happen.
> That's really bad since SSDs are very bad at small updates.
>
> Actually, each 4KB bio is first stored in a RAM buffer, which is 512KB large.
> (512-4)/4=127 4KB bio data blocks are stored in the RAM buffer, and the 4KB metadata
> section at the head is made after that.
>
> The RAM buffer is now called a "log" and, as you mentioned, is flushed to the SSD
> as a 512KB sequential write. This definitely maximizes throughput and lifetime.
>
> Unfortunately, this is not always the case because of barrier request handling.
> But when the writes are really heavy (e.g. massive dirty page writeback),
> Writeboost works as above.
>
>> The good thing is that it seems writeboost doesn't use garbage
>> collection to clean old invalid logs, this will avoid the double
>> garbage collection effect other caching module has, which essentially
>> both caching module and internal SSD will perform garbage collections
>> twice.
> Yes. And I believe an SSD could do without wear-leveling if I use it in a fairly sequential way.
> Am I right? Indeed, Writeboost is really SSD friendly.
>
>> And one question, how long will be data logs replay time during init,
>> if SSD is almost full of dirty data logs?
> Sorry, I don't have data now, but it's slow, as you may imagine.
> I will measure the time later.
>
> The major reason is that it needs to read the full 512KB segment to calculate the
> checksum and know whether the log is half written.
> (Reading a 500GB SSD that does 500MB/sec seqread takes 1000 secs.)
> I think doing the procedure in parallel to exploit the full internal parallelism
> inside the SSD may improve performance, but it only brings the coefficient down from 1 to 1/n.
> Definitely, Writeboost isn't fit for a machine that needs frequent reboots (e.g. desktop).
>
> There is a way to reduce the init time. We can record "what is the latest log written back"
> in the superblock. This lets us skip reads that aren't essential.
>
> The corresponding code is the replay_log_on_cache() function. Please read it if you are
> interested.
>
> Thanks,
>
> - Akira
>
> On 12/13/14 3:45 PM, Jianjian Huo wrote:
>> If I understand it correctly, the whole idea indeed is very simple,
>> the consumer/provider and circular buffer model. use SSD as a circular
>> write buffer, write flush thread stores incoming writes to this buffer
>> sequentially as provider, and writeback thread write those data logs
>> sequentially into backing device as consumer.
>>
>> If writeboost can do that without any random writes, then probably it
>> can save SSD/FTL of doing a lot of dirty jobs, and utilize the faster
>> sequential read/write performance from SSD. That'll be awesome.
>> However, I saw every data log segment in its design has meta data
>> header, like dirty_bits, so I guess writeboost has to randomly write
>> those data headers of stored data logs in SSD; also, splitting all bio
>> into 4KB will hurt its ability to get max raw SSD throughput, modern
>> NAND Flash has pages much bigger than 4KB; so overall I think the
>> actual benefits writeboost gets from this design will be discounted.
>>
>> The good thing is that it seems writeboost doesn't use garbage
>> collection to clean old invalid logs, this will avoid the double
>> garbage collection effect other caching module has, which essentially
>> both caching module and internal SSD will perform garbage collections
>> twice.
>>
>> And one question, how long will be data logs replay time during init,
>> if SSD is almost full of dirty data logs?
>>
>> Jianjian
>>
>> On Fri, Dec 12, 2014 at 7:09 AM, Akira Hayakawa <ruby.wktk@xxxxxxxxx> wrote:
>>>> However, after looking at the current code, and using it I think it's
>>>> a long, long way from being ready for production. As we've already
>>>> discussed there are some very naive design decisions in there, such as
>>>> copying every bio payload to another memory buffer, splitting all io
>>>> down to 4k. Think about the cpu overhead and memory consumption!
>>>> Think about how it will perform when memory is constrained and it
>>>> can't allocate many of those rambufs! I'm sure more issues will be
>>>> found if I read further.
>>> These decisions were made based on measurement. They are not naive.
>>> I am a man who dislikes performance optimization without measurement.
>>> As a result, I regard what simplicity brings as much more important
>>> than what other possible design decisions would bring.
>>>
>>> About the CPU consumption,
>>> the average CPU consumption while performing random write fio
>>> with a consumer-level SSD is only 3% or so,
>>> which is 5 times more efficient than bcache per IOPS.
>>>
>>> With RAM-backed cache device, it reaches about 1.5GB/sec throughput.
>>> Even in this case the CPU consumption is only 12%.
>>> Please see this post,
>>> http://www.redhat.com/archives/dm-devel/2014-February/msg00000.html
>>>
>>> I don't think the CPU consumption is small enough to ignore.
>>>
>>> About the memory consumption,
>>> you seem to misunderstand the facts.
>>> The rambufs are not dynamically allocated but statically.
>>> The default amount is 8MB and this is usually not a point of contention.
>>>
>>>> Mike raised the question of why you want this in the kernel so much?
>>>> You'd find none of the distros would support it; so it doesn't widen
>>>> your audience much. It's far better for you to maintain it outside of
>>>> the kernel at this point. Any users will be bold, adventurous people,
>>>> who will be quite capable of building a kernel module.
>>> Some people deploy Writeboost in their daily use.
>>> The sound of "log-structured" seems to easily attract storage guys' attention.
>>> If this driver is merged upstream, I think it will gain a much larger audience and
>>> thus more feedback.
>>> When my driver was introduced by Phoronix before, it actually drew attention.
>>> They must be waiting for Writeboost to become available upstream.
>>> http://www.phoronix.com/scan.php?page=news_item&px=MTQ1Mjg
>>>
>>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>>> it would mean a massive amount of support work for me, not to mention
>>>> a damaged reputation for dm.
>>> If you read the code further, you will find how simple the mechanism is.
>>> Not to mention the code itself.
>>>
>>> - Akira
>>>
>>> On 12/12/14 11:24 PM, Joe Thornber wrote:
>>>> On Fri, Dec 12, 2014 at 09:42:15AM +0900, Akira Hayakawa wrote:
>>>>> The SSD-caching should be log-structured.
>>>>
>>>> No argument there, and this is why I've supported you with
>>>> dm-writeboost over the last couple of years.
>>>>
>>>> However, after looking at the current code, and using it I think it's
>>>> a long, long way from being ready for production. As we've already
>>>> discussed there are some very naive design decisions in there, such as
>>>> copying every bio payload to another memory buffer, splitting all io
>>>> down to 4k. Think about the cpu overhead and memory consumption!
>>>> Think about how it will perform when memory is constrained and it
>>>> can't allocate many of those rambufs! I'm sure more issues will be
>>>> found if I read further.
>>>>
>>>> I'm sorry to have disappointed you so, but if I let this go upstream
>>>> it would mean a massive amount of support work for me, not to mention
>>>> a damaged reputation for dm.
>>>>
>>>> Mike raised the question of why you want this in the kernel so much?
>>>> You'd find none of the distros would support it; so it doesn't widen
>>>> your audience much. It's far better for you to maintain it outside of
>>>> the kernel at this point. Any users will be bold, adventurous people,
>>>> who will be quite capable of building a kernel module.
>>>>
>>>> - Joe
>>>>
>>>
>>> --
>>> dm-devel mailing list
>>> dm-devel@xxxxxxxxxx
>>> https://www.redhat.com/mailman/listinfo/dm-devel
>>
>