Re: [RFC] dm-writeboost: Persistent memory support

From: Jerome Glisse
Date: Fri Feb 28 2014 - 14:46:24 EST


On Fri, Oct 04, 2013 at 10:37:21PM +0900, Akira Hayakawa wrote:
> Hi, all
>
> Let me introduce my future plan
> for applying persistent memory to dm-writeboost.
> dm-writeboost can potentially
> benefit a lot from persistent memory.
>
> (1) Problem
> The basic mechanism of dm-writeboost is:
> (i) it first stores the write data in a RAM buffer
> whose size is 1MB at maximum and
> which can hold 255 * 4KB of data.
> (ii) when the RAM buffer is full
> it packs the data and its metadata,
> which indicates where to write back,
> into a structure called "log" and
> queues it.
> (iii) the log is flushed to the cache device
> in the background.
> (iv) it is later migrated, or written back,
> to the backing store in the background.
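
Steps (i) and (ii) above can be sketched in user-space C roughly as follows. This is only an illustration under stated assumptions: struct wb_rambuf, struct wb_meta, wb_write() and wb_queue_log() are hypothetical names, not the actual dm-writeboost code, and "queueing" is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WB_BLOCK_SIZE 4096u /* one 4KB data block */
#define WB_MAX_BLOCKS 255u  /* data blocks per RAM buffer, as in the text */

/* Hypothetical per-block metadata: where the block must be written back. */
struct wb_meta {
	uint64_t backing_sector;
};

/* Hypothetical RAM buffer; not the real dm-writeboost layout. */
struct wb_rambuf {
	struct wb_meta meta[WB_MAX_BLOCKS];
	uint8_t data[WB_MAX_BLOCKS][WB_BLOCK_SIZE];
	unsigned int nr_blocks;      /* blocks currently buffered */
	unsigned int nr_logs_queued; /* stands in for the flush queue */
};

static void wb_rambuf_init(struct wb_rambuf *rb)
{
	rb->nr_blocks = 0;
	rb->nr_logs_queued = 0;
}

/* Step (ii): pack the buffered blocks into a "log" and queue it.
 * Here this only counts the log and empties the buffer. */
static void wb_queue_log(struct wb_rambuf *rb)
{
	if (rb->nr_blocks == 0)
		return;
	rb->nr_logs_queued++;
	rb->nr_blocks = 0;
}

/* Step (i): store one 4KB write in the RAM buffer.
 * Returns true if this write filled the buffer and a log was queued. */
static bool wb_write(struct wb_rambuf *rb, uint64_t sector, const void *block)
{
	assert(rb->nr_blocks < WB_MAX_BLOCKS);

	rb->meta[rb->nr_blocks].backing_sector = sector;
	memcpy(rb->data[rb->nr_blocks], block, WB_BLOCK_SIZE);
	rb->nr_blocks++;

	if (rb->nr_blocks == WB_MAX_BLOCKS) {
		wb_queue_log(rb);
		return true;
	}
	return false;
}
```

In this model a log is queued only by the 255th write. Calling wb_queue_log() earlier would correspond to the "partial" log discussed below.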
>
> The problem is in handling barrier writes
> flagged with REQ_FUA or REQ_FLUSH.
> The upper layer waits for these kinds of bios to complete,
> so waiting for the log to fill up and then be queued
> may stall the upper layer.
> One way of handling these bios is that
> dm-writeboost makes a "partial" log and queues it,
> which potentially causes random writes to the
> cache device (SSD). This not only hurts performance
> but also fails to maximize the lifetime of the SSD.
> Moreover, it consumes CPU cycles to make a partial log
> again and again. It is not free.
>
> So, dm-writeboost provides a tunable parameter called
> barrier_deadline_ms that sets the
> worst-case time within which these flagged bios are guaranteed to be queued.
> Making a partial log is deferred until then,
> which means that the log can fill up before the deadline
> if there are many processes submitting writes.
>
> In summary,
> due to the REQ_FUA and REQ_FLUSH flags
> dm-writeboost cannot guarantee that the log is always full.
> Imagine there is only one process above the dm-writeboost device
> and it keeps submitting a REQ_FUA bio and waiting for its completion, over and over.
> This is the worst case for dm-writeboost:
> the log is always partial and the process always waits for
> the deadline.
>
> If the RAM buffer is smaller than 1MB
> the log is more likely to fill up.
> The size of the RAM buffer is tunable in the constructor.
> However, this is not the ultimate solution.
>
> So, let's find the ultimate solution next.
>
> (2) What if the RAM buffer is non-volatile
> If we use persistent memory for the RAM buffer instead of
> DRAM, which is volatile,
> we don't need to partially flush the log
> to complete these flagged bios quickly.
> We can get away with only writing the data
> to the persistent RAM buffer and then returning ACK.
>
> This means
> the 1MB log will always be full
> and the upper layer will never have to worry about
> how to handle the REQ_FUA or REQ_FLUSH flagged bios.
> This will always maximize the write throughput to the SSD
> and maximize its lifetime.
>
> Furthermore,
> the upper layer can eliminate its
> optimization for these bios.
> For example, XFS also uses the same technique
> of gathering the barriers, as explained by Dave Chinner in
> https://lkml.org/lkml/2013/10/3/804
>
> Using dm-writeboost with persistent memory,
> the upper layer will be relieved
> of doing difficult things.
> Applying persistent memory to dm-writeboost is promising.
>
> Any comment?
>
> (3) Design Change
> I have read this thread in LKML
> "RFC Block Layer Extensions to Support NV-DIMMs"
> https://lkml.org/lkml/2013/9/4/555
>
> The interface design is still under discussion but
> I hope to see an interface design that treats
> persistent memory as a new type of memory,
> not as a block device.
>
> Even if the RAM buffer is switched from
> volatile to non-volatile,
> the basic I/O path of dm-writeboost will not change.
> I think most of the code can be shared between
> the volatile mode and the non-volatile mode of dm-writeboost.
> So, switching the mode with a constructor parameter
> will be my design choice.
>
> Maybe the constructor will be like this
> writeboost <mode> ...
> writeboost 0 <backing store> <cache device> ....
> writeboost 1 <backing store> <cache device> <persistent memory> ...
>
> If the mode is 0 it builds a writeboost device with a volatile RAM buffer
> and if the mode is 1 it builds one with a non-volatile RAM buffer.
>
> The current design doesn't have a mode parameter,
> so adding the parameter right now could be our design choice.
> But even if we don't add it right now,
> backward compatibility can be guaranteed
> by implicitly setting the mode to 0 if the first parameter is not a number.
> I prefer adding it right now for future design consistency.
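
The backward-compatible fallback described above could look something like the following user-space sketch. The names wb_parse_mode() and enum wb_mode are hypothetical, and a real device-mapper constructor would use the kernel's own argument helpers rather than strtol(). This only shows the "first parameter is not a number means mode 0" rule.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical mode values matching the proposed <mode> argument. */
enum wb_mode {
	WB_MODE_VOLATILE = 0,    /* RAM buffer in DRAM */
	WB_MODE_NONVOLATILE = 1, /* RAM buffer in persistent memory */
};

/*
 * Look at the first constructor argument.
 * Returns 1 if argv[0] was a valid mode and was consumed,
 *         0 if argv[0] is not a number (old-style table, mode 0 implied),
 *        -1 if argv[0] is a number but not a known mode.
 */
static int wb_parse_mode(int argc, char **argv, enum wb_mode *mode)
{
	char *end;
	long v;

	*mode = WB_MODE_VOLATILE;
	if (argc < 1 || argv[0][0] == '\0')
		return 0;

	errno = 0;
	v = strtol(argv[0], &end, 10);
	if (end == argv[0] || *end != '\0')
		return 0; /* not a pure number, e.g. "/dev/sdb" or "8:16" */
	if (errno == ERANGE || (v != 0 && v != 1))
		return -1;

	*mode = (enum wb_mode)v;
	return 1;
}
```

The caller would skip one argument when 1 is returned and parse the rest as <backing store> <cache device> ... as before. One caveat of this rule: a device argument consisting only of digits would be taken for a mode, which may be one more reason to add the parameter from the start.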
>
> Should I or shouldn't I add the parameter before
> making a patch to the device-mapper tree?
>
> (4) Prototype
> I think I can start prototyping
> by defining a pseudo persistent memory backed by a block device.
>
> The temporary interface will be defined like:
> struct pmem *pmem_alloc(struct block_device *, size_t start, size_t len);
> void pmem_write(struct pmem *, size_t start, size_t len, void *data);
> void pmem_read(struct pmem *, size_t start, size_t len, void *dest);
> void pmem_free(struct pmem *);
>
> Byte addressability is implemented by read-modify-write.
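
To make the read-modify-write idea concrete, here is a user-space model of the temporary interface. It is a sketch under assumptions: struct fake_bdev and its 512-byte sector helpers are hypothetical stand-ins for struct block_device and real block I/O, and pmem_write() takes a const pointer here, unlike the prototype above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512u

/* Hypothetical stand-in for a block device: whole-sector transfers only. */
struct fake_bdev {
	uint8_t *sectors;
	size_t nr_sectors;
};

static void bdev_read_sector(struct fake_bdev *bd, size_t idx, void *buf)
{
	assert(idx < bd->nr_sectors);
	memcpy(buf, bd->sectors + idx * SECTOR_SIZE, SECTOR_SIZE);
}

static void bdev_write_sector(struct fake_bdev *bd, size_t idx, const void *buf)
{
	assert(idx < bd->nr_sectors);
	memcpy(bd->sectors + idx * SECTOR_SIZE, buf, SECTOR_SIZE);
}

/* Pseudo persistent memory: a byte range [start, start + len) on the device. */
struct pmem {
	struct fake_bdev *bdev;
	size_t start;
	size_t len;
};

static struct pmem *pmem_alloc(struct fake_bdev *bd, size_t start, size_t len)
{
	size_t dev_bytes = bd->nr_sectors * SECTOR_SIZE;
	struct pmem *pm;

	if (start > dev_bytes || len > dev_bytes - start)
		return NULL; /* region does not fit on the device */
	pm = malloc(sizeof(*pm));
	if (!pm)
		return NULL;
	pm->bdev = bd;
	pm->start = start;
	pm->len = len;
	return pm;
}

/* Byte-granular write: read each touched sector, patch it, write it back. */
static void pmem_write(struct pmem *pm, size_t start, size_t len, const void *data)
{
	const uint8_t *src = data;
	uint8_t sector[SECTOR_SIZE];
	size_t pos = pm->start + start;

	assert(start <= pm->len && len <= pm->len - start);
	while (len > 0) {
		size_t idx = pos / SECTOR_SIZE;
		size_t off = pos % SECTOR_SIZE;
		size_t n = SECTOR_SIZE - off;

		if (n > len)
			n = len;
		bdev_read_sector(pm->bdev, idx, sector);  /* read */
		memcpy(sector + off, src, n);             /* modify */
		bdev_write_sector(pm->bdev, idx, sector); /* write */
		src += n;
		pos += n;
		len -= n;
	}
}

/* Byte-granular read: fetch each touched sector and copy out the wanted part. */
static void pmem_read(struct pmem *pm, size_t start, size_t len, void *dest)
{
	uint8_t *dst = dest;
	uint8_t sector[SECTOR_SIZE];
	size_t pos = pm->start + start;

	assert(start <= pm->len && len <= pm->len - start);
	while (len > 0) {
		size_t idx = pos / SECTOR_SIZE;
		size_t off = pos % SECTOR_SIZE;
		size_t n = SECTOR_SIZE - off;

		if (n > len)
			n = len;
		bdev_read_sector(pm->bdev, idx, sector);
		memcpy(dst, sector + off, n);
		dst += n;
		pos += n;
		len -= n;
	}
}

static void pmem_free(struct pmem *pm)
{
	free(pm);
}
```

Note that a write of a few bytes costs a full sector read plus a full sector write in this model, and it is not atomic across a sector boundary. That is probably acceptable for a prototype but is not how real persistent memory would behave.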
>
> The difficulty in using persistent memory instead
> is in recovering the data on both the RAM buffer and the cache device
> when rebooting.
> The implementation will be complicated but
> can mostly be confined to the recover_cache() routine,
> and the code outside of it will not be badly affected.
>
> Should I prototype before making a patch to the device-mapper tree?
>
> Akira

Just jumping in. I am working on a new API to allow mirroring a process address
space on a device. The devices we are targeting sit behind an IOMMU and I fear
that in some cases the persistent memory will not be accessible from behind the
IOMMU.

In such cases it is important to be able to force some ranges of memory
to go through the normal volatile page cache memory.

Even when the persistent memory is accessible from behind the IOMMU we will
want to mirror memory in local device memory for more or less long periods of
time, and thus will again need a way to make a range of persistent memory
behave as if things were going through volatile memory.

I hope to send a patchset for comment in April and at that time it will be
easier for everyone to see the internals of how things are done, but in a
nutshell device memory is considered swap and page cache entries can be swapped
to the device memory.

Cheers,
Jérôme

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/