Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)

From: david
Date: Thu Aug 13 2009 - 16:45:17 EST


On Thu, 13 Aug 2009, Greg Freemyer wrote:

On Thu, Aug 13, 2009 at 12:33 PM, <david@xxxxxxx> wrote:
On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:

On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:

I am planning a complete overhaul of the discard work.  Users can send
down discard requests as frequently as they like.  The block layer will
cache them, and invalidate them if writes come through.  Periodically,
the block layer will send down a TRIM or an UNMAP (depending on the
underlying device) and get rid of the blocks that have remained unwanted
in the interim.
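
To make the proposal above concrete, here is a minimal sketch of such a cache. It is not code from any patch in this thread: the class name DiscardCache and its methods are hypothetical, and ranges are tracked as plain (start, end) block numbers rather than real bios. Discards are merged as they arrive, a write cuts its blocks back out, and flush() returns whatever has stayed unwanted.

```python
class DiscardCache:
    """Hypothetical cache of pending discards.

    Ranges are kept as a sorted list of disjoint, non-adjacent
    (start, end) tuples with end exclusive.
    """

    def __init__(self):
        self.pending = []

    def discard(self, start, length):
        """Record a discard, merging it with overlapping or adjacent ranges."""
        if length <= 0:
            return
        new_start, new_end = start, start + length
        merged = []
        for s, e in self.pending:
            if e < new_start or s > new_end:
                merged.append((s, e))          # no contact: keep unchanged
            else:
                new_start = min(new_start, s)  # overlaps or touches: absorb
                new_end = max(new_end, e)
        merged.append((new_start, new_end))
        merged.sort()
        self.pending = merged

    def write(self, start, length):
        """A write makes blocks live again, so remove them from the cache."""
        if length <= 0:
            return
        w_start, w_end = start, start + length
        kept = []
        for s, e in self.pending:
            if e <= w_start or s >= w_end:
                kept.append((s, e))            # untouched by this write
                continue
            if s < w_start:
                kept.append((s, w_start))      # part before the write survives
            if e > w_end:
                kept.append((w_end, e))        # part after the write survives
        self.pending = kept

    def flush(self):
        """Return the ranges to send down as TRIM/UNMAP and empty the cache."""
        out, self.pending = self.pending, []
        return out
```

Two adjacent discards followed by a small write inside them would leave two ranges to flush instead of three separate commands, which is the consolidation being asked for further down the thread.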

That is a very good idea. I tested your original TRIM implementation on
my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
milliseconds to digest a single TRIM command. And since your
implementation sends a TRIM for each extent of each deleted file, the
whole system is unusable after a short while.
An optimal solution would be to consolidate the discard requests, bundle
them and send them to the drive as infrequently as possible.

or queue them up and send them when the drive is idle (you would need to
keep track to make sure the space isn't re-used)

as an example, at the point where you would consider spinning down a
drive, you don't hurt performance by sending the accumulated trim commands.

David Lang

An alternate approach is for the block layer to maintain its own bitmap
of used/unused sectors or blocks. Unmap commands from the filesystem just
cause the bitmap to be updated. No other effect.

how does the block layer know what blocks are unused by the filesystem?

or would it be a case of the filesystem generating discard/trim requests
to the block layer so that it can maintain its bitmap, and then the block
layer generating the requests to the drive below it?

David Lang

(Big unknown: Where will the bitmap live between reboots? Require DM
volumes so we can have a dedicated bitmap volume in the mix to store
the bitmap on? Maybe on mount, the filesystem has to be scanned to
initially populate the bitmap? Other options?)

Assuming we have a persistent bitmap in place, have a background
scanner that kicks in when the cpu / disk is idle. It just
continuously scans the bitmap looking for contiguous blocks of unused
sectors. Each time it finds one, it sends the largest possible unmap
down the block stack and eventually to the device.
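
A rough sketch of the scanner described above, under some assumptions of mine: the bitmap is a plain sequence of booleans with True meaning "in use", max_run stands in for whatever per-command limit the device has, and is_idle and send_unmap are hypothetical callbacks supplied by the caller. None of these names come from existing kernel code.

```python
def find_free_runs(bitmap, max_run=None):
    """Yield (start, length) for each contiguous run of unused blocks.

    bitmap: sequence of booleans, True = block in use (assumed convention).
    max_run: optional cap on the length of a single run.
    """
    run_start = None
    for i, used in enumerate(bitmap):
        if not used:
            if run_start is None:
                run_start = i                      # a new free run begins
            elif max_run is not None and i - run_start >= max_run:
                yield (run_start, i - run_start)   # run hit the cap: split it
                run_start = i
        elif run_start is not None:
            yield (run_start, i - run_start)       # a used block ends the run
            run_start = None
    if run_start is not None:
        yield (run_start, len(bitmap) - run_start) # run reaches end of bitmap


def scan_when_idle(bitmap, is_idle, send_unmap):
    """Send one unmap per free run, stopping as soon as the system is busy.

    Returns True if the whole bitmap was scanned, False if the scan
    stopped early because is_idle() reported activity.
    """
    for start, length in find_free_runs(bitmap):
        if not is_idle():
            return False
        send_unmap(start, length)
    return True
```

With a bitmap of [used, free, free, used, free, free, free, free], the scanner would issue two unmaps, one of 2 blocks at offset 1 and one of 4 blocks at offset 4, rather than six single-block requests.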

When normal cpu / disk activity kicks in, this process goes to sleep.

That way much of the smarts are concentrated in the block layer, not
in the filesystem code. And it is being done when the disk is
otherwise idle, so you don't have the ncq interference.

Even laptop users should have enough idle cpu available to manage
this. Enterprise would get the large discards it wants, and
unmentioned in the previous discussion, mdraid gets the large discards
it also wants.

i.e., if an mdraid raid5/raid6 volume is built of SSDs, it will only be
able to discard a full stripe at a time. Otherwise the P = D1 ^ D2
parity relationship is lost.
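
To illustrate the full-stripe constraint, here is a small sketch that shrinks a discard range to the whole stripes it covers. The function name and parameters are made up for the example, and it assumes a simple layout where one stripe holds chunk_blocks * data_disks logical blocks; it does not reflect how md actually lays out or processes requests.

```python
def full_stripe_range(start, length, chunk_blocks, data_disks):
    """Shrink a discard to the full stripes it completely covers.

    start, length: discard range in logical blocks of the array.
    chunk_blocks:  blocks per chunk on each member disk (assumed).
    data_disks:    number of data (non-parity) disks per stripe (assumed).

    Returns (start, length) of the aligned range, or None when the
    request does not cover even one whole stripe.
    """
    stripe = chunk_blocks * data_disks
    first = -(-start // stripe) * stripe         # round start up to a boundary
    last = (start + length) // stripe * stripe   # round end down to a boundary
    if last <= first:
        return None                              # no complete stripe inside
    return (first, last - first)
```

The partial stripes at either end are left alone, so the parity for those stripes still matches the data. This is also why small discards are of little use here: a request shorter than one stripe would be dropped entirely.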

Another benefit of the above is the code should be extremely safe and testable.

Greg