Re: Discard support (was Re: [PATCH] swap: send callback when swap slot is freed)

From: Greg Freemyer
Date: Thu Aug 13 2009 - 14:15:40 EST


On Thu, Aug 13, 2009 at 12:33 PM, <david@xxxxxxx> wrote:
> On Thu, 13 Aug 2009, Markus Trippelsdorf wrote:
>
>> On Thu, Aug 13, 2009 at 08:13:12AM -0700, Matthew Wilcox wrote:
>>>
>>> I am planning a complete overhaul of the discard work.  Users can send
>>> down discard requests as frequently as they like.  The block layer will
>>> cache them, and invalidate them if writes come through.  Periodically,
>>> the block layer will send down a TRIM or an UNMAP (depending on the
>>> underlying device) and get rid of the blocks that have remained unwanted
>>> in the interim.
>>
>> That is a very good idea. I've tested your original TRIM implementation on
>> my Vertex yesterday and it was awful ;-). The SSD needs hundreds of
>> milliseconds to digest a single TRIM command. And since your
>> implementation sends a TRIM for each extent of each deleted file, the
>> whole system is unusable after a short while.
>> An optimal solution would be to consolidate the discard requests, bundle
>> them and send them to the drive as infrequent as possible.
>
> or queue them up and send them when the drive is idle (you would need to
> keep track to make sure the space isn't re-used)
>
> as an example, if you would consider spinning down a drive you don't hurt
> performance by sending accumulated trim commands.
>
> David Lang

An alternate approach is for the block layer to maintain its own
bitmap of used / unused sectors (or blocks). Unmap commands from the
filesystem just cause the bitmap to be updated, with no other
immediate effect.
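
To make the bitmap idea concrete, here is a minimal userspace sketch in
Python (illustration only, not kernel code). The class name, the method
names, and the choice to start with every block marked "in use" are all
assumptions made for the example; so is the rule that a write marks
blocks as used again, which follows the "invalidate them if writes come
through" point quoted above.

```python
class SectorBitmap:
    """Hypothetical in-memory bitmap: one bit per block, 1 = in use."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        # Assumption: start with every block marked in use, so nothing
        # is ever discarded unless the filesystem said it was unwanted.
        self.bits = bytearray(b"\xff" * ((nblocks + 7) // 8))

    def _set(self, block, used):
        byte, bit = divmod(block, 8)
        if used:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_used(self, block):
        byte, bit = divmod(block, 8)
        return bool(self.bits[byte] & (1 << bit))

    def unmap(self, start, count):
        # Discard request from the filesystem: update the bitmap only,
        # nothing is sent to the device at this point.
        for block in range(start, start + count):
            self._set(block, False)

    def write(self, start, count):
        # A write makes the blocks live again, cancelling any pending
        # discard for them.
        for block in range(start, start + count):
            self._set(block, True)
```

For example, after `bm.unmap(4, 8)` followed by `bm.write(6, 2)` on a
16-block bitmap, blocks 4-5 and 8-11 remain marked unused while 6-7 are
back in use.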

(Big unknown: Where will the bitmap live between reboots? Require DM
volumes so we can have a dedicated bitmap volume in the mix to store
the bitmap on? Maybe on mount, the filesystem has to be scanned to
initially populate the bitmap? Other options?)

Assuming we have a persistent bitmap in place, have a background
scanner that kicks in when the cpu / disk is idle. It just
continuously scans the bitmap looking for contiguous blocks of unused
sectors. Each time it finds one, it sends the largest possible unmap
down the block stack and eventually to the device.
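
The scanning step described above can be sketched roughly as follows,
again in Python for illustration only. The bitmap is modelled as a plain
list of booleans (True = in use); `send_unmap` and `is_idle` are
hypothetical callbacks standing in for "send the unmap down the block
stack" and "is the cpu / disk still idle", neither of which corresponds
to a real kernel interface.

```python
def unused_runs(used):
    """Yield (start, length) for each maximal run of unused blocks.

    `used` is a sequence of booleans, True meaning the block is in use.
    """
    start = None
    for i, in_use in enumerate(used):
        if not in_use and start is None:
            start = i                      # a new unused run begins
        elif in_use and start is not None:
            yield (start, i - start)       # run ended at block i - 1
            start = None
    if start is not None:
        yield (start, len(used) - start)   # run reaches the last block


def scan_once(used, send_unmap, is_idle=lambda: True):
    """One pass of the hypothetical background scanner.

    Sends one unmap per contiguous unused run and stops early as soon
    as the system is no longer idle. Returns the number of unmaps sent.
    """
    sent = 0
    for start, length in unused_runs(used):
        if not is_idle():
            break                          # normal activity: go to sleep
        send_unmap(start, length)
        sent += 1
    return sent
```

Because each run is maximal, every unmap issued is as large as the
bitmap allows at that moment. A real implementation would also need some
way to avoid re-sending unmaps for runs it has already discarded; that
is left out of this sketch.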

When normal cpu / disk activity kicks in, this process goes to sleep.

That way much of the smarts are concentrated in the block layer, not
in the filesystem code. And it is being done when the disk is
otherwise idle, so you don't have the NCQ interference.

Even laptop users should have enough idle cpu available to manage
this. Enterprise would get the large discards it wants, and,
unmentioned in the previous discussion, mdraid gets the large discards
it also wants.

I.e., if an mdraid raid5/raid6 volume is built of SSDs, it will only
be able to discard a full stripe at a time. Otherwise the
P = D1 ^ D2 parity relationship is lost.
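
A small sketch of the full-stripe constraint, in Python for illustration
only. It assumes a simple layout where a stripe covers `stripe_blocks`
consecutive logical blocks of the array; the function name is made up
for the example. The parity check at the end assumes a discarded block
reads back as zeros, which is device dependent.

```python
def full_stripe_range(start, length, stripe_blocks):
    """Shrink a discard range to the whole stripes it fully covers.

    Returns (start, length) of the aligned range, or None if the range
    does not contain a complete stripe.
    """
    first = -(-start // stripe_blocks) * stripe_blocks          # round up
    end = ((start + length) // stripe_blocks) * stripe_blocks   # round down
    if end <= first:
        return None
    return (first, end - first)


# Why a partial discard is a problem: with P = D1 ^ D2, either data
# block can be rebuilt from the other one plus parity.
d1, d2 = 0b1010, 0b0110
p = d1 ^ d2
assert p ^ d1 == d2            # normal reconstruction of D2 works

# Assumption: after discarding only D1 it reads back as zeros. The old
# parity no longer matches, so rebuilding D2 gives the wrong answer.
d1_after_discard = 0
assert p ^ d1_after_discard != d2
```

With 8-block stripes, a discard of blocks 5..24 would be trimmed to
blocks 8..23 (two full stripes), and a discard of blocks 5..10 would be
dropped entirely.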

Another benefit of the above is that the code should be extremely safe
and testable.

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com