Re: [dm-devel] [PATCH] dm-region-hash: fix strange usage of mempool_alloc.

From: NeilBrown
Date: Wed May 03 2017 - 21:22:28 EST

On Wed, May 03 2017, Mikulas Patocka wrote:

> On Mon, 24 Apr 2017, NeilBrown wrote:
>> I had a look at how the allocated 'dm_region' objects are used,
>> and it would take a bit of work to make it really safe.
>> My guess is __rh_find() should be allowed to fail, and the various
>> callers need to handle failure.
>> For example, dm_rh_inc_pending() would be given a second bio_list,
>> and would move any bios for which rh_inc() fails, onto that list.
>> Then do_writes() would merge that list back into ms->writes.
>> That way do_mirror() would not block indefinitely and forward progress
>> could be assured ... maybe.
>> It would take more work than I'm able to give at the moment, so
>> I'm happy to just drop this patch.
>> Thanks,
>> NeilBrown
> I think that the only way to fix this would be to preallocate all
> the regions when the target is created.
> But, with the default region size of 512KiB, it would cause high memory
> consumption (approximately 1GB of RAM for a 20TB device).

Two reflections:
1/ This is close to what md/bitmap does.
It actually uses a 2-level array for the 'pending' field from
dm_region, combined with something similar to 'state'.
The top level is allocated when the device is created.
Entries in this table are either
- pointers to a second-level array for 2048 regions
- entries for 2 giant regions, 1024 times the normal size.

So if we cannot allocate a page when we need that second level,
we just use an enormous region and so risk resync taking a bit
longer if there is a crash.
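
To make the fallback in point 1 concrete, here is a minimal user-space sketch of a two-level pending-counter table. It is not the md/bitmap code: the names (pending_table, top_entry, pending_counter, the fail_alloc test hook) and the struct layout are assumptions made for illustration, and only the 2048-regions-per-page and 2-giant-regions-per-entry figures come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define REGIONS_PER_PAGE  2048u  /* 2048 16-bit counters = one 4KiB page */
#define GIANT_PER_ENTRY   2u     /* fallback: 2 giant regions per entry */
#define REGIONS_PER_GIANT (REGIONS_PER_PAGE / GIANT_PER_ENTRY)  /* 1024 */

/* Hypothetical top-level entry: a second-level page, or two coarse counters. */
struct top_entry {
    uint16_t *page;                  /* NULL until allocated, or in fallback */
    uint16_t giant[GIANT_PER_ENTRY]; /* used only when 'hijacked' is set */
    int hijacked;
};

struct pending_table {
    struct top_entry *top;  /* allocated once, when the device is created */
    size_t nr_entries;
    int fail_alloc;         /* test hook: simulate second-level alloc failure */
};

static int pending_table_init(struct pending_table *t, uint64_t nr_regions)
{
    t->nr_entries = (size_t)((nr_regions + REGIONS_PER_PAGE - 1) / REGIONS_PER_PAGE);
    t->top = calloc(t->nr_entries, sizeof(*t->top));
    t->fail_alloc = 0;
    return t->top ? 0 : -1;
}

/* Return the pending counter covering 'region'. Never fails after init. */
static uint16_t *pending_counter(struct pending_table *t, uint64_t region)
{
    size_t idx = (size_t)(region / REGIONS_PER_PAGE);
    uint32_t off = (uint32_t)(region % REGIONS_PER_PAGE);
    struct top_entry *e;

    assert(idx < t->nr_entries);
    e = &t->top[idx];

    if (!e->page && !e->hijacked) {
        e->page = t->fail_alloc ? NULL
                                : calloc(REGIONS_PER_PAGE, sizeof(uint16_t));
        if (!e->page)
            e->hijacked = 1;  /* fall back to coarse tracking */
    }
    if (e->page)
        return &e->page[off];
    /* 1024 normal regions share one counter, so resync covers more data. */
    return &e->giant[off / REGIONS_PER_GIANT];
}

static void pending_table_free(struct pending_table *t)
{
    for (size_t i = 0; i < t->nr_entries; i++)
        free(t->top[i].page);
    free(t->top);
}
```

The point of the design is that pending_counter() cannot fail once the top level exists: an allocation failure only makes the tracking coarser. In this sketch an entry stays in fallback mode once hijacked, which is a simplification.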

2/ Even though md does pre-alloc to a degree, I'm not convinced that it
is necessary.
We only need a region to be recorded when it is actively being
written to, or when it is being recovered.
We could, in theory, have just one region that is written to and one
region that is being recovered. If a write request arrives for a
different region it blocks until the current region has no active
requests. Then that region is forgotten and the new region
activated, and the new write allowed to proceed.
Obviously this would be horribly slow, but it should be deadlock-free.
Using a mempool instead of a single region would then allow multiple
regions to be active in parallel, which would improve throughput
without affecting the deadlock status.
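
The scheme in point 2 can be sketched in user-space C as a fixed pool of region slots. This is an illustration only, not dm-region-hash code: region_pool, region_begin_write(), region_end_write() and POOL_SLOTS are hypothetical names, a return of -1 stands in for "the caller blocks and retries", and the step of marking a forgotten region clean in the dirty log is left out.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SLOTS 2  /* assumed pool size; 1 gives the single-region case */

struct active_region {
    uint64_t key;          /* region number */
    unsigned int pending;  /* writes in flight to this region */
    int in_use;
};

/* Fixed, preallocated set of slots: no allocation on the write path. */
struct region_pool {
    struct active_region slot[POOL_SLOTS];
};

/*
 * Returns 0 if the write may proceed, -1 if the caller must wait until
 * some active region has no requests in flight.
 */
static int region_begin_write(struct region_pool *p, uint64_t key)
{
    struct active_region *idle = 0;

    for (int i = 0; i < POOL_SLOTS; i++) {
        struct active_region *r = &p->slot[i];

        if (r->in_use && r->key == key) {
            r->pending++;  /* region already active */
            return 0;
        }
        if (!idle && r->pending == 0)
            idle = r;      /* unused, or active with nothing in flight */
    }
    if (!idle)
        return -1;         /* every slot is busy: wait */

    /* Forget the old region (if any) and activate the new one. */
    idle->key = key;
    idle->in_use = 1;
    idle->pending = 1;
    return 0;
}

static void region_end_write(struct region_pool *p, uint64_t key)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        struct active_region *r = &p->slot[i];

        if (r->in_use && r->key == key) {
            assert(r->pending > 0);
            r->pending--;
            return;
        }
    }
    assert(0 && "end_write for a region that is not active");
}
```

Because a waiting write only needs some in-flight write to finish, and finishing a write never needs a new slot, a larger POOL_SLOTS changes how many regions can be active at once but not the forward-progress argument.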

Maybe I'll try to code it and see what happens ... maybe not.

