Re: Distributed storage.

From: Evgeniy Polyakov
Date: Sat Aug 04 2007 - 12:41:04 EST


On Fri, Aug 03, 2007 at 06:19:16PM -0700, Daniel Phillips (phillips@xxxxxxxxx) wrote:
> It depends on the characteristics of the physical and virtual block
> devices involved. Slow block devices can produce surprising effects.
> Ddsnap still qualifies as "slow" under certain circumstances (big
> linear write immediately following a new snapshot). Before we added
> throttling we would see as many as 800,000 bios in flight. Nice to

Mmm, sounds tasty to work with such a system :)

> know the system can actually survive this... mostly. But memory
> deadlock is a clear and present danger under those conditions and we
> did hit it (not to mention that read latency sucked beyond belief).
>
> Anyway, we added a simple counting semaphore to throttle the bio traffic
> to a reasonable number and behavior became much nicer, but most
> importantly, this satisfies one of the primary requirements for
> avoiding block device memory deadlock: a strictly bounded amount of bio
> traffic in flight. In fact, we allow some bounded number of
> non-memalloc bios *plus* however much traffic the mm wants to throw at
> us in memalloc mode, on the assumption that the mm knows what it is
> doing and imposes its own bound of in flight bios per device. This
> needs auditing obviously, but the mm either does that or is buggy. In
> practice, with this throttling in place we never saw more than 2,000 in
> flight no matter how hard we hit it, which is about the number we were
> aiming at. Since we draw our reserve from the main memalloc pool, we
> can easily handle 2,000 bios in flight, even under extreme conditions.
>
> See:
> http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
> down(&info->throttle_sem);
>
> To be sure, I am not very proud of this throttling mechanism for various
> reasons, but the thing is, _any_ throttling mechanism no matter how
> sucky solves the deadlock problem. Over time I want to move the

make_request_fn is always called in process context, so we can wait in it
for memory from a mempool. Although if we have to wait there, we are
already in trouble.
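
To illustrate what "waiting for memory in a mempool" means, here is a
rough user-space sketch, assuming pthreads in place of kernel primitives.
The names (struct pool, pool_alloc, pool_free, POOL_MAX) are made up for
illustration and are not the kernel mempool API. The real kernel mempool
also tries the normal allocator first; this sketch leaves that out and
only shows the blocking reserve.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_MAX 16                 /* hypothetical cap on the reserve size */

struct pool {
    pthread_mutex_t lock;
    pthread_cond_t  freed;          /* signalled when an element comes back */
    void           *elem[POOL_MAX]; /* stack of free reserved elements */
    size_t          nfree;
};

/* Take ownership of n preallocated elements as the reserve. */
static int pool_init(struct pool *p, void **reserve, size_t n)
{
    if (n > POOL_MAX)
        return -1;
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->freed, NULL);
    for (size_t i = 0; i < n; i++)
        p->elem[i] = reserve[i];
    p->nfree = n;
    return 0;
}

/*
 * May sleep until another thread calls pool_free(). This is only
 * acceptable where blocking is allowed, which is the user-space
 * analogue of "process context".
 */
static void *pool_alloc(struct pool *p)
{
    void *e;

    pthread_mutex_lock(&p->lock);
    while (p->nfree == 0)
        pthread_cond_wait(&p->freed, &p->lock);
    e = p->elem[--p->nfree];
    pthread_mutex_unlock(&p->lock);
    return e;
}

/* Return an element to the reserve and wake one waiter, if any. */
static void pool_free(struct pool *p, void *e)
{
    pthread_mutex_lock(&p->lock);
    assert(p->nfree < POOL_MAX);    /* never return more than was taken */
    p->elem[p->nfree++] = e;
    pthread_mutex_unlock(&p->lock);
    pthread_cond_signal(&p->freed);
}
```

The allocation cannot fail, it can only wait. That wait only ends if
elements already handed out are eventually freed, which is why the amount
of traffic in flight has to be bounded.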

I agree, any kind of upper-bound limiting must be implemented in the
device itself, since the block layer does not know what device is at the
end and what it will need to process a given block request.
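
A minimal sketch of the counting-semaphore throttle described in the
quoted text, assuming POSIX semaphores instead of the kernel's
down()/up(). The names (struct throttle, throttle_enter, throttle_exit)
are hypothetical and not taken from dm-ddsnap.c; the in_flight counter is
there only so the bound can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>

struct throttle {
    sem_t      slots;      /* number of free in-flight slots */
    atomic_int in_flight;  /* for observation only, not used for control */
};

/* limit is the maximum number of requests allowed in flight at once. */
static int throttle_init(struct throttle *t, unsigned int limit)
{
    atomic_init(&t->in_flight, 0);
    return sem_init(&t->slots, 0, limit);
}

/* Submission path: blocks once the limit is reached, like down(). */
static void throttle_enter(struct throttle *t)
{
    while (sem_wait(&t->slots) == -1 && errno == EINTR)
        continue;          /* retry if interrupted by a signal */
    atomic_fetch_add(&t->in_flight, 1);
}

/* Non-blocking variant: returns 0 on success, -1 if the limit is reached. */
static int throttle_try_enter(struct throttle *t)
{
    if (sem_trywait(&t->slots) != 0)
        return -1;
    atomic_fetch_add(&t->in_flight, 1);
    return 0;
}

/* Completion path: releases one slot, like up(). */
static void throttle_exit(struct throttle *t)
{
    atomic_fetch_sub(&t->in_flight, 1);
    sem_post(&t->slots);
}
```

The submitter would call throttle_enter() before queueing a request and
the completion handler would call throttle_exit(), so in_flight can never
exceed the limit passed to throttle_init().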

--
Evgeniy Polyakov