Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations
From: Ilya Dryomov
Date: Thu Mar 30 2017 - 11:06:59 EST

On Thu, Mar 30, 2017 at 4:36 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> On Thu 30-03-17 15:48:42, Ilya Dryomov wrote:
>> On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> [...]
>> > familiar with Ceph at all but does any of its (slab) shrinkers generate
>> > IO to recurse back?
>>
>> We don't register any custom shrinkers. This is XFS on top of rbd,
>> a ceph-backed block device.
>
> OK, that was the part I was missing. So you depend on XFS to make
> forward progress here.
>
>> >> Well,
>> >> it's got to go through the same ceph_connection:
>> >>
>> >> rbd_queue_workfn
>> >> ceph_osdc_start_request
>> >> ceph_con_send
>> >> mutex_lock(&con->mutex) # deadlock, OSD X worker is knocked out
>> >>
>> >> Now if that was a GFP_NOIO allocation, we would simply block in the
>> >> allocator. The placement algorithm distributes objects across the OSDs
>> >> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
>> >> that OSD, some other I/Os for other OSDs would complete in the meantime
>> >> and free up memory. If we are under the kind of memory pressure that
>> >> makes GFP_NOIO allocations block for an extended period of time, we are
>> >> bound to have a lot of pre-open sockets, as we would have done at least
>> >> some flushing by then.
>> >
>> > How is this any different from xfs waiting for its IO to be done?
>>
>> I feel like we are talking past each other here. If the worker in
>> question isn't deadlocked, it will eventually get its socket and start
>> flushing I/O. If it has deadlocked, it won't...
>
> But if the allocation is stuck then the holder of the lock cannot make
> forward progress and it is effectively deadlocked because other IO
> depends on the lock it holds. Maybe I just ask bad questions but what

Only I/O to the same OSD. A typical ceph cluster has dozens of OSDs,
so there is plenty of room for other in-flight I/Os to finish and move
the allocator forward. The lock in question is per-ceph_connection
(read: per-OSD).

> makes GFP_NOIO different from GFP_KERNEL here. We know that the latter
> might need to wait for an IO to finish in the shrinker but it itself
> doesn't get the lock in question directly. The former depends on the
> allocator forward progress as well and that in turn waits for somebody
> else to proceed with the IO. So to me any blocking allocation while
> holding a lock which blocks further IO to complete is simply broken.

Right, with GFP_NOIO we simply wait -- there is nothing wrong with
a blocking allocation, at least in the general case. With GFP_KERNEL
we deadlock, either in rbd/libceph (less likely) or in the filesystem
above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
omitted in your quote).

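To make the GFP_NOIO vs GFP_KERNEL distinction concrete, here is a minimal
userspace sketch of how a scoped "no I/O" task flag narrows a hardcoded
GFP_KERNEL request. It is a toy model, not kernel code: every toy_* name and
TOY_* value is made up for illustration, and it only assumes that the flag
masks off the I/O and FS bits the way the kernel's
memalloc_noio_save()/memalloc_noio_restore() scope API does.

```c
#include <stdbool.h>

/* Toy stand-ins for GFP bits; the values are illustrative, not the kernel's. */
#define TOY_GFP_RECLAIM 0x1u   /* allocation may block and reclaim */
#define TOY_GFP_IO      0x2u   /* reclaim may start block I/O */
#define TOY_GFP_FS      0x4u   /* reclaim may call into the filesystem */
#define TOY_GFP_KERNEL  (TOY_GFP_RECLAIM | TOY_GFP_IO | TOY_GFP_FS)
#define TOY_GFP_NOIO    (TOY_GFP_RECLAIM)

/* Models a per-task "no I/O from reclaim" flag (a plain global here). */
static bool toy_task_noio;

/* Enter the no-I/O scope; returns the previous state so scopes can nest. */
static bool toy_noio_save(void)
{
	bool old = toy_task_noio;

	toy_task_noio = true;
	return old;
}

/* Leave the no-I/O scope by restoring the saved state. */
static void toy_noio_restore(bool old)
{
	toy_task_noio = old;
}

/* The mask the allocator would actually honour for a given request. */
static unsigned int toy_effective_gfp(unsigned int gfp)
{
	if (toy_task_noio)
		gfp &= ~(TOY_GFP_IO | TOY_GFP_FS);
	return gfp;
}

/*
 * Hypothetical stand-in for socket creation: the caller cannot pass a
 * mask down, so the callee always asks for TOY_GFP_KERNEL.
 */
static unsigned int toy_sock_create(void)
{
	return toy_effective_gfp(TOY_GFP_KERNEL);
}

/* Wrapping the call in a no-I/O scope narrows the request to TOY_GFP_NOIO. */
static unsigned int toy_connect_noio(void)
{
	bool old = toy_noio_save();
	unsigned int gfp = toy_sock_create();

	toy_noio_restore(old);
	return gfp;
}
```

Outside the scope toy_sock_create() still gets the full TOY_GFP_KERNEL mask,
so in this model reclaim could recurse into the filesystem; inside it the
request can block in the allocator but cannot issue I/O of its own.
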
Thanks,
Ilya