Re: [PATCH v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list
From: Jan Kara
Date: Tue Feb 14 2023 - 03:50:27 EST
Hi Ojaswin!
On Mon 13-02-23 23:28:41, Ojaswin Mujoo wrote:
> On Fri, Feb 10, 2023 at 03:37:53PM +0100, Jan Kara wrote:
> > > See my comments at the end for more info.
> > >
> > > >
> > > > > Also, another thing I noticed is that after ext4_mb_normalize_request(),
> > > > > sometimes the original range can also exceed the normalized goal range,
> > > > > which again was a bit surprising to me, since my understanding was
> > > > > that the normalized range would always encompass the original range.
> > > >
> > > > Well, isn't that because (part of) the requested original range is already
> > > > preallocated? Or what causes the goal range to be shortened?
> > > >
> > > Yes, I think pre-existing PAs could be one of the causes.
> > >
> > > Other than that, I'm also seeing some cases of sparse writes where
> > > ext4_mb_normalize_request() can leave the original range overflowing out
> > > of the goal range. For example, I observed these values just after the
> > > if/else-if chain in the function, before we check whether the range
> > > overlaps pre-existing PAs:
> > >
> > > orig_ex:2045/2055(len:10) normalized_range:0/2048, orig_isize:8417280
> > >
> > > Basically, since isize is large and we are doing a sparse write, we end
> > > up in the following if condition:
> > >
> > > 	} else if (NRL_CHECK_SIZE(ac->ac_o_ex.fe_len,
> > > 				  (8<<20)>>bsbits, max, 8 * 1024)) {
> > > 		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
> > > 					(23 - bsbits)) << 23;
> > > 		size = 8 * 1024 * 1024;
> > > 	}
> > >
> > > Resulting in a normalized range that does not cover the original range.
> >
> > I see.
> >
> > > Now, in any case, once we get such an overflow, if we try to enter the PA
> > > adjustment window in the ext4_mb_new_inode_pa() function, we will again
> > > end up with a best extent overflowing out of the goal extent, since we
> > > would try to cover the original extent.
> > >
> > > So yeah, it seems like there are 2 cases where we could end up with
> > > overlapping PAs:
> > >
> > > 1. Due to the off calculation in the PA adjustment window, as we discussed.
> > > 2. Due to the original extent overflowing out of the goal extent.
> > >
> > > I think the 3-step solution you proposed works well to counter 1 but not
> > > 2, so we probably need some more logic on top of your solution to take
> > > care of that. I'll think some more on how to fix this, but I think this
> > > will be a separate patch.
> >
> > Well, my algorithm will still result in preallocation being within the goal
> > range AFAICS. In the example above we had:
> >
> > Orig 2045/2055 Goal 0/2048
> >
> > Suppose we found 200 blocks. So we try placing the preallocation like:
> >
> > 1848/2048; it covers the original starting block 2045, so we are fine and
> > create preallocation 1848/2048. Nothing has overflown the goal window...
>
> Ahh okay, I thought you meant checking if we covered the complete
> original range instead of just the original start. In that case we
> should be good.
>
> However, I still feel that the example where we unnecessarily end up with
> a smaller goal length than the original length should not happen. Maybe in
> ext4_mb_normalize_request(), instead of hardcoding the goal lengths for
> different i_sizes, we can select the next power of 2 greater than the
> original length, or some similar heuristic. What do you think?
I agree that seems suboptimal, but I'd leave tweaking this heuristic for a
separate patchset. In this one I'd just do the minimum fix to make the ranges
not overlap. The current heuristic makes sense as an anti-fragmentation
measure when there's enough free space. We can tweak the goal window
heuristic of mballoc, but that would require some deeper thought and
measurements of how it affects fragmentation for various workloads, so it is
not just about changing those several lines of code...
Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR