Re: [RFC v3 8/8] ext4: Remove the logic to trim inode PAs

From: Ojaswin Mujoo
Date: Thu Oct 06 2022 - 02:55:41 EST


On Thu, Sep 29, 2022 at 02:53:11PM +0200, Jan Kara wrote:
> On Tue 27-09-22 14:46:48, Ojaswin Mujoo wrote:
> > Earlier, inode PAs were stored in a linked list. This caused a need to
> > periodically trim the list down inorder to avoid growing it to a very
> > large size, as this would severly affect performance during list
> > iteration.
> >
> > Recent patches changed this list to an rbtree, and since the tree scales
> > up much better, we no longer need to have the trim functionality, hence
> > remove it.
> >
> > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
>
> I'm kind of wondering: Now there won't be performance issues with much
> more inode PAs but probably we don't want to let them grow completely out
> of control? E.g. I can imagine that if we'd have 1 billion of inode PAs
> attached to an inode, things would get wonky both in terms of memory
> consumption and also in terms of CPU time spent for the cases where we
> still do iterate all of the PAs... Is there anything which keeps inode PAs
> reasonably bounded?
>
> Honza
>
Hi Jan,

Sorry for the delay in response, I was on leave for the last few days.

So as per my understanding, after this patch, the only path where we
would need to traverse all the PAs is the ext4_discard_preallocations()
call where we discard all the PAs of an inode one by one (eg when
closing the file etc). Such a discard is a colder path as we don't
usually expect to do it as often as say allocating blocks to an inode.

Originally, the limit was added in this patch [1] because of the time
lost in O(N) traversal in the allocation path (ext4_mb_use_preallocated
and ext4_mb_normalize_request). Since the rbtree addressed this
scalability issue we had decided to remove the trim logic in this
patchset.

[1]
https://lore.kernel.org/all/d7a98178-056b-6db5-6bce-4ead23f4a257@xxxxxxxxx/

That being said, I do agree that there should be some way to limit the
PAs from taking up an unreasonable amount of buddy space, memory and CPU
cycles in use cases like database files and disk files of long running
VMs. Previously the limit was 512 PAs per inode and trim was happening
in an LRU fashion, which is not very straightforward to implement in
trees.

Another approach is rather than having a hard limit, we can throttle the
PAs based on some parameter like total active PAs in FS or FSUtil% of
the PAs but we might need to take care of fairness so one inode is not
holding all the PAs while others get throttled.

Anyways, I think the trimming part would need some brainstorming to get
right so just wondering if we could keep that as part of a separate
patchset and remove the trimming logic for now since rbtree has
addressed the scalability concerns in allocation path.

Do let me know your thoughts on this.

Regards,
Ojaswin