Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi

From: Kemeng Shi
Date: Sat Apr 06 2024 - 22:48:27 EST




on 4/3/2024 11:04 PM, Brian Foster wrote:
> On Wed, Apr 03, 2024 at 04:49:42PM +0800, Kemeng Shi wrote:
>>
>>
>> on 3/29/2024 9:10 PM, Brian Foster wrote:
>>> On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote:
>>>> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
>>>> of bdi.
>>>>
>>>
>>> Hi Kemeng,
>> Hello Brian,
>>>
>>> Just a few random thoughts/comments..
>>>
>>>> Following domain hierarchy is tested:
>>>>                 global domain (320G)
>>>>                 /                 \
>>>>         cgroup domain1(10G)     cgroup domain2(10G)
>>>>                 |                 |
>>>> bdi            wb1               wb2
>>>>
>>>> /* per wb writeback info of bdi is collected */
>>>> cat /sys/kernel/debug/bdi/252:16/wb_stats
>>>> WbCgIno: 1
>>>> WbWriteback: 0 kB
>>>> WbReclaimable: 0 kB
>>>> WbDirtyThresh: 0 kB
>>>> WbDirtied: 0 kB
>>>> WbWritten: 0 kB
>>>> WbWriteBandwidth: 102400 kBps
>>>> b_dirty: 0
>>>> b_io: 0
>>>> b_more_io: 0
>>>> b_dirty_time: 0
>>>> state: 1
>>>
>>> Maybe some whitespace or something between entries would improve
>>> readability?
>> Sure, I will add a whitespace in next version.
>>>
>>>> WbCgIno: 4094
>>>> WbWriteback: 54432 kB
>>>> WbReclaimable: 766080 kB
>>>> WbDirtyThresh: 3094760 kB
>>>> WbDirtied: 1656480 kB
>>>> WbWritten: 837088 kB
>>>> WbWriteBandwidth: 132772 kBps
>>>> b_dirty: 1
>>>> b_io: 1
>>>> b_more_io: 0
>>>> b_dirty_time: 0
>>>> state: 7
>>>> WbCgIno: 4135
>>>> WbWriteback: 15232 kB
>>>> WbReclaimable: 786688 kB
>>>> WbDirtyThresh: 2909984 kB
>>>> WbDirtied: 1482656 kB
>>>> WbWritten: 681408 kB
>>>> WbWriteBandwidth: 124848 kBps
>>>> b_dirty: 0
>>>> b_io: 1
>>>> b_more_io: 0
>>>> b_dirty_time: 0
>>>> state: 7
>>>>
>>>> Signed-off-by: Kemeng Shi <shikemeng@xxxxxxxxxxxxxxx>
>>>> ---
>>>> include/linux/writeback.h | 1 +
>>>> mm/backing-dev.c | 88 +++++++++++++++++++++++++++++++++++++++
>>>> mm/page-writeback.c | 19 +++++++++
>>>> 3 files changed, 108 insertions(+)
>>>>
>>> ...
>>>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>>>> index 8daf950e6855..e3953db7d88d 100644
>>>> --- a/mm/backing-dev.c
>>>> +++ b/mm/backing-dev.c
>>>> @@ -103,6 +103,91 @@ static void collect_wb_stats(struct wb_stats *stats,
>>>> }
>>>>
>>>> #ifdef CONFIG_CGROUP_WRITEBACK
>>> ...
>>>> +static int cgwb_debug_stats_show(struct seq_file *m, void *v)
>>>> +{
>>>> +	struct backing_dev_info *bdi;
>>>> +	unsigned long background_thresh;
>>>> +	unsigned long dirty_thresh;
>>>> +	struct bdi_writeback *wb;
>>>> +	struct wb_stats stats;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	bdi = lookup_bdi(m);
>>>> +	if (!bdi) {
>>>> +		rcu_read_unlock();
>>>> +		return -EEXIST;
>>>> +	}
>>>> +
>>>> +	global_dirty_limits(&background_thresh, &dirty_thresh);
>>>> +
>>>> +	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
>>>> +		memset(&stats, 0, sizeof(stats));
>>>> +		stats.dirty_thresh = dirty_thresh;
>>>
>>> If you did something like the following here, wouldn't that also zero
>>> the rest of the structure?
>>>
>>> struct wb_stats stats = { .dirty_thresh = dirty_thresh };
>>>
>> Sure, will do it in next version.
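For reference, a minimal userspace sketch of the initializer form suggested
above. The structure here is a hypothetical stand-in, not the kernel's
struct wb_stats; only dirty_thresh and wb_thresh are names taken from the
patch. C initializes members omitted from a designated initializer as if
they had static storage, i.e. to zero, so the separate memset becomes
redundant.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's struct wb_stats; the real
 * structure has different members. */
struct wb_stats_demo {
	unsigned long nr_dirty;
	unsigned long nr_writeback;
	unsigned long dirty_thresh;
	unsigned long wb_thresh;
};

/* Members not named in the designated initializer are zero-initialized,
 * so no memset() is needed before filling in dirty_thresh. */
struct wb_stats_demo init_stats(unsigned long dirty_thresh)
{
	struct wb_stats_demo stats = { .dirty_thresh = dirty_thresh };

	/* Sanity check for the demo only: unnamed members start at zero. */
	assert(stats.nr_dirty == 0 && stats.wb_thresh == 0);
	return stats;
}
```

Declaring the variable inside the loop body with this initializer gives a
freshly zeroed structure on every iteration.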
>>>> +		collect_wb_stats(&stats, wb);
>>>> +
>>>
>>> Also, similar question as before on whether you'd want to check
>>> WB_registered or something here..
>> Still prefer to keep full debug info and user could filter out on
>> demand.
>
> Ok. I was more wondering if that was needed for correctness. If not,
> then that seems fair enough to me.
For bdi->wb, it's unavailable after release_bdi. As bdi_debug_unregister
will block bdi_unregister, release_bdi cannot have been reached yet, so
it's safe to collect bdi->wb info.
For wb in cgroup, it's unavailable after cgwb_release_workfn; we can
prevent this with wb_tryget before collection.
So it's correct for per-wb stats, but we need to add an extra wb_tryget in
bdi stats in patch 2, and I will do it in next version.
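
To illustrate the tryget idea, here is a rough userspace sketch under
stated assumptions. As far as I recall, wb_tryget in the kernel is built on
percpu_ref; the sketch swaps in a plain C11 atomic counter instead, and
struct obj, obj_tryget and obj_put are hypothetical names. The point is
only that a reference can no longer be taken once the count has dropped to
zero, so a reader that gets a reference can rely on release not having run.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a cgroup wb. */
struct obj {
	atomic_long refcnt;	/* 0 means release has started */
	bool released;
};

/* Take a reference only if the count has not already dropped to zero. */
bool obj_tryget(struct obj *o)
{
	long old = atomic_load(&o->refcnt);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last put triggers release. */
void obj_put(struct obj *o)
{
	long old = atomic_fetch_sub(&o->refcnt, 1);

	assert(old > 0);		/* a put must pair with a get */
	if (old == 1)
		o->released = true;	/* stands in for the release work */
}
```

A reader would call obj_tryget, skip the object if it returns false, and
call obj_put once it has finished collecting.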
>
>>>
>>>> +		if (mem_cgroup_wb_domain(wb) == NULL) {
>>>> +			wb_stats_show(m, wb, &stats);
>>>> +			continue;
>>>> +		}
>>>
>>> Can you explain what this logic is about? Is the cgwb_calc_thresh()
>>> thing not needed in this case? A comment might help for those less
>>> familiar with the implementation details.
>> If mem_cgroup_wb_domain(wb) is NULL, then it's bdi->wb, otherwise,
>> it's wb in cgroup. For bdi->wb, there is no need to do wb_tryget
>> and cgwb_calc_thresh. Will add some comment in next version.
>>>
>>> BTW, I'm also wondering if something like the following is correct
>>> and/or roughly equivalent:
>>>
>>> 	list_for_each_*(wb, ...) {
>>> 		struct wb_stats stats = ...;
>>>
>>> 		if (!wb_tryget(wb))
>>> 			continue;
>>>
>>> 		collect_wb_stats(&stats, wb);
>>>
>>> 		/*
>>> 		 * Extra wb_thresh magic. Drop rcu lock because ... . We
>>> 		 * can do so here because we have a ref.
>>> 		 */
>>> 		if (mem_cgroup_wb_domain(wb)) {
>>> 			rcu_read_unlock();
>>> 			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
>>> 			rcu_read_lock();
>>> 		}
>>>
>>> 		wb_stats_show(m, wb, &stats);
>>> 		wb_put(wb);
>>> 	}
>> It's correct, as wb_tryget on bdi->wb does no harm. I had considered
>> doing it this way, but changed my mind and did it the new way for
>> two reasons:
>> 1. Keep the code handling wb in cgroup closer together, which could be
>> easier to maintain.
>> 2. Remove the extra wb_tryget/wb_put for wb in bdi.
>> Would this make sense to you?
>
> Ok, well assuming it is correct the above logic is a bit more simple and
> readable to me. I think you'd just need to fill in the comment around
> the wb_thresh thing rather than i.e. having to explain we don't need to
> ref bdi->wb even though it doesn't seem to matter.
>
> I kind of feel the same on the wb_stats file thing below just because it
> seems more consistent and available if wb_stats eventually grows more
> wb-specific data.
>
> That said, this is subjective and not hugely important so I don't insist
> on either point. Maybe wait a bit and see if Jan or Tejun or somebody
> has any thoughts..? If nobody else expresses explicit preference then
> I'm good with it either way.
Sure, I will wait for a few days and then decide which way to use in next
version.

Thanks so much for all the advice.

Kemeng
>
> Brian
>
>>>
>>>> +
>>>> +		/*
>>>> +		 * cgwb_release will destroy wb->memcg_completions which
>>>> +		 * will be used in cgwb_calc_thresh. Use wb_tryget to prevent
>>>> +		 * memcg_completions destruction from cgwb_release.
>>>> +		 */
>>>> +		if (!wb_tryget(wb))
>>>> +			continue;
>>>> +
>>>> +		rcu_read_unlock();
>>>> +		/* cgwb_calc_thresh may sleep in cgroup_rstat_flush */
>>>> +		stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
>>>> +		wb_stats_show(m, wb, &stats);
>>>> +		rcu_read_lock();
>>>> +		wb_put(wb);
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats);
>>>> +
>>>> +static void cgwb_debug_register(struct backing_dev_info *bdi)
>>>> +{
>>>> +	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
>>>> +			    &cgwb_debug_stats_fops);
>>>> +}
>>>> +
>>>> static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>> struct wb_stats *stats)
>>>> {
>>>> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
>>>> {
>>>> collect_wb_stats(stats, &bdi->wb);
>>>> }
>>>> +
>>>> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
>>>
>>> Could we just create the wb_stats file regardless of whether cgwb is
>>> enabled? Obviously there's only one wb in the !CGWB case and it's
>>> somewhat duplicative with the bdi stats file, but that seems harmless if
>>> the same code can be reused..? Maybe there's also a small argument for
>>> dropping the state info from the bdi stats file and moving it to
>>> wb_stats.
>> In backing-dev.c, there are a lot of "#ifdef CGWB .. #else .. #endif" blocks to
>> avoid unneeded extra cost when CGWB is not enabled.
>> I think it's better to avoid extra cost from wb_stats when CGWB is not
>> enabled. For now, we only save the cpu cost to create and destroy wb_stats
>> and the memory cost to record the debugfs file; we could save more in
>> future when wb_stats records more debug info.
>> Moving state info from bdi stats to wb_stats makes sense to me. The only
>> concern would be a compatibility problem. I will do this in a separate
>> new patch so it is more noticeable and easier to revert.
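
The "#ifdef .. #else .. #endif" stub pattern mentioned above can be
sketched in plain C as follows. This is only an illustration:
DEMO_CGROUP_WRITEBACK, demo_debug_register and demo_bdi_register are
hypothetical names standing in for CONFIG_CGROUP_WRITEBACK and the
register helpers, and the counter stands in for the debugfs files.

```c
#include <assert.h>

/* Hypothetical config switch standing in for CONFIG_CGROUP_WRITEBACK. */
#ifdef DEMO_CGROUP_WRITEBACK
/* Real implementation: account for one extra per-wb debug file. */
static inline int demo_debug_register(int *nr_files)
{
	(*nr_files)++;
	return 1;
}
#else
/* Empty stub: callers need no #ifdef and the call compiles away. */
static inline int demo_debug_register(int *nr_files)
{
	(void)nr_files;
	return 0;
}
#endif

/* The caller is identical in both configurations. */
int demo_bdi_register(void)
{
	int nr_files = 1;	/* the always-present "stats" file */

	demo_debug_register(&nr_files);
	assert(nr_files >= 1);
	return nr_files;
}
```

With the switch undefined, the stub costs nothing at runtime and no extra
file is accounted for, which is the saving argued for above.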
>> Thanks a lot for review!
>>
>> Kemeng
>>>
>>> Brian
>>>
>>>> #endif
>>>>
>>>> static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>>> @@ -182,6 +269,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
>>>>
>>>> debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
>>>> &bdi_debug_stats_fops);
>>>> + cgwb_debug_register(bdi);
>>>> }
>>>>
>>>> static void bdi_debug_unregister(struct backing_dev_info *bdi)
>>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>>> index 0e20467367fe..3724c7525316 100644
>>>> --- a/mm/page-writeback.c
>>>> +++ b/mm/page-writeback.c
>>>> @@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
>>>> return __wb_calc_thresh(&gdtc, thresh);
>>>> }
>>>>
>>>> +unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
>>>> +{
>>>> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>>>> +	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
>>>> +	unsigned long filepages, headroom, writeback;
>>>> +
>>>> +	gdtc.avail = global_dirtyable_memory();
>>>> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
>>>> +		     global_node_page_state(NR_WRITEBACK);
>>>> +
>>>> +	mem_cgroup_wb_stats(wb, &filepages, &headroom,
>>>> +			    &mdtc.dirty, &writeback);
>>>> +	mdtc.dirty += writeback;
>>>> +	mdtc_calc_avail(&mdtc, filepages, headroom);
>>>> +	domain_dirty_limits(&mdtc);
>>>> +
>>>> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
>>>> +}
>>>> +
>>>> /*
>>>> *                           setpoint - dirty 3
>>>> *        f(dirty) := 1.0 + (----------------)
>>>> --
>>>> 2.30.0
>>>>
>>>
>>>
>>
>
>