Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when do_checkpointfor better performance

From: Gu Zheng
Date: Thu Oct 10 2013 - 22:21:35 EST


Hi Jin,
On 10/11/2013 07:54 AM, Jin Xu wrote:

>> Date: Thu, 10 Oct 2013 16:11:53 +0800
>> From: guz.fnst@xxxxxxxxxxxxxx
>> To: jinuxstyle@xxxxxxxx
>> CC: yuan.mark.zhong@xxxxxxxxxxx; jaegeuk.kim@xxxxxxxxxxx; linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-fsdevel@xxxxxxxxxxxxxxx; shu.tan@xxxxxxxxxxx
>> Subject: Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when do_checkpoint for better performance
>>
>> Hi Jin,
>>
>> On 10/10/2013 04:09 PM, Jin Xu wrote:
>>
>> > Hi Gu,
>> >
>> > I have a comment below.
>> >
>> >> Date: Wed, 9 Oct 2013 12:04:09 +0800
>> >> From: guz.fnst@xxxxxxxxxxxxxx
>> >> To: yuan.mark.zhong@xxxxxxxxxxx
>> >> CC: jaegeuk.kim@xxxxxxxxxxx; linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-fsdevel@xxxxxxxxxxxxxxx; shu.tan@xxxxxxxxxxx
>> >> Subject: Re: [f2fs-dev] [PATCH v2] f2fs: avoid congestion_wait when do_checkpoint for better performance
>> >>
>> >> Hi Yuan,
>> >> On 10/08/2013 07:30 PM, Yuan Zhong wrote:
>> >>
>> > ...
>> >>
>> >> Signed-off-by: Gu Zheng <guz.fnst@xxxxxxxxxxxxxx>
>> >> ---
>> >> fs/f2fs/checkpoint.c | 11 +++++++++--
>> >> fs/f2fs/f2fs.h | 1 +
>> >> fs/f2fs/segment.c | 4 ++++
>> >> 3 files changed, 14 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>> >> index d808827..2a5999d 100644
>> >> --- a/fs/f2fs/checkpoint.c
>> >> +++ b/fs/f2fs/checkpoint.c
>> >> @@ -757,8 +757,15 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
>> >> f2fs_put_page(cp_page, 1);
>> >>
>> >> /* wait for previous submitted node/meta pages writeback */
>> >> - while (get_pages(sbi, F2FS_WRITEBACK))
>> >> - congestion_wait(BLK_RW_ASYNC, HZ / 50);
>> >> + sbi->cp_task = current;
>> >> + while (get_pages(sbi, F2FS_WRITEBACK)) {
>> >> + set_current_state(TASK_UNINTERRUPTIBLE);
>> >> + if (!get_pages(sbi, F2FS_WRITEBACK))
>> >> + break;
>> >> + io_schedule();
>> >> + }
>> >> + __set_current_state(TASK_RUNNING);
>> >> + sbi->cp_task = NULL;
>> >>
>> >> filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
>> >> filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
>> >> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> >> index a955a59..408ace7 100644
>> >> --- a/fs/f2fs/f2fs.h
>> >> +++ b/fs/f2fs/f2fs.h
>> >> @@ -365,6 +365,7 @@ struct f2fs_sb_info {
>> >> struct mutex writepages; /* mutex for writepages() */
>> >> int por_doing; /* recovery is doing or not */
>> >> int on_build_free_nids; /* build_free_nids is doing */
>> >> + struct task_struct *cp_task; /* checkpoint task */
>> >>
>> >> /* for orphan inode management */
>> >> struct list_head orphan_inode_list; /* orphan inode list */
>> >> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>> >> index bd79bbe..3b20359 100644
>> >> --- a/fs/f2fs/segment.c
>> >> +++ b/fs/f2fs/segment.c
>> >> @@ -597,6 +597,10 @@ static void f2fs_end_io_write(struct bio *bio, int err)
>> >>
>> >> if (p->is_sync)
>> >> complete(p->wait);
>> >> +
>> >> + if (!get_pages(p->sbi, F2FS_WRITEBACK) && p->sbi->cp_task)
>> >> + wake_up_process(p->sbi->cp_task);
>> >
>> > There is a risk of dereferencing a NULL pointer because here simply comparing the
>> > cp_task against NULL is not enough to avoid race in multi-thread environment.
>> > Another thread could have assigned it to NULL in the window between the comparison
>> > and waking up.
>>
>> Can not be that, checkpoint routine is always singleton and protected by cp_mutex and cp_rwsem.
>>
>
> The race could happen like this for example:
> On a SMP environment, thread 1 wakes up the checkpoint thread, then
> thread 2 comes to the f2fs_end_io_write, compared the cp_task as not NULL,
> but at the same time, the checkpoint thread just assigned the cp_task to NULL.
> When thread 2 gets to the wake_up_process, dereferencing to NULL pointer
> happens.

The case means that two or more IO are going on. If thread 1 wake up checkpoint, it'll check
get_pages(p->sbi, F2FS_WRITEBACK) before going down to assign the cp_task to NULL, so when
thread 2 gets to the wake_up_process the cp_task is still valid.
The "while (get_pages(sbi, F2FS_WRITEBACK))" loop is used to avoid this issue.

Thanks,
Gu

>
> Jin
>
>> Thanks,
>> Gu
>>
>> >
>> > Regards,
>> > Jin
>> >> +
>> >> kfree(p);
>> >> bio_put(bio);
>> >> }
>> >> --
>> >> 1.7.7
>> >>
>> >> Regards,
>> >> Gu
>> >>
>> >> >
>> >> >
>> >> >>> This is a problem here, especially, when sync a large number of small files or dirs.
>> >> >>> In order to avoid this, a wait_list is introduced,
>> >> >>> the checkpoint thread will be dropped into the wait_list if the pages have not been written back,
>> >> >>> and will be waked up by contrast.
>> >> >
>> >> >> Please pay some attention to the mail form, this mail is out of format in my mail client.
>> >> >
>> >> >> Regards,
>> >> >> Gu
>> >> >
>> >> > Regards,
>> >> > Yuan
>> >> >
>> >> >>>
>> >> >>> Signed-off-by: Yuan Zhong <yuan.mark.zhong@xxxxxxxxxxx>
>> >> >>> ---
>> >> >>> fs/f2fs/checkpoint.c | 3 +--
>> >> >>> fs/f2fs/f2fs.h | 19 +++++++++++++++++++
>> >> >>> fs/f2fs/segment.c | 1 +
>> >> >>> fs/f2fs/super.c | 1 +
>> >> >>> 4 files changed, 22 insertions(+), 2 deletions(-)
>> >> >>>
>> >> >>> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
>> >> >>> index ca39442..5d69ae0 100644
>> >> >>> --- a/fs/f2fs/checkpoint.c
>> >> >>> +++ b/fs/f2fs/checkpoint.c
>> >> >>> @@ -758,8 +758,7 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
>> >> >>> f2fs_put_page(cp_page, 1);
>> >> >>>
>> >> >>> /* wait for previous submitted node/meta pages writeback */
>> >> >>> - while (get_pages(sbi, F2FS_WRITEBACK))
>> >> >>> - congestion_wait(BLK_RW_ASYNC, HZ / 50);
>> >> >>> + f2fs_writeback_wait(sbi);
>> >> >>>
>> >> >>> filemap_fdatawait_range(sbi->node_inode->i_mapping, 0, LONG_MAX);
>> >> >>> filemap_fdatawait_range(sbi->meta_inode->i_mapping, 0, LONG_MAX);
>> >> >>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>> >> >>> index 7fd99d8..4b0d70e 100644
>> >> >>> --- a/fs/f2fs/f2fs.h
>> >> >>> +++ b/fs/f2fs/f2fs.h
>> >> >>> @@ -18,6 +18,8 @@
>> >> >>> #include <linux/crc32.h>
>> >> >>> #include <linux/magic.h>
>> >> >>> #include <linux/kobject.h>
>> >> >>> +#include <linux/wait.h>
>> >> >>> +#include <linux/sched.h>
>> >> >>>
>> >> >>> /*
>> >> >>> * For mount options
>> >> >>> @@ -368,6 +370,7 @@ struct f2fs_sb_info {
>> >> >>> struct mutex fs_lock[NR_GLOBAL_LOCKS]; /* blocking FS operations */
>> >> >>> struct mutex node_write; /* locking node writes */
>> >> >>> struct mutex writepages; /* mutex for writepages() */
>> >> >>> + wait_queue_head_t writeback_wqh; /* wait_queue for writeback */
>> >> >>> unsigned char next_lock_num; /* round-robin global locks */
>> >> >>> int por_doing; /* recovery is doing or not */
>> >> >>> int on_build_free_nids; /* build_free_nids is doing */
>> >> >>> @@ -961,6 +964,22 @@ static inline int f2fs_readonly(struct super_block *sb)
>> >> >>> return sb->s_flags & MS_RDONLY;
>> >> >>> }
>> >> >>>
>> >> >>> +static inline void f2fs_writeback_wait(struct f2fs_sb_info *sbi)
>> >> >>> +{
>> >> >>> + DEFINE_WAIT(wait);
>> >> >>> +
>> >> >>> + prepare_to_wait(&sbi->writeback_wqh, &wait, TASK_UNINTERRUPTIBLE);
>> >> >>> + if (get_pages(sbi, F2FS_WRITEBACK))
>> >> >>> + io_schedule();
>> >> >>> + finish_wait(&sbi->writeback_wqh, &wait);
>> >> >>> +}
>> >> >>> +
>> >> >>> +static inline void f2fs_writeback_wake(struct f2fs_sb_info *sbi)
>> >> >>> +{
>> >> >>> + if (!get_pages(sbi, F2FS_WRITEBACK))
>> >> >>> + wake_up_all(&sbi->writeback_wqh);
>> >> >>> +}
>> >> >>> +
>> >> >>> /*
>> >> >>> * file.c
>> >> >>> */
>> >> >>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>> >> >>> index bd79bbe..0708aa9 100644
>> >> >>> --- a/fs/f2fs/segment.c
>> >> >>> +++ b/fs/f2fs/segment.c
>> >> >>> @@ -597,6 +597,7 @@ static void f2fs_end_io_write(struct bio *bio, int err)
>> >> >>>
>> >> >>> if (p->is_sync)
>> >> >>> complete(p->wait);
>> >> >>> + f2fs_writeback_wake(p->sbi);
>> >> >>> kfree(p);
>> >> >>> bio_put(bio);
>> >> >>> }
>> >> >>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>> >> >>> index 094ccc6..3ac6d85 100644
>> >> >>> --- a/fs/f2fs/super.c
>> >> >>> +++ b/fs/f2fs/super.c
>> >> >>> @@ -835,6 +835,7 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
>> >> >>> mutex_init(&sbi->gc_mutex);
>> >> >>> mutex_init(&sbi->writepages);
>> >> >>> mutex_init(&sbi->cp_mutex);
>> >> >>> + init_waitqueue_head(&sbi->writeback_wqh);
>> >> >>> for (i = 0; i < NR_GLOBAL_LOCKS; i++)
>> >> >>> mutex_init(&sbi->fs_lock[i]);
>> >> >>> mutex_init(&sbi->node_write);N‹§²æìr¸›yúèšØb²X¬¶Ç§vØ^–)Þº{.nÇ+‰·¥Š{±‘êçzX§¶›¡Ü¨}©ž²Æ zÚ&j:+v‰¨¾«‘êçzZ+€Ê+zf£¢·hšˆ§~†­†Ûiÿûàz¹®w¥¢¸?™¨è­Ú&¢)ߢf”ù^jÇ«y§m…á@A«a¶Úÿ 0¶ìh®å’i
>> >>
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/