Re: [PATCH v4 3/7] mm: Allow filesystems to defer cmtime updates

From: Andy Lutomirski
Date: Wed Sep 04 2013 - 16:05:53 EST


On Wed, Sep 4, 2013 at 12:20 PM, Jan Kara <jack@xxxxxxx> wrote:
> On Wed 04-09-13 10:54:50, Andy Lutomirski wrote:
>> >> @@ -1970,6 +1988,39 @@ int write_one_page(struct page *page, int wait)
>> >> }
>> >> EXPORT_SYMBOL(write_one_page);
>> >>
>> >> +void mapping_flush_cmtime(struct address_space *mapping)
>> >> +{
>> >> + if (mapping_test_clear_cmtime(mapping) &&
>> >> + mapping->a_ops->update_cmtime_deferred)
>> >> + mapping->a_ops->update_cmtime_deferred(mapping);
>> >> +}
>> >> +EXPORT_SYMBOL(mapping_flush_cmtime);
>> > Hum, is there a reason for update_cmtime_deferred() operation? I can
>> > hardly imagine anyone will want to do anything else than what
>> > inode_update_time_writable() does so why bother? You mention tmpfs & co.
>> > don't fit into your scheme well with which I agree so let's just keep
>> > file_update_time() in their page_mkwrite() operation. But I don't see a
>> > real need for avoiding the deferred cmtime logic...
>>
>> I think there might be odd corner cases. For example, mmap a tmpfs
>> file, write it, and unmap it. Then, an hour later, maybe the system
> If you unmap it then that will handle the update. But if you don't unmap,
> you'd get spurious updates of timestamps which would be strange.
>
>> will be under memory pressure and page out the file. This could
>> trigger a surprising time update. (I'm not sure this can actually
>> happen on tmpfs, but maybe it would on some other filesystem.)
>>
>> Does this actually matter? A flag to turn the feature on or off would
>> do the trick, but I don't think there's precedent for sticking a flag
>> in a_ops.
> Flag in a_ops is ugly. But you can have a flag in 'struct
> file_system_type' which would be reasonable.

OK, will do.
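
For illustration only, a userspace sketch of what a per-filesystem opt-in
might look like. The flag name FS_DEFERS_CMTIME, its value, and every
struct and helper below are simplified stand-ins assumed for this example;
none of them is taken from the patch, and the real test-and-clear would be
an atomic bit operation on the mapping flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opt-in flag: name and value are assumptions for this sketch. */
#define FS_DEFERS_CMTIME 0x1000

/* Minimal stand-ins for the kernel structures involved. */
struct file_system_type { int fs_flags; };
struct super_block { struct file_system_type *s_type; };
struct inode { struct super_block *i_sb; int cmtime_updates; };
struct address_space { struct inode *host; bool cmtime_pending; };

/* Stand-in for the real timestamp update: only counts how often it ran. */
static void mock_update_time(struct inode *inode)
{
	inode->cmtime_updates++;
}

/* Non-atomic mock of test-and-clear; returns the previous state. */
static bool mapping_test_clear_cmtime(struct address_space *mapping)
{
	bool was_pending = mapping->cmtime_pending;

	mapping->cmtime_pending = false;
	return was_pending;
}

/* Flush a pending cmtime update, but only for filesystems that opted in. */
static void mapping_flush_cmtime(struct address_space *mapping)
{
	struct inode *inode = mapping->host;

	if (!(inode->i_sb->s_type->fs_flags & FS_DEFERS_CMTIME))
		return;
	if (mapping_test_clear_cmtime(mapping))
		mock_update_time(inode);
}

/* Usage: one pending update is flushed exactly once. */
static void example_flush_once(void)
{
	struct file_system_type type = { FS_DEFERS_CMTIME };
	struct super_block sb = { &type };
	struct inode ino = { &sb, 0 };
	struct address_space mapping = { &ino, true };

	mapping_flush_cmtime(&mapping);
	mapping_flush_cmtime(&mapping);
	assert(ino.cmtime_updates == 1);
}
```

With a flag like this, the update_cmtime_deferred a_ops hook is no longer
needed, and filesystems such as tmpfs simply leave the flag unset and keep
calling file_update_time() from page_mkwrite.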

>
>> >> +void mapping_flush_cmtime_nowb(struct address_space *mapping)
>> >> +{
>> >> + /*
>> >> + * We get called from munmap and msync. Both calls can race
>> >> + * with fs freezing. If the fs is frozen after
>> >> + * mapping_test_clear_cmtime but before the time update, then
>> >> + * sync_filesystem will miss the cmtime update (because we
>> >> + * just cleared it) and we won't be able to write (because the
>> >> + * fs is frozen). On the other hand, we can't just return if
>> >> + * we're in the SB_FREEZE_PAGEFAULT state because our caller
>> >> + * expects the timestamp to be synchronously updated. So we
>> >> + * get write access without blocking, at the SB_FREEZE_FS
>> >> + * level. If the fs is already fully frozen, then we already
>> >> + * know we have nothing to do.
>> >> + */
>> >> +
>> >> + if (!mapping_test_cmtime(mapping))
>> >> + return; /* Optimization: nothing to do. */
>> >> +
>> >> + if (__sb_start_write(mapping->host->i_sb, SB_FREEZE_FS, false)) {
>> >> + mapping_flush_cmtime(mapping);
>> >> + __sb_end_write(mapping->host->i_sb, SB_FREEZE_FS);
>> >> + }
>> >> +}
>> > This is wrong because SB_FREEZE_FS level is targeted for filesystem
>> > internal use. Also it is racy. mapping_flush_cmtime() ends up calling
>> > mark_inode_dirty() and filesystems such as ext4 or xfs will start a
>> > transaction to store inode in the journal. This gets freeze protection at
>> > SB_FREEZE_FS level again. If freeze_super() sets s_writers.frozen to
>> > SB_FREEZE_FS before this second protection, things will deadlock.
>>
>> Whoops -- I assumed that it was safe to recursively take freeze
>> protection at the same level.
>>
>> I'm worried about the following race:
>>
>> Thread 1 (in munmap):
>> Check AS_CMTIME set
>> sb_start_pagefault
>>
>> Thread 2 (freezing the fs):
>> frozen = SB_FREEZE_PAGEFAULT;
>> sync_filesystem()
>>
>> Thread 1 is now stuck. It doesn't need to be, because sync_filesystem
>> will flush out the cmtime write. But there doesn't seem to be a clean
>> mechanism to wait for the freeze to finish.
> OK, I see. Frankly, I'd rather live with msync() and munmap() blocking
> while the filesystem is frozen than trying to outsmart the freezing logic...
> If someone comes up with a use case where it causes trouble, we can always
> improve the logic with some clever tricks.

I'll at least check that it's a shared writable mapping before doing
the flush to avoid blocking on other types of munmap.
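
Roughly, the check could look like the sketch below. This is a userspace
mock, not the patch: the struct is a stand-in, the helper name
vma_wants_cmtime_flush is made up for this example, and the flag values are
copied in so the snippet builds outside the kernel. Whether VM_WRITE or
VM_MAYWRITE is the right bit to test (mprotect can drop write permission
after pages were dirtied) is an open question here.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock definitions so the sketch builds in userspace. */
#define VM_WRITE  0x00000002UL
#define VM_SHARED 0x00000008UL

/* Stand-in for the kernel's vm_area_struct: only the flags matter here. */
struct vm_area_struct { unsigned long vm_flags; };

/*
 * Hypothetical helper: only shared writable mappings can have dirtied
 * file pages, so only those need the (possibly blocking) cmtime flush.
 */
static bool vma_wants_cmtime_flush(const struct vm_area_struct *vma)
{
	const unsigned long need = VM_SHARED | VM_WRITE;

	return (vma->vm_flags & need) == need;
}

/* Usage: a private writable mapping never triggers the flush. */
static void example_private_mapping(void)
{
	struct vm_area_struct private_rw = { VM_WRITE };

	assert(!vma_wants_cmtime_flush(&private_rw));
}
```

That way munmap of private or read-only mappings never takes freeze
protection and so cannot block on a frozen filesystem.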

--Andy