Re: [PATCH] time.c::timespec_trunc: fix nanosecond file time rounding

From: John Stultz
Date: Tue Jun 16 2015 - 19:08:22 EST


On Tue, Jun 16, 2015 at 3:39 PM, Karsten Blees <karsten.blees@xxxxxxxxx> wrote:
> On 16.06.2015 at 19:07, John Stultz wrote:
>> On Tue, Jun 9, 2015 at 10:36 AM, Karsten Blees <karsten.blees@xxxxxxxxx> wrote:
>>> From: Karsten Blees <blees@xxxxxxx>
>>> Date: Tue, 9 Jun 2015 10:50:28 +0200
>>>
>>> The rounding optimization in timespec_trunc() is based on the incorrect
>>> assumptions that current_kernel_time() is rounded to jiffies resolution,
>>> and that jiffies resolution is a multiple of all potential file time
>>> granularities.
>>
>> Sorry, this is a little opaque on the first read. You're saying that
>> there are filesystems where the on-disk granularity is smaller than a
>> tick/jiffy, but larger than a nanosecond, right?
>>
>
> Yes, examples include CIFS, NTFS (100 ns) and CEPH, UDF (1000 ns).

Thanks. Adding these concrete examples to the commit message would be good.


> The current code assumes that rounding can be avoided if (gran <= ns_per_tick).
>
> However, this optimization is only valid if:
>
> 1. current_kernel_time().tv_nsec is already rounded to tick resolution.
> E.g. with HZ=1000 you would get tv_nsec = 1000000, 2000000, 3000000, but
> never 1000001. AFAICT this is not true; current_kernel_time() may be
> incremented only once per tick, but it's not rounded to tick resolution.
>
> 2. ns_per_tick is evenly divisible by gran, for all potential HZ and
> granularity values. IOW "(ns_per_tick % gran) == 0". This may have been
> true for HZ=100, 250, 1000, but not for HZ=300. E.g. if assumption 1
> above was true, HZ=300 would give you tv_nsec = 3333333, 6666666,
> 9999999... This would definitely need to be rounded to e.g. UDF
> resolution, even though (1000 <= 3333333) is clearly true.
>
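
To make assumption 2 concrete, here is a minimal userspace sketch of the
divisibility check. It assumes a nominal tick length of 10^9 / HZ ns, which
is an idealization of what the kernel computes, and the helper names
ns_per_tick(), tick_remainder() and print_tick_table() are made up for
illustration only.

```c
#include <stdio.h>

/* Nominal tick length in ns (assumption: 10^9 / HZ, truncated). */
static long ns_per_tick(long hz)
{
	return 1000000000L / hz;
}

/* Remainder of tick length divided by gran; 0 means the tick is a multiple of gran. */
static long tick_remainder(long hz, long gran)
{
	return ns_per_tick(hz) % gran;
}

/* Print the remainder for the HZ values and granularities mentioned in the thread. */
static void print_tick_table(void)
{
	static const long hz[] = { 100, 250, 300, 1000 };
	static const long gran[] = { 100, 1000 };	/* e.g. NTFS and UDF, in ns */
	size_t i, j;

	for (i = 0; i < sizeof(hz) / sizeof(hz[0]); i++)
		for (j = 0; j < sizeof(gran) / sizeof(gran[0]); j++)
			printf("HZ=%ld gran=%ld remainder=%ld\n",
			       hz[i], gran[j], tick_remainder(hz[i], gran[j]));
}
```

Under that assumption the remainder is 0 for HZ=100, 250 and 1000, but
nonzero for HZ=300, so skipping the truncation there could not be correct
even if timestamps were tick-aligned.
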
>>> Thus, sub-second portions of in-core file times are not rounded to on-disk
>>> granularity. I.e. file times may change when the inode is re-read from disk
>>> or when the file system is remounted.
>>>
>>> File systems with on-disk resolutions of exactly 1 ns or 1 s are not
>>> affected by this.
>>>
>>> Steps to reproduce with e.g. UDF:
>>>
>>> $ dd if=/dev/zero of=udfdisk count=10000 && mkudffs udfdisk
>>> $ mkdir udf && mount udfdisk udf
>>> $ touch udf/test && stat -c %y udf/test
>>> 2015-06-09 10:22:56.130006767 +0200
>>> $ umount udf && mount udfdisk udf
>>> $ stat -c %y udf/test
>>> 2015-06-09 10:22:56.130006000 +0200
>>>
>>> Remounting rounds the mtime to 1 µs.
>>>
>>> Fix the rounding in timespec_trunc() and update the documentation.
>>>
>>> Note: This does _not_ fix the issue for FAT's 2 second mtime resolution,
>>> as struct super_block.s_time_gran isn't prepared to handle different
>>> ctime / mtime / atime resolutions nor resolutions > 1 second.
>>>
>>> Signed-off-by: Karsten Blees <blees@xxxxxxx>
>>> ---
>>>
>>> This issue came up in a recent discussion on the git ML about enabling
>>> nanosecond file times on Windows, see
>>>
>>> http://thread.gmane.org/gmane.comp.version-control.msysgit/21290/focus=21315
>>>
>>>
>>> kernel/time/time.c | 17 ++++-------------
>>> 1 file changed, 4 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/kernel/time/time.c b/kernel/time/time.c
>>> index 972e3bb..362ee06 100644
>>> --- a/kernel/time/time.c
>>> +++ b/kernel/time/time.c
>>> @@ -287,23 +287,14 @@ EXPORT_SYMBOL(jiffies_to_usecs);
>>> * @t: Timespec
>>> * @gran: Granularity in ns.
>>> *
>>> - * Truncate a timespec to a granularity. gran must be smaller than a second.
>>> - * Always rounds down.
>>> - *
>>> - * This function should be only used for timestamps returned by
>>> - * current_kernel_time() or CURRENT_TIME, not with do_gettimeofday() because
>>> - * it doesn't handle the better resolution of the latter.
>>> + * Truncate a timespec to a granularity. gran must not be greater than a
>>> + * second (10^9 ns). Always rounds down.
>>> */
>>> struct timespec timespec_trunc(struct timespec t, unsigned gran)
>>> {
>>> - /*
>>> - * Division is pretty slow so avoid it for common cases.
>>> - * Currently current_kernel_time() never returns better than
>>> - * jiffies resolution. Exploit that.
>>> - */
>>> - if (gran <= jiffies_to_usecs(1) * 1000) {
>>> + if (gran <= 1) {
>>> /* nothing */
>>
>> So this change will, in effect, cause us to truncate where granularity
>> was less than one tick, where before we didn't do anything. Have you
>> reviewed all users to ensure this is safe (I assume you have, but it
>> might be good to describe which users are affected in the commit
>> message)?
>>
>>
>
> timespec_trunc() is exclusively used to calculate inode's [acm]time.
> It is mostly called through current_fs_time(), only a handful of fs
> drivers use it directly (but always with super_block.s_time_gran as
> second argument).
>
> So I think changing the function to do what the documentation says it
> does should be safe...

Yea, though existing behavior is often more "expected" than documented
behavior. :)


>
>>> - } else if (gran == 1000000000) {
>>> + } else if (gran >= 1000000000) {
>>> t.tv_nsec = 0;
>>
>> While the code (which is quite old) wasn't super intuitive, this looks
>> to be making it more subtle instead of more clear. So if the
>> granularity is larger than a second, we just truncate to a second?
>> That seems surprising. If handling granularity larger than a second
>> isn't supported, we should probably make that explicit and add a
>> WARN_ON to catch problematic users of the function.
>
> Indeed, I changed this to catch invalid arguments (similar to how
> "gran <= 1" catches 0 and thus prevents division by zero).
>
> What about this instead?
>
> 	if (gran == 1) {
> 		/* nothing */
> 	} else if (gran == 1000000000) {
> 		t.tv_nsec = 0;
> 	} else if (gran < 1 || gran > 1000000000) {
> 		WARN_ON(1);
> 	} else {
> 		t.tv_nsec -= t.tv_nsec % gran;
> 	}
> 	return t;

Logically it's ok. I might suggest cleaning it up as:

	if ((gran < 1) || (gran > NSEC_PER_SEC))
		WARN_ON(1);	/* catch invalid granularity values */
	else if (gran == NSEC_PER_SEC)
		t.tv_nsec = 0;	/* special case to avoid div */
	else if ((gran > 1) && (gran < NSEC_PER_SEC))
		t.tv_nsec -= t.tv_nsec % gran;
	return t;

Also it would be good to make it clear in the function comment that
values of gran > NSEC_PER_SEC are invalid.
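
For anyone wanting to poke at the logic outside the kernel, below is a rough
userspace sketch of the cleanup suggested above. It is not the patch itself:
the name timespec_trunc_sketch() is made up, NSEC_PER_SEC is defined locally,
and an fprintf() to stderr stands in for WARN_ON(1), which is not available
in userspace.

```c
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000U	/* local definition for this sketch */

/*
 * Truncate t to a multiple of gran nanoseconds, always rounding down.
 * gran must be in the range [1, NSEC_PER_SEC]; anything else is reported
 * and t is returned unchanged.
 */
static struct timespec timespec_trunc_sketch(struct timespec t, unsigned gran)
{
	if (gran < 1 || gran > NSEC_PER_SEC)
		fprintf(stderr, "invalid granularity %u\n", gran);	/* stand-in for WARN_ON(1) */
	else if (gran == NSEC_PER_SEC)
		t.tv_nsec = 0;				/* whole seconds, no division needed */
	else if (gran > 1)
		t.tv_nsec -= t.tv_nsec % (long)gran;	/* drop the sub-granularity part */
	return t;
}
```

With the UDF example from the commit message, tv_nsec = 130006767 and
gran = 1000 give 130006000, which matches the value seen after the remount.
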

thanks
-john