Re: [PATCH v1 00/11] mm/kasan: support per-page shadow memory to reduce memory consumption

From: Joonsoo Kim
Date: Wed May 17 2017 - 03:23:44 EST


On Tue, May 16, 2017 at 01:49:10PM -0700, Dmitry Vyukov wrote:
> > Anyway, I have missed inline instrumentation completely.
> >
> > I will attach the fix in the bottom. It doesn't look beautiful
> > since it breaks layer design (some check will be done at report
> > function). However, I think that it's a good trade-off.
>
>
> I can confirm that inline works with that patch.

Thanks for confirming!

>
> I can also confirm that it reduces memory usage. I've booted qemu with
> 2G ram and run some fixed workload. Before:
> 31853 dvyukov 20 0 3043200 765464 21312 S 366.0 4.7 2:39.53
> qemu-system-x86
> 7528 dvyukov 20 0 3043200 732444 21676 S 333.3 4.5 2:23.19
> qemu-system-x86
> After:
> 6192 dvyukov 20 0 3043200 394244 20636 S 17.9 2.4 2:32.95
> qemu-system-x86
> 6265 dvyukov 20 0 3043200 388860 21416 S 399.3 2.4 3:02.88
> qemu-system-x86
> 9005 dvyukov 20 0 3043200 383564 21220 S 397.1 2.3 2:35.33
> qemu-system-x86
>
> However, I see some very significant slowdowns with inline
> instrumentation. I did 3 tests:
> 1. Boot speed, I measured time for a particular message to appear on
> console. Before:
> [ 2.504652] random: crng init done
> [ 2.435861] random: crng init done
> [ 2.537135] random: crng init done
> After:
> [ 7.263402] random: crng init done
> [ 7.263402] random: crng init done
> [ 7.174395] random: crng init done
>
> That's ~3x slowdown.
>
> 2. I've run bench_readv benchmark:
> https://raw.githubusercontent.com/google/sanitizers/master/address-sanitizer/kernel_buildbot/slave/bench_readv.c
> as:
> while true; do time ./bench_readv bench_readv 300000 1; done
>
> Before:
> sys 0m7.299s
> sys 0m7.218s
> sys 0m6.973s
> sys 0m6.892s
> sys 0m7.035s
> sys 0m6.982s
> sys 0m6.921s
> sys 0m6.940s
> sys 0m6.905s
> sys 0m7.006s
>
> After:
> sys 0m8.141s
> sys 0m8.077s
> sys 0m8.067s
> sys 0m8.116s
> sys 0m8.128s
> sys 0m8.115s
> sys 0m8.108s
> sys 0m8.326s
> sys 0m8.529s
> sys 0m8.164s
> sys 0m8.380s
>
> This is ~19% slowdown.
>
> 3. I've run bench_pipes benchmark:
> https://raw.githubusercontent.com/google/sanitizers/master/address-sanitizer/kernel_buildbot/slave/bench_pipes.c
> as:
> while true; do time ./bench_pipes 10 10000 1; done
>
> Before:
> sys 0m5.393s
> sys 0m6.178s
> sys 0m5.909s
> sys 0m6.024s
> sys 0m5.874s
> sys 0m5.737s
> sys 0m5.826s
> sys 0m5.664s
> sys 0m5.758s
> sys 0m5.421s
> sys 0m5.444s
> sys 0m5.479s
> sys 0m5.461s
> sys 0m5.417s
>
> After:
> sys 0m8.718s
> sys 0m8.281s
> sys 0m8.268s
> sys 0m8.334s
> sys 0m8.246s
> sys 0m8.267s
> sys 0m8.265s
> sys 0m8.437s
> sys 0m8.228s
> sys 0m8.312s
> sys 0m8.556s
> sys 0m8.680s
>
> This is ~52% slowdown.
>
>
> This does not look acceptable to me. I would ready to pay for this,
> say, 10% of performance. But it seems that this can have up to 2-4x
> slowdown for some workloads.

I found the reasons of above regression. There are two reasons.

1. In my implementation, original shadow to the memory allocated from
memblock is black shadow so it causes to call kasan_report(). It will
pass the check since per page shadow would be zero shadow but it
causes some overhead.

2. Memory used by stackdepot is in a similar situation with #1. It
allocates page and divide it to many objects. Then, use it like as
object. Although there is "KASAN_SANITIZE_stackdepot.o := n" which try
to disable sanitizer, there is a function call (memcmp() in
find_stack()) to other file and sanitizer work for it.

#1 problem can be fixed but more investigation is needed. I will
respin the series after fixing it.

#2 problem also can be fixed. There are two options here. First, uses
private memcmp() for stackdepot and disable sanitizer for it. I think
that this is a right approach since it slowdown the performance in all
KASAN build cases. And, we don't want to sanitize KASAN itself.
Second, I can provide a function to map the actual shadow manually. It
will reduce the case calling kasan_report().

See the attached patch. It implements later approach on #2 problem.
It would reduce performance regression. I have tested your bench_pipes
test with it and found that performance is restored. However, there is
still remaining problem, #1, so I'm not sure that it completely
restore your regression. Could you check that if possible?

Anyway, I think that respin is needed to fix this performance problem
completely.

>
>
> Your use-case is embed devices where you care a lot about both code
> size and memory consumption, right?

Yes.

> I see 2 possible ways forward:
> 1. Enable this new mode only for outline, but keep current scheme for
> inline. Then outline will be "small but slow" type of configuration.

Performance problem is not that bad in OUTLINE build. Therefore,
this is a reasonable option to have.

> 2. Somehow fix slowness (at least in inline mode).

I will try to fix slowness as much as possible. If slowness cannot be
acceptable after such effort, we can choose the direction at that
moment.

>
> > Mapping zero page to non-kernel memory could cause true-negative
> > problem since we cannot flush the TLB in all cpus. We will read zero
> > shadow value value in this case even if actual shadow value is not
> > zero. This is one of the reason that black page is introduced in this
> > patchset.
>
> What does make your current patch work then?
> Say we map a new shadow page, update the page shadow to say that there
> is mapped shadow. Then another CPU loads the page shadow and then
> loads from the newly mapped shadow. If we don't flush TLB, what makes
> the second CPU see the newly mapped shadow?

There is a fix-up processing to see the newly mapped shadow in other
cpus. check_memory_region_slow() exists for that purpose. In stale TLB
case, we will see black shadow and fall in to this function. In this
function, we flush stale TLB and re-check so we can see correct
result.

Thanks.