[Performance test]
We measured the number of TSC cycles consumed by the ioctl()s that get
dirty logs in the kernel.
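As a rough illustration of this kind of measurement, the following is a
minimal userspace sketch that reads the TSC before and after a region of
code. It is not the instrumentation used for the numbers below, which were
taken inside the kernel. The names tsc_cycles(), fill_span() and struct span
are hypothetical, and __rdtsc() is assumed to come from GCC/Clang's
x86intrin.h on an x86 host.

```c
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>	/* __rdtsc(); GCC/Clang on x86 only */

/* Hypothetical argument type for the example workload below. */
struct span {
	unsigned char *buf;
	size_t len;
};

/* Hypothetical workload standing in for the code being measured. */
static void fill_span(void *arg)
{
	struct span *s = arg;

	memset(s->buf, 0xff, s->len);
}

/*
 * Return the TSC delta across one call of fn(arg).  No serializing
 * instruction is used, so very short regions are only approximate.
 */
static uint64_t tsc_cycles(void (*fn)(void *), void *arg)
{
	uint64_t start = __rdtsc();

	fn(arg);
	return __rdtsc() - start;
}
```

A caller would fill in a struct span and pass it as
tsc_cycles(fill_span, &s). Repeating the call and looking at a stable run of
results, as in the tables below, helps to filter out noise such as
interrupts.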
Test environment
AMD Phenom(tm) 9850 Quad-Core Processor with 8GB memory
1. GUI test (running Ubuntu guest in graphical mode)
sudo qemu-system-x86_64 -hda dirtylog_test.img -boot c -m 4192 -net ...
We show a relatively stable part of the measurements to compare how much
time the basic parts of the dirty log ioctl need.
                            get.org    get.opt switch.opt
slots[7].len=32768           278379      66398      64024
slots[8].len=32768           181246        270        160
slots[7].len=32768           263961      64673      64494
slots[8].len=32768           181655        265        160
slots[7].len=32768           263736      64701      64610
slots[8].len=32768           182785        267        160
slots[7].len=32768           260925      65360      65042
slots[8].len=32768           182579        264        160
slots[7].len=32768           267823      65915      65682
slots[8].len=32768           186350        271        160
At a glance, both get.opt and switch.opt are significantly faster than the
original get dirty log ioctl: slots[8] drops from over 180,000 cycles to
under 300, and slots[7] from over 260,000 to about 65,000. This has a really
big impact on personal KVM users who run KVM in GUI mode on their usual PCs.
Next, we notice that switch.opt is a hundred nanoseconds or so faster than
get.opt for these slots. Although this may sound like a tiny improvement, it
can be felt as a difference in GUI responsiveness, such as mouse reactions.
To feel the difference, please try a GUI on your PC with our patch series!
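To put the cycle counts above on a time scale, here is a small sketch of the
arithmetic. Both helper names are hypothetical. The 2500 MHz rate used in
the note after the code is an assumption based on the nominal clock of the
Phenom 9850; the actual TSC rate of the test machine was not recorded here.

```c
#include <stdint.h>

/*
 * Convert a TSC cycle count to nanoseconds, assuming the TSC ticks at
 * tsc_mhz MHz (an assumption; the real rate must be checked per host).
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t tsc_mhz)
{
	return cycles * 1000 / tsc_mhz;
}

/* How many times faster "after" is than "before"; after must be non-zero. */
static double speedup(uint64_t before, uint64_t after)
{
	return (double)before / (double)after;
}
```

With the first slots[8] row of the GUI table, speedup(181246, 270) is about
671, and cycles_to_ns(270 - 160, 2500) gives 44 under the assumed rate.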
2. Live-migration test (4GB guest, write loop with 1GB buf)
We also did a live-migration test.
                            get.org    get.opt switch.opt
slots[0].len=655360          797383     261144     222181
slots[1].len=3757047808     2186721    1965244    1842824
slots[2].len=637534208      1433562    1012723    1031213
slots[3].len=131072          216858        331        331
slots[4].len=131072          121635        225        164
slots[5].len=131072          120863        356        164
slots[6].len=16777216        121746       1133        156
slots[7].len=32768           120415        230        278
slots[8].len=32768           120368        216        149

slots[0].len=655360          806497     194710     223582
slots[1].len=3757047808     2142922    1878025    1895369
slots[2].len=637534208      1386512    1021309    1000345
slots[3].len=131072          221118        459        296
slots[4].len=131072          121516        272        166
slots[5].len=131072          122652        244        173
slots[6].len=16777216        123226      99185        149
slots[7].len=32768           121803        457        505
slots[8].len=32768           121586        216        155

slots[0].len=655360          766113     211317     213179
slots[1].len=3757047808     2155662    1974790    1842361
slots[2].len=637534208      1481411    1020004    1031352
slots[3].len=131072          223100        351        295
slots[4].len=131072          122982        436        164
slots[5].len=131072          122100        300        503
slots[6].len=16777216        123653        779        151
slots[7].len=32768           122617        284        157
slots[8].len=32768           122737        253        149
For slots other than 0, 1 and 2 we can see a similar improvement.
For slots 0, 1 and 2, note that switch.opt does not depend on the bitmap
length except in kvm_mmu_slot_remove_write_access(), so that function
accounts for the usec to msec time consumed there, possibly including some
context switches.
But note that this test was done with a workload that dirtied memory
endlessly during the live migration.
In a usual workload, the number of dirty pages varies a lot from iteration
to iteration, and we should gain a lot in relatively clean cases.