Knowledge: Disk Performance Hints & Tips


Use I/O Threads for best disk performance

By default, all I/O requests are handled in a single event loop within QEMU's main thread. Virtual servers with more than one virtual CPU running I/O-intensive workloads can therefore experience lock contention, leading to noticeable performance impacts.
Using separate threads for I/O event handling can significantly improve the throughput of virtual disks. I/O threads must be allocated explicitly, and disks must be associated with them. The allocation of I/O threads is requested with the iothreads tag in the libvirt domain XML.
E.g.
   <domain>
      <iothreads>2</iothreads>
      ...
   </domain>
will allocate 2 I/O threads for the QEMU process, and
   <devices>
      <disk type='block' device='disk'>
         <driver name='qemu' type='raw' iothread='2'/>
  ...
      </disk>
      ...
   </devices>
will assign the disk to I/O thread number 2.
Note that the gain in I/O performance comes at the cost of CPU consumption. Therefore, the number of I/O threads and their distribution across the virtual disks need to be chosen carefully; a combined sketch follows the rules of thumb below.
Rules of thumb:
  • The number of I/O threads should not exceed the number of host CPUs.
  • Over-provisioning of I/O threads should be avoided: A good starting point would be to have one I/O thread for every two to three virtual disks.
  • Even a single I/O thread will instantly improve the overall I/O performance compared to the default behavior and should therefore always be configured.
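As an illustrative sketch of such a distribution (the number of threads and disks below is arbitrary, not taken from a real configuration), two I/O threads could be shared among several virtual disks like this:
   <domain>
      <iothreads>2</iothreads>
      ...
      <devices>
         <!-- the first disks share I/O thread 1 -->
         <disk type='block' device='disk'>
            <driver name='qemu' type='raw' iothread='1'/>
            ...
         </disk>
         <disk type='block' device='disk'>
            <driver name='qemu' type='raw' iothread='1'/>
            ...
         </disk>
         <!-- further disks share I/O thread 2 -->
         <disk type='block' device='disk'>
            <driver name='qemu' type='raw' iothread='2'/>
            ...
         </disk>
         ...
      </devices>
   </domain>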

Choosing the right AIO mode

In order to achieve the best possible throughput, QEMU performs disk I/O operations asynchronously, either
  • through a pool of userspace threads (not to be confused with I/O threads), or
  • by means of Linux kernel AIO (Asynchronous I/O).
By default, the userspace method is used, which is supposed to work in all environments. However, it is not as efficient as kernel AIO.
If the virtual disks are backed by block devices, raw file images or pre-allocated QCOW2 images, it is recommended to use kernel AIO, which can be enabled using the following libvirt XML snippet:
   <devices>
      <disk [...]>
         <driver name='qemu' type='raw' io='native' cache='none'/>
         ...
      </disk>
      ...
   </devices>
Note the cache='none' attribute, which should always be specified together with io='native' to prevent QEMU from falling back to userspace AIO. It might also be necessary to increase the system limit for asynchronous I/O requests, see this article.
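To our understanding, the limit in question is the kernel's fs.aio-max-nr tunable (an assumption, since the linked article is not reproduced here). It can be inspected and raised like this; the value 1048576 is merely a common example:
   # Show the system-wide limit for in-flight kernel AIO requests
   cat /proc/sys/fs/aio-max-nr
   # Show how many AIO requests are currently reserved
   cat /proc/sys/fs/aio-nr

   # Raise the limit at runtime (example value) ...
   sysctl -w fs.aio-max-nr=1048576
   # ... and make the change persistent (file name is an example)
   echo "fs.aio-max-nr = 1048576" > /etc/sysctl.d/99-aio.conf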
When space-efficient image files are used (QCOW2 without pre-allocation, or sparse raw images), the default of io='threads' may be better suited. This is because writing to sectors that have not been allocated yet may temporarily block the virtual CPU and thus decrease I/O performance.
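Alternatively, if io='native' is preferred, the image can be created pre-allocated in the first place, for example with qemu-img (path and size are placeholders):
   # Fully pre-allocated QCOW2 image: all clusters are allocated at creation
   # time, so guest writes do not stall on allocation.
   qemu-img create -f qcow2 -o preallocation=full /var/lib/libvirt/images/guest.qcow2 50G

   # Cheaper alternative: pre-allocate only the QCOW2 metadata and
   # leave the data clusters sparse.
   qemu-img create -f qcow2 -o preallocation=metadata /var/lib/libvirt/images/guest.qcow2 50G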

13 comments:

  1. Hello Stefan,

    I noticed that my read numbers are better when running a microbenchmark, say fio. But the write numbers aren't that great. Do you have any recommendations for boosting write numbers?

    thanks.

    Replies
    1. Hi,
      Are you using QCOW images? These are known to be slow on first write as the underlying sectors are allocated on demand. Writes that do not demand additional allocations should perform fine.
      One way to work around this is to use the "preallocation" option (see the qemu-img man page) when creating the image.

  2. Hi Stefan,

    Thanks for this tut! I'm wondering if it's better to use iothreads or io=native? I'm using an SSD and a raw file as the VM's disk.

  3. Note that these options are not mutually exclusive; you can use both at the same time! In fact, we recommend using io=native, cache=none and iothread together. However, there is no general guidance on how many iothreads to define - too many will hurt, so you might need to experiment a bit for optimum performance.
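     Putting the pieces together, the driver line of a disk could then look like this (the I/O thread number is just an example):
        <disk type='block' device='disk'>
           <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
           ...
        </disk>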

  4. Hi Stefan. Do you still maintain this blog? I'm desperate for some support with qemu disk types - I'd love your help if you could spare me 5 minutes!

  5. Yes, the blog is still active - feel free to post any questions!

  6. When I try to use I/O threads, I get the error: unsupported configuration: IOThreads not available for bus scsi target vda

    rpm -qa | grep qemu
    qemu-kvm-common-ev-2.12.0-44.1.el7_8.1.x86_64
    qemu-kvm-ev-2.12.0-44.1.el7_8.1.x86_64

    Replies
    1. You need to define the IOThreads for the adapter, not for the disks! See https://www.ibm.com/support/knowledgecenter/linuxonibm/com.ibm.linux.z.ldva/ldva_t_configuringFCPDevices.html
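       For a virtio-scsi configuration, that means putting the iothread attribute on the controller's driver element rather than on the disk; a minimal sketch (controller index and thread number are illustrative):
          <controller type='scsi' index='0' model='virtio-scsi'>
             <driver iothread='1'/>
          </controller>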

  7. Is there a way to also define the cache size in XML, as in the following qemu arguments?
       -device virtio-blk-pci,drive=disk0,iothread=io1 \
       -drive if=none,id=disk0,cache=none,aio=native,file=xrv9k-fullk9-x.vrr-7.2.1.qcow2,cache-size=16M \

  8. What I understood is that the cache-size option helps to improve performance, and I have tested that in qemu. The difficulty is to find the equivalent XML configuration.

    Replies
    1. Hi guan,
      We don't have any experience with that setting. There is metadata_cache, which you might want to check out - but, again, that's not something we can endorse, for lack of experience.
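      For reference, newer libvirt versions document a metadata_cache sub-element of the disk driver element; a rough sketch (the size mirrors the 16M from your command line - please verify the exact syntax against the libvirt documentation for your version):
         <driver name='qemu' type='qcow2'>
            <metadata_cache>
               <max_size unit='bytes'>16777216</max_size>
            </metadata_cache>
         </driver>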

  9. Hello, I'd like to use native I/O with a sparse raw image if possible. However, you're right that a VM hits massive latency when the OS needs to expand the file. Is there any way to run virt-sparsify --in-place file.raw in a way that adds some amount of padding in the process? This would be the best of all worlds - the low overhead of raw/direct I/O, plus typically avoiding waits for file expansion. Thanks for a great article.

    Replies
    1. We don't know of such a feature. However, choosing padding small enough not to require too much additional storage, yet large enough to accommodate future expansion, would certainly be a challenge.
