[Xen-users] Xen IO performance issues
marki
2018-09-14 11:04:34 UTC
Hi,

We're having trouble with a dd "benchmark". Even though that probably
doesn't mean much, since multiple concurrent jobs using a benchmark
like FIO work OK, I'd like to understand where the bottleneck is and
why this behaves differently.

In ESXi it looks like the following and speed is high (iostat output
below):

(kernel 4.4)
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-5              0.00     0.00    0.00  512.00     0.00   512.00
2048.00   142.66  272.65    0.00  272.65   1.95 100.00
sdb               0.00     0.00    0.00  512.00     0.00   512.00
2048.00   141.71  270.89    0.00  270.89   1.95 100.00

# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 9.70912 s, 844 MB/s

Now in a Xen DomU running kernel 4.4 it looks like the following and
speed is low / not what we're used to:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00     0.00    0.00  100.00     0.00    99.00
2027.52     1.45   14.56    0.00   14.56  10.00 100.00
xvdb              0.00     0.00    0.00 2388.00     0.00    99.44
85.28    11.74    4.92    0.00    4.92   0.42  99.20

# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
1376059392 bytes (1.4 GB, 1.3 GiB) copied, 7.09965 s, 194 MB/s

Note the low queue depth on the LVM device and additionally the low
request size on the virtual disk.
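To put numbers on that (avgrq-sz is in 512-byte sectors): 85.28 sectors
* 512 bytes is roughly 44 kB per request on xvdb, while the ESXi guest
shows 2048 sectors * 512 bytes = 1 MB per request.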

(As in the ESXi VM, there's an LVM layer inside the DomU, but it
doesn't matter whether it's there or not.)
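
(In case anyone wants to reproduce this without page-cache effects in
the domU skewing things, the usual dd variants should apply here as
well; untested on this particular box, but both flags are standard GNU
dd:

# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000 oflag=direct
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000 conv=fdatasync

The first bypasses the domU page cache entirely, the second includes
the final flush in the timing.)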



Inside Dom0 it looks like this:

This is the VHD:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-13             0.00     0.00    0.00 2638.00     0.00   105.72
82.08    11.67    4.42    0.00    4.42   0.36  94.00

This is the SAN:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00  2423.00    0.00  216.00     0.00   105.71
1002.26     0.95    4.39    0.00    4.39   4.35  94.00

And these are the individual paths on the SAN (multipathing):

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdg               0.00     0.00    0.00  108.00     0.00    53.09
1006.67     0.50    4.63    0.00    4.63   4.59  49.60
sdl               0.00     0.00    0.00  108.00     0.00    52.62
997.85     0.44    4.04    0.00    4.04   4.04  43.60



The above applies to HV + HVPVM modes using kernel 4.4 in the DomU.

The following applies to an HV or PV DomU running kernel 3.12:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-1              0.00     0.00   41.00 7013.00     0.73   301.16
87.65   142.78   20.44    5.17   20.53   0.14 100.00
xvdb              0.00     0.00   41.00 7023.00     0.73   301.59
87.65   141.80   20.27    5.17   20.36   0.14 100.00

(Which is better but still not great.)



Any explanations on this one?

Thanks

marki
marki
2018-09-19 19:19:11 UTC
Post by marki
Hi,
We're having trouble with a dd "benchmark". Even though that probably
doesn't mean much since multiple concurrent jobs using a benchmark
like FIO for
example work ok, I'd like to understand where the bottleneck is / why
this behaves differently.
In ESXi it looks like the following and speed is high: (iostat output
below)
(kernel 4.4)
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-5              0.00     0.00    0.00  512.00     0.00   512.00 
2048.00   142.66  272.65    0.00  272.65   1.95 100.00
sdb               0.00     0.00    0.00  512.00     0.00   512.00 
2048.00   141.71  270.89    0.00  270.89   1.95 100.00
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 9.70912 s, 844 MB/s
Now in a Xen DomU running kernel 4.4 it looks like the following and
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00     0.00    0.00  100.00     0.00    99.00 
2027.52     1.45   14.56    0.00   14.56  10.00 100.00
xvdb              0.00     0.00    0.00 2388.00     0.00    99.44   
85.28    11.74    4.92    0.00    4.92   0.42  99.20
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
1376059392 bytes (1.4 GB, 1.3 GiB) copied, 7.09965 s, 194 MB/s
Note the low queue depth on the LVM device and additionally the low
request size on the virtual disk.
(As in the ESXi VM there's an LVM layer inside the DomU but it doesn't
matter whether it's there or not.)
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-13             0.00     0.00    0.00 2638.00     0.00   105.72   
82.08    11.67    4.42    0.00    4.42   0.36  94.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00  2423.00    0.00  216.00     0.00   105.71 
1002.26     0.95    4.39    0.00    4.39   4.35  94.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdg               0.00     0.00    0.00  108.00     0.00    53.09 
1006.67     0.50    4.63    0.00    4.63   4.59  49.60
sdl               0.00     0.00    0.00  108.00     0.00    52.62  
997.85     0.44    4.04    0.00    4.04   4.04  43.60
The above applies to HV + HVPVM modes using kernel 4.4 in the DomU.
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-1              0.00     0.00   41.00 7013.00     0.73   301.16   
87.65   142.78   20.44    5.17   20.53   0.14 100.00
xvdb              0.00     0.00   41.00 7023.00     0.73   301.59   
87.65   141.80   20.27    5.17   20.36   0.14 100.00
(Which is better but still not great.)
Any explanations on this one?
If you figure it out let us know, it's been on my todo list to work on
for a bit now.
--Sarah
Hey,

Well, it's the stupid ring protocol between domU and dom0: each request
can only carry 11 segments of 4 kB each. This gives a maximum request
size of 88 sectors (0.5 kB each) = 44 kB.
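
If anyone wants to verify that cap on their own domU, it should be
visible in sysfs (xvdb is the device from my setup, adjust as needed;
the max_indirect_segments file only exists on kernels with indirect
descriptor support):

# cat /sys/block/xvdb/queue/max_hw_sectors_kb
# cat /sys/block/xvdb/queue/max_sectors_kb
# cat /sys/module/xen_blkfront/parameters/max_indirect_segments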

What's clear is that for modern storage like SSD arrays or NVMe disks,
this simply won't cut it and Xen is a no-go...

I'd love it if someone could tell me something different and/or how to
optimize.

What still remains to be answered is the additional issue with low queue
size (avgqu-sz).
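
For comparison, a single fio job with a deep queue (libaio + direct
I/O; the filename and size below are just placeholders from my setup)
should show whether the low avgqu-sz is specific to dd's buffered
writes:

# fio --name=seqwrite --filename=/u01/fio-test-file --size=4g \
      --rw=write --bs=32k --ioengine=libaio --direct=1 --iodepth=32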

From your response I guess this may need to go to the Dev list instead
of here (since no one seems to have a clue about obvious
questions/benchmarks).
I wonder what kind of workloads people run on Xen. Can't be much =D

Bye,
Marki
Hans van Kranenburg
2018-09-19 19:43:38 UTC
Hi,
Post by marki
Post by marki
Hi,
We're having trouble with a dd "benchmark". Even though that probably
doesn't mean much since multiple concurrent jobs using a benchmark
like FIO for
example work ok, I'd like to understand where the bottleneck is / why
this behaves differently.
In ESXi it looks like the following and speed is high: (iostat output
below)
(kernel 4.4)
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-5              0.00     0.00    0.00  512.00     0.00   512.00 
2048.00   142.66  272.65    0.00  272.65   1.95 100.00
sdb               0.00     0.00    0.00  512.00     0.00   512.00 
2048.00   141.71  270.89    0.00  270.89   1.95 100.00
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 9.70912 s, 844 MB/s
Now in a Xen DomU running kernel 4.4 it looks like the following and
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00     0.00    0.00  100.00     0.00    99.00 
2027.52     1.45   14.56    0.00   14.56  10.00 100.00
xvdb              0.00     0.00    0.00 2388.00     0.00    99.44   
85.28    11.74    4.92    0.00    4.92   0.42  99.20
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
1376059392 bytes (1.4 GB, 1.3 GiB) copied, 7.09965 s, 194 MB/s
Interesting.

* Which Xen version are you using?
* Which Linux kernel version is being used in the dom0?
* Is this a PV, HVM or PVH guest?
* ...more details you can share?
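For example, something like the following would already tell a lot
(assuming plain xl tooling is available in the dom0; on XenServer the
xe CLI may be the primary interface, but the idea is the same):

dom0# xl info | grep -E 'xen_version|xen_caps'
dom0# uname -r
domU# uname -r
domU# cat /sys/hypervisor/type
domU# dmesg | grep -i xen | head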
Post by marki
Post by marki
Note the low queue depth on the LVM device and additionally the low
request size on the virtual disk.
(As in the ESXi VM there's an LVM layer inside the DomU but it
doesn't matter whether it's there or not.)
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-13             0.00     0.00    0.00 2638.00     0.00   105.72   
82.08    11.67    4.42    0.00    4.42   0.36  94.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00  2423.00    0.00  216.00     0.00   105.71 
1002.26     0.95    4.39    0.00    4.39   4.35  94.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdg               0.00     0.00    0.00  108.00     0.00    53.09 
1006.67     0.50    4.63    0.00    4.63   4.59  49.60
sdl               0.00     0.00    0.00  108.00     0.00    52.62  
997.85     0.44    4.04    0.00    4.04   4.04  43.60
The above applies to HV + HVPVM modes using kernel 4.4 in the DomU.
Do you mean PV and PVHVM, instead?
Post by marki
Post by marki
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-1              0.00     0.00   41.00 7013.00     0.73   301.16   
87.65   142.78   20.44    5.17   20.53   0.14 100.00
xvdb              0.00     0.00   41.00 7023.00     0.73   301.59   
87.65   141.80   20.27    5.17   20.36   0.14 100.00
(Which is better but still not great.)
Any explanations on this one?
What happens when you use a recent linux kernel in the guest, like 4.18?

Do things like using blk-mq make a difference here (just guessing around)?
Post by marki
If you figure it out let us know, it's been on my todo list to work on
for a bit now.
--Sarah
Hey,
Well, it's the stupid ring buffer with 11 slots with 4 kB each between
domU and dom0. This gives a maximum request size of 88 sectors (0,5 kB
each) = 44 kB.
What's clear is that for modern storage like SSD arrays or NVMe disks,
this simply won't cut it and Xen is a no-go...
I'd love if someone could tell me something different and/or how to
optimize.
What still remains to be answered is the additional issue with low queue
size (avgqu-sz).
From your response I guess this may need to go to the Dev list instead
of here (since noone seems to have a clue about obvious
questions/benchmarks).
I wonder what kind of workloads people run on Xen. Can't be much =D
These kinds of remarks do not really help much if your goal is to
motivate other people to think about these things together with you,
get a better understanding, and maybe find out things that can help all
of us.

Thanks,
Hans

P.S. https://en.wikipedia.org/wiki/Warnock%27s_dilemma
Charles Gonçalves
2018-09-20 05:07:06 UTC
I work with the TPCx-V benchmark on Xen, and even using PV I've noticed
that the overall benchmark performance was drastically lower than with
the same setup on KVM.

I did not dig further into this since it was not relevant for my work,
but the reason could be the same...

On Sep 19, 2018 16:46, "Hans van Kranenburg" <***@knorrie.org> wrote:

Hi,
Post by marki
Post by marki
Hi,
We're having trouble with a dd "benchmark". Even though that probably
doesn't mean much since multiple concurrent jobs using a benchmark
like FIO for
example work ok, I'd like to understand where the bottleneck is / why
this behaves differently.
In ESXi it looks like the following and speed is high: (iostat output
below)
(kernel 4.4)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-5 0.00 0.00 0.00 512.00 0.00 512.00
2048.00 142.66 272.65 0.00 272.65 1.95 100.00
sdb 0.00 0.00 0.00 512.00 0.00 512.00
2048.00 141.71 270.89 0.00 270.89 1.95 100.00
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
8192000000 bytes (8.2 GB, 7.6 GiB) copied, 9.70912 s, 844 MB/s
Now in a Xen DomU running kernel 4.4 it looks like the following and
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-0 0.00 0.00 0.00 100.00 0.00 99.00
2027.52 1.45 14.56 0.00 14.56 10.00 100.00
xvdb 0.00 0.00 0.00 2388.00 0.00 99.44
85.28 11.74 4.92 0.00 4.92 0.42 99.20
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
1376059392 bytes (1.4 GB, 1.3 GiB) copied, 7.09965 s, 194 MB/s
Interesting.

* Which Xen version are you using?
* Which Linux kernel version is being used in the dom0?
* Is this a PV, HVM or PVH guest?
* ...more details you can share?
Post by marki
Post by marki
Note the low queue depth on the LVM device and additionally the low
request size on the virtual disk.
(As in the ESXi VM there's an LVM layer inside the DomU but it
doesn't matter whether it's there or not.)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-13 0.00 0.00 0.00 2638.00 0.00 105.72
82.08 11.67 4.42 0.00 4.42 0.36 94.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-0 0.00 2423.00 0.00 216.00 0.00 105.71
1002.26 0.95 4.39 0.00 4.39 4.35 94.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 0.00 0.00 0.00 108.00 0.00 53.09
1006.67 0.50 4.63 0.00 4.63 4.59 49.60
sdl 0.00 0.00 0.00 108.00 0.00 52.62
997.85 0.44 4.04 0.00 4.04 4.04 43.60
The above applies to HV + HVPVM modes using kernel 4.4 in the DomU.
Do you mean PV and PVHVM, instead?
Post by marki
Post by marki
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s
avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-1 0.00 0.00 41.00 7013.00 0.73 301.16
87.65 142.78 20.44 5.17 20.53 0.14 100.00
xvdb 0.00 0.00 41.00 7023.00 0.73 301.59
87.65 141.80 20.27 5.17 20.36 0.14 100.00
(Which is better but still not great.)
Any explanations on this one?
What happens when you use a recent linux kernel in the guest, like 4.18?

Do things like using blk-mq make a difference here (just guessing around)?
Post by marki
If you figure it out let us know, it's been on my todo list to work on
for a bit now.
--Sarah
Hey,
Well, it's the stupid ring buffer with 11 slots with 4 kB each between
domU and dom0. This gives a maximum request size of 88 sectors (0,5 kB
each) = 44 kB.
What's clear is that for modern storage like SSD arrays or NVMe disks,
this simply won't cut it and Xen is a no-go...
I'd love if someone could tell me something different and/or how to
optimize.
What still remains to be answered is the additional issue with low queue
size (avgqu-sz).
From your response I guess this may need to go to the Dev list instead
of here (since noone seems to have a clue about obvious
questions/benchmarks).
I wonder what kind of workloads people run on Xen. Can't be much =D
These kind of remarks do not really help much if your goal would be to
motivate other people to think about these things together with you, get
better understanding and maybe find out things that can help all of us.

Thanks,
Hans

P.S. https://en.wikipedia.org/wiki/Warnock%27s_dilemma
marki
2018-09-20 09:49:57 UTC
Hello,
Post by Hans van Kranenburg
Post by marki
Hi,
We're having trouble with a dd "benchmark". Even though that probably
doesn't mean much since multiple concurrent jobs using a benchmark
like FIO for
example work ok, I'd like to understand where the bottleneck is / why
this behaves differently.
Now in a Xen DomU running kernel 4.4 it looks like the following and
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00     0.00    0.00  100.00     0.00    99.00 
2027.52     1.45   14.56    0.00   14.56  10.00 100.00
xvdb              0.00     0.00    0.00 2388.00     0.00    99.44   
85.28    11.74    4.92    0.00    4.92   0.42  99.20
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
1376059392 bytes (1.4 GB, 1.3 GiB) copied, 7.09965 s, 194 MB/s
Interesting.
* Which Xen version are you using?
That particular version was XenServer 7.1 LTSR (Citrix). We also tried
the newer current release, 7.6; it makes no difference.
Before you start screaming: XS eval licenses do not include any
support, so we can't ask them. People in the Citrix discussion forums
are nice but don't seem to know the details necessary to solve this.
Post by Hans van Kranenburg
* Which Linux kernel version is being used in the dom0?
In 7.1 it is "4.4.0+2".
In 7.6 that would be "4.4.0+10".
Post by Hans van Kranenburg
* Is this a PV, HVM or PVH guest?
In any case, blkfront (and thus blkback) was being used, which seems to
transfer data via that ring structure I mentioned and which explains
the small request size, albeit not necessarily the low queue depth.
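(A quick sanity check that blkfront really is in the path, in case it
helps anyone trying to reproduce this:

# dmesg | grep -i blkfront
# lsmod | grep xen_blkfront

The lsmod line only shows something if blkfront is built as a module;
on newer kernels the dmesg line should also mention whether indirect
descriptors are enabled.)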
Post by Hans van Kranenburg
* ...more details you can share?
Well, not much more, except that we are talking about SUSE Linux
Enterprise Server 12 up to SP3 in the DomU here. We also tried RHEL 7.5
and the result (slow single-threaded writes) was the same. Reads are
not blazingly fast either, BTW.
Post by Hans van Kranenburg
Post by marki
Note the low queue depth on the LVM device and additionally the low
request size on the virtual disk.
(As in the ESXi VM there's an LVM layer inside the DomU but it
doesn't matter whether it's there or not.)
The above applies to HV + HVPVM modes using kernel 4.4 in the DomU.
Do you mean PV and PVHVM, instead?
Oops, yes; in any case blkfront (and thus blkback) was being used.
Post by Hans van Kranenburg
What happens when you use a recent linux kernel in the guest, like 4.18?
I'd have to get back to you on that. However, as long as blkback stays
the same, I'm not sure what would happen.
In any case we'd want to stick with the OSes that the XS people
support; I'll have to find out if there are some with more recent
kernels than SLES or RHEL.
Post by Hans van Kranenburg
Do things like using blk-mq make a difference here (just guessing around)?
Honestly I'd have to find out first what that is. I'll check that out
and will get back to you.

Best regards,
Marki
marki
2018-09-20 12:35:02 UTC
Post by marki
Hello,
Post by Hans van Kranenburg
Do things like using blk-mq make a difference here (just guessing around)?
Honestly I'd have to find out first what that is. I'll check that out
and will get back to you.
Even though something seems to exist, it doesn't look like it's active
by default:

# cat /sys/block/xvda/mq/0/active
0

There also does not seem to be a parameter that would allow enabling it:

# l /sys/module/xen_blkfront/parameters/
total 0
drwxr-xr-x 2 root root 0 Sep 20 13:28 ./
drwxr-xr-x 7 root root 0 Sep 20 13:28 ../
-r--r--r-- 1 root root 4096 Sep 20 13:28 max_indirect_segments
-r--r--r-- 1 root root 4096 Sep 20 13:28 max_ring_page_order

All other layers have it enabled alright:

# cat
/sys/devices/pci0000:00/0000:00:01.1/ata1/host0/scsi_host/host0/use_blk_mq
1
# cat
/sys/devices/pci0000:00/0000:00:01.1/ata2/host1/scsi_host/host1/use_blk_mq
1
# cat /sys/module/scsi_mod/parameters/use_blk_mq
Y
# cat /sys/module/dm_mod/parameters/use_blk_mq
Y

I have tried guessing and setting xen_blkfront.use_blk_mq=1 in the
kernel parameters. No change.

# cat /sys/block/xvda/mq/0/active
0

I now also tried an Ubuntu 18 DomU (kernel 4.15). Makes no difference,
except that iostat now shows the request size in kilobytes (44) instead
of sectors (88).


BR,
Marki
Juergen Gross
2018-09-28 08:46:48 UTC
Post by marki
Hello,
Post by Hans van Kranenburg
Post by marki
Hi,
We're having trouble with a dd "benchmark". Even though that probably
doesn't mean much since multiple concurrent jobs using a benchmark
like FIO for
example work ok, I'd like to understand where the bottleneck is / why
this behaves differently.
Now in a Xen DomU running kernel 4.4 it looks like the following and
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00     0.00    0.00  100.00     0.00    99.00 
2027.52     1.45   14.56    0.00   14.56  10.00 100.00
xvdb              0.00     0.00    0.00 2388.00     0.00    99.44   
85.28    11.74    4.92    0.00    4.92   0.42  99.20
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
1376059392 bytes (1.4 GB, 1.3 GiB) copied, 7.09965 s, 194 MB/s
Interesting.
* Which Xen version are you using?
That particular version was XenServer 7.1 LTSR (Citrix). We also tried
the newer current release 7.6, makes no difference.
XS eval licenses do not contain any support so we can't ask them.
People in Citrix discussion forums are nice but don't seem to know
details necessary to solve this.
Post by Hans van Kranenburg
* Which Linux kernel version is being used in the dom0?
In 7.1 it is "4.4.0+2".
In 7.6 that would be "4.4.0+10".
Post by Hans van Kranenburg
* Is this a PV, HVM or PVH guest?
In any case blkfront (and thus blkback) were being used (which seems to
transfer data by that ring structure I mentioned and which explains the
small block size albeit not necessarily the low queue depth).
Post by Hans van Kranenburg
* ...more details you can share?
Well, not much more except that we are talking about Suse Enterprise
Linux 12 up to SP3 in the DomU here. We also tried RHEL 7.5 and the
result (slow single-threaded writes) was the same. Reads are not
blazingly fast either BTW.
Post by Hans van Kranenburg
Post by marki
Note the low queue depth on the LVM device and additionally the low
request size on the virtual disk.
(As in the ESXi VM there's an LVM layer inside the DomU but it
doesn't matter whether it's there or not.)
The above applies to HV + HVPVM modes using kernel 4.4 in the DomU.
Do you mean PV and PVHVM, instead?
Oups yes, in any case blkfront (and thus blkback) were being used.
Post by Hans van Kranenburg
What happens when you use a recent linux kernel in the guest, like 4.18?
I'd have to get back to you on that. However, as long as blkback stays
the same I'm not sure what would happen.
In any case we'd want to stick with the OSes that the XS people support,
I'll have to find out if there are some with more recent kernels than
SLES or RHEL.
I have just done a small test for other purposes that required doing
reads in a domU using blkfront/blkback. The data was cached in dom0, so
the only limiting factors were CPU/memory speed and the block ring
interface of Xen. I was able to transfer 1.8 GB/s on a laptop with a
dual-core i7-4600M CPU @ 2.90GHz.
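
(Roughly the idea, not the exact commands: read the disk image once in
dom0 so it sits in the dom0 page cache, then read the virtual disk
sequentially from the domU, e.g.

dom0# dd if=/path/to/domU-disk.img of=/dev/null bs=1M
domU# dd if=/dev/xvdb of=/dev/null bs=1M

where the image path and xvdb obviously depend on the setup.)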

So I don't think the ring buffer interface is a real issue here.

Kernels (in domU and dom0) are 4.19-rc5, Xen is 4.12-unstable.

Using a standard SLE12-SP2 domU (kernel 4.4.121) with the same dom0
as in the test before returned the same result.


Juergen
Hans van Kranenburg
2018-09-28 13:35:16 UTC
Post by Juergen Gross
Post by marki
Hello,
Post by Hans van Kranenburg
Post by marki
Hi,
We're having trouble with a dd "benchmark". Even though that probably
doesn't mean much since multiple concurrent jobs using a benckmark
like FIO for
example work ok, I'd like to understand where the bottleneck is / why
this behaves differently.
Now in a Xen DomU running kernel 4.4 it looks like the following and
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-0              0.00     0.00    0.00  100.00     0.00    99.00 
2027.52     1.45   14.56    0.00   14.56  10.00 100.00
xvdb              0.00     0.00    0.00 2388.00     0.00    99.44   
85.28    11.74    4.92    0.00    4.92   0.42  99.20
# dd if=/dev/zero of=/u01/dd-test-file bs=32k count=250000
1376059392 bytes (1.4 GB, 1.3 GiB) copied, 7.09965 s, 194 MB/s
Interesting.
* Which Xen version are you using?
That particular version was XenServer 7.1 LTSR (Citrix). We also tried
the newer current release 7.6, makes no difference.
XS eval licenses do not contain any support so we can't ask them.
People in Citrix discussion forums are nice but don't seem to know
details necessary to solve this.
Post by Hans van Kranenburg
* Which Linux kernel version is being used in the dom0?
In 7.1 it is "4.4.0+2".
In 7.6 that would be "4.4.0+10".
Post by Hans van Kranenburg
* Is this a PV, HVM or PVH guest?
In any case blkfront (and thus blkback) were being used (which seems to
transfer data by that ring structure I mentioned and which explains the
small block size albeit not necessarily the low queue depth).
Post by Hans van Kranenburg
* ...more details you can share?
Well, not much more except that we are talking about Suse Enterprise
Linux 12 up to SP3 in the DomU here. We also tried RHEL 7.5 and the
result (slow single-threaded writes) was the same. Reads are not
blazingly fast either BTW.
Post by Hans van Kranenburg
Post by marki
Note the low queue depth on the LVM device and additionally the low
request size on the virtual disk.
(As in the ESXi VM there's an LVM layer inside the DomU but it
doesn't matter whether it's there or not.)
The above applies to HV + HVPVM modes using kernel 4.4 in the DomU.
Do you mean PV and PVHVM, instead?
Oups yes, in any case blkfront (and thus blkback) were being used.
Post by Hans van Kranenburg
What happens when you use a recent linux kernel in the guest, like 4.18?
I'd have to get back to you on that. However, as long as blkback stays
the same I'm not sure what would happen.
In any case we'd want to stick with the OSes that the XS people support,
I'll have to find out if there are some with more recent kernels than
SLES or RHEL.
I have just done a small test for other purposes requiring to do reads
in a domU using blkfront/blkback. The data was cached in dom0, so the
only limiting factor was cpu/memory speed and the block ring interface
of Xen. I was able to transfer 1.8 GB/s on a laptop with a dual core
So I don't think the ring buffer interface is a real issue here.
Kernels (in domU and dom0) are 4.19-rc5, Xen is 4.12-unstable.
Using a standard SLE12-SP2 domU (kernel 4.4.121) with the same dom0
as in the test before returned the same result.
We also did some testing here, with Xen 4.11 and with Linux 4.17 in dom0
and domU.

Interesting background about optimizations in the past (which the OP
might or might not have in their Xen/Linux):

1) Indirect descriptors

https://blog.xenproject.org/2013/08/07/indirect-descriptors-for-xen-pv-disks/

In Linux, this is commit 402b27f9f2c22309d5bb285628765bc27b82fcf5; the
option got renamed to max_indirect_segments in commit
14e710fe7897e37762512d336ab081c57de579a4.
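
For anyone who wants to experiment with this: since it is a
xen-blkfront module parameter, it can be set on the domU kernel command
line (or via modprobe configuration if blkfront is built as a module),
for example:

xen_blkfront.max_indirect_segments=64

The effective value shows up afterwards in
/sys/module/xen_blkfront/parameters/max_indirect_segments.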

2) Multi-queue support

https://lwn.net/Articles/633391/

We did a mixed random read / random write test with fio: null_blk in
the dom0, and fio in libaio / direct mode directly on the block device
in the domU, so that we're only stressing the data path between dom0
and domU.

Example command:
fio --filename=/dev/xvdc --direct=1 --rw=randrw --ioengine=libaio
--bs=128k --numjobs=8 --iodepth=4 --runtime=20 --group_reporting
--name=max_indirect_segments-$(cat
/sys/module/xen_blkfront/parameters/max_indirect_segments)

# grep . /sys/module/xen_blkfront/parameters/*
/sys/module/xen_blkfront/parameters/max_indirect_segments:32
/sys/module/xen_blkfront/parameters/max_queues:4
/sys/module/xen_blkfront/parameters/max_ring_page_order:0

max_indirect_segments-32: (groupid=0, jobs=8): err= 0: pid=1756: Fri Sep
28 14:55:47 2018
read : io=96071MB, bw=4803.3MB/s, iops=38426, runt= 20001msec
write: io=96232MB, bw=4811.4MB/s, iops=38490, runt= 20001msec

Combined that's almost 10 GB/s...

We tried changing the max_indirect_segments xen_blkfront option from
the default 32 to 64, 128, etc. Each time we ran the test above with
bs=4k, bs=8k, bs=16k, etc...

The outcome of the test is that upping the number for
max_indirect_segments does not change anything, and that the limiting
factor for the test is cpu in the domU (4 vcpu here).

That's interesting by itself, since perf top shows that most of the time
is spent doing xen_hypercall_xen_version... (Why??)

(random sample of live output):

Samples: 2M of event 'cpu-clock', 4000 Hz, Event count (approx.):
44091475561
Overhead Shared Object Symbol
54.26% [kernel] [k] xen_hypercall_xen_version
8.58% [kernel] [k] xen_hypercall_sched_op
3.17% [unknown] [.] 0x00007f2c75959717
2.57% [unknown] [.] 0x00007f2c759596ca
1.21% [kernel] [k] blk_queue_split
1.01% [unknown] [.] 0x000056097df10836
0.60% [kernel] [k] kmem_cache_alloc
0.57% [kernel] [k] do_io_submit

Adding more vcpus doesn't help, by the way; with 8 vcpus the domU
becomes a bit unresponsive, and I only get around 3371.4 MB/s in/out of
it.

When using some real disk instead of null_blk, numbers are of course a
lot lower, but yes, it seems the communication between blkback and
blkfront is not really the limiting factor here.

Well, a real-life workload is of course different from a test... I'm
thinking about trying out a different value for max_indirect_segments
in some places in production for a few days and seeing if there's any
difference, e.g. whether it helps to do more parallel IO when there's
much higher latency involved for small random reads.

Hans
