[Xen-users] Xen 4.10: domU crashes during/after live-migrate
Pim van den Berg
2018-04-13 07:38:50 UTC
Hi all,

We (at Mendix) are upgrading our dom0s to Xen 4.10 (PV) running on Debian
Stretch (Linux 4.9), but we are running into an issue regarding live-migration.

We are experiencing domU crashes during live migration and in the seconds after
the live migration has completed. This doesn't happen every time, but we are
able to reproduce the issue within 1 to at most 10 live migrations between
2 dom0s.
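
As an illustration, a reproduction loop along these lines (hostnames and the
domU name are made up; it assumes plain ssh-based "xl migrate" rather than the
socat transport visible in the dom0 logs below) bounces the guest between the
two dom0s until something breaks:

#!/bin/sh
# Minimal reproduction loop: bounce one PV domU between two dom0s until a
# migration fails or the guest disappears. "testvm", "dom0-a" and "dom0-b"
# are placeholders for illustration.
DOMU=testvm
SRC=dom0-a
DST=dom0-b

n=0
while :; do
    n=$((n + 1))
    echo "migration #$n: $SRC -> $DST"
    ssh "root@$SRC" xl migrate "$DOMU" "$DST" || break
    # After migrating, check that the guest still exists on the target.
    ssh "root@$DST" xl list "$DOMU" >/dev/null || break
    # Swap source and destination for the next round.
    TMP=$SRC; SRC=$DST; DST=$TMP
done
echo "stopped after migration #$n; check xl list and the guest console on $DST"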

We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3 (Debian
Stretch) and 4.15.11-1 (Debian Buster).

Attached are the kernel traces/oopses for each crash, as logged by the domUs
and retrieved via "xen console" in the seconds after the live migration
completed. In some cases the domU keeps running or remains visible via "xen
list"; in other cases the domU disappears from "xen list" after a short amount
of time.
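
For reference, with the standard xl toolstack this kind of evidence can be
collected roughly like this (the domU name is just a placeholder):

# Attach to the guest's PV console and keep a copy of any oops/panic output.
xl console testvm | tee /var/log/testvm-console.log

# In another terminal, watch whether the domain stays visible after migration.
watch -n1 'xl list testvm'

# Hypervisor-level messages (e.g. about crashed domains) show up in:
xl dmesg | tail -n 50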

From the logging in our dom0s, in most cases everything looks fine:

Apr 12 16:58:20 altair socat[738]: migration target: Ready to receive domain.
Apr 12 16:58:20 altair socat[738]: Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1250)
Apr 12 16:58:20 altair socat[738]: Savefile contains xl domain config in JSON format
Apr 12 16:58:20 altair socat[738]: Parsing config from <saved>
Apr 12 16:58:20 altair socat[738]: libxl: info: libxl_create.c:109:libxl__domain_build_info_setdefault: qemu-xen is unavailable, using qemu-xen-traditional instead: No such file or directory
Apr 12 16:58:20 altair socat[738]: xc: info: Found x86 PV domain from Xen 4.10
Apr 12 16:58:20 altair socat[738]: xc: info: Restoring domain
Apr 12 16:58:28 altair socat[738]: xc: info: Restore successful
Apr 12 16:58:28 altair socat[738]: xc: info: XenStore: mfn 0xce734b, dom 0, evt 1
Apr 12 16:58:28 altair socat[738]: xc: info: Console: mfn 0xce734c, dom 0, evt 2

.. but 1 second later the domU gets a kernel panic (see attachment oops-1.txt).
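
The socat lines in this log suggest that the migration stream is carried over
TCP instead of plain ssh. Purely as a sketch, one way to wire that up (the port
and the exact invocation are guesses, not necessarily the setup used here)
would be:

# On the destination dom0: accept a migration stream on TCP and hand it
# to "xl migrate-receive" (port 8002 is only an example).
socat TCP-LISTEN:8002,reuseaddr,fork EXEC:"xl migrate-receive"

# On the source dom0: with an empty -s argument, xl runs the "host" argument
# as the transport command instead of "ssh <host> xl migrate-receive".
xl migrate -s "" testvm "socat - TCP:altair:8002"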

There are also cases where the dom0 logs a failure, after which the domU disappears:

Apr 12 14:17:55 altair socat[738]: migration target: Ready to receive domain.
Apr 12 14:17:55 altair socat[738]: Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1250)
Apr 12 14:17:55 altair socat[738]: Savefile contains xl domain config in JSON format
Apr 12 14:17:55 altair socat[738]: Parsing config from <saved>
Apr 12 14:17:55 altair socat[738]: libxl: info: libxl_create.c:109:libxl__domain_build_info_setdefault: qemu-xen is unavailable, using qemu-xen-traditional instead: No such file or directory
Apr 12 14:17:55 altair socat[738]: xc: info: Found x86 PV domain from Xen 4.10
Apr 12 14:17:55 altair socat[738]: xc: info: Restoring domain
Apr 12 14:18:00 altair socat[738]: libxl-save-helper: xc_sr_restore_x86_pv.c:7: pfn_to_mfn: Assertion `pfn <= ctx->x86_pv.max_pfn' failed.
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_utils.c:510:libxl_read_exactly: file/stream truncated reading ipc msg header from domain 7 save/restore helper stdout pipe
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_exec.c:129:libxl_report_child_exitstatus: domain 7 save/restore helper [18962] died due to fatal signal Aborted
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_create.c:1264:domcreate_rebuild_done: Domain 7:cannot (re-)build domain: -3
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_domain.c:1000:libxl__destroy_domid: Domain 7:Non-existant domain
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_domain.c:959:domain_destroy_callback: Domain 7:Unable to destroy guest
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_domain.c:886:domain_destroy_cb: Domain 7:Destruction of domain failed
Apr 12 14:18:00 altair socat[738]: migration target: Domain creation failed (code -3).
Apr 12 14:18:00 altair socat[18950]: E write(5, 0x559e0ffc85c0, 8192): Broken pipe

And in this case the domU was running on the destination dom0, but it crashed
immediately (see attachment oops-2.txt).

Apr 12 14:44:24 rho socat[725]: migration target: Ready to receive domain.
Apr 12 14:44:24 rho socat[725]: Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1250)
Apr 12 14:44:24 rho socat[725]: Savefile contains xl domain config in JSON format
Apr 12 14:44:24 rho socat[725]: Parsing config from <saved>
Apr 12 14:44:24 rho socat[725]: libxl: info: libxl_create.c:109:libxl__domain_build_info_setdefault: qemu-xen is unavailable, using qemu-xen-traditional instead: No such file or directory
Apr 12 14:44:24 rho socat[725]: xc: info: Found x86 PV domain from Xen 4.10
Apr 12 14:44:24 rho socat[725]: xc: info: Restoring domain
Apr 12 14:45:31 rho socat[725]: xc: error: Failed to read Record Header from stream (0 = Success): Internal error
Apr 12 14:45:31 rho socat[725]: xc: error: Restore failed (0 = Success): Internal error
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_stream_read.c:850:libxl__xc_domain_restore_done: restoring domain: Success
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_create.c:1264:domcreate_rebuild_done: Domain 11:cannot (re-)build domain: -3
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_domain.c:1000:libxl__destroy_domid: Domain 11:Non-existant domain
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_domain.c:959:domain_destroy_callback: Domain 11:Unable to destroy guest
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_domain.c:886:domain_destroy_cb: Domain 11:Destruction of domain failed
Apr 12 14:45:31 rho socat[725]: migration target: Domain creation failed (code -3).

We have been running Xen 4.4 on Debian Jessie (Linux 3.16.51-3+deb8u1) on the
same hardware flawlessly for years.

Does anyone have similar experiences with Xen 4.10? How can we help with
debugging and finding the cause of these issues?

Thanks!
--
Pim van den Berg
Hans van Kranenburg
2018-09-04 15:41:03 UTC
Hi,
Post by Pim van den Berg
Hi all,
We (at Mendix) are upgrading our dom0s to Xen 4.10 (PV) running on Debian
Stretch (Linux 4.9), but we are running into an issue regarding live-migration.
We are experiencing domU crashes during live migration and in the seconds after
the live migration has completed. This doesn't happen every time, but we are
able to reproduce the issue within 1 to at most 10 live migrations between
2 dom0s.
We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3 (Debian
Stretch) and 4.15.11-1 (Debian Buster).
[...]
So... flash forward *whoosh*:

For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
dom0 as well as domU) if you want to use live migration, or maybe even
in general together with Xen.

A few of the things I could cause to happen with recent Linux 4.9 in
dom0/domU:

1) blk-mq related Oops

Oops in the domU while resuming after live migrate (blkfront_resume ->
blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
blk_mq_insert_requests). A related fix might be
https://patchwork.kernel.org/patch/9462771/ but that's only present in
later kernels.

Apparently having this happen upsets the dom0 side of it, since any
subsequent domU that is live migrated to the same dom0 and also uses
blk-mq will immediately crash with the same Oops, after which it starts
raining general protection faults inside the guest. But, at the same time, I
can still live migrate 3.16 kernels, and also 4.17 domU kernels, on and off
that dom0.

2) Dom0 crash on live migration with multiple active nics

I actually have to do more testing specifically for this, but at least
I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
tested a few months ago, Debian Jessie) by live migrating to it a domU that
has multiple network interfaces and is actively routing traffic over them.
*poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
- not rebooting.' *BOOM*, everything gone.

3) xenconsoled disappearing

When live migration errors occur, the xenconsoled process in dom0 regularly
just disappears. I have no idea why. There's no segfault message in dmesg or
anything; it's just gone.

These are just examples. There are more errors that I ran into and still
have to re-test. If someone is interested in more details, I have a
collection of errors, stack traces, etc.

What did I end up with now?

* Xen 4.11 (latest stable-4.11)
* Linux 4.17.17 in (Debian Stretch) dom0 and in (Stretch, Buster) domUs
* Linux 3.16.57 for old Jessie domUs is not a problem.

In a small test environment, I just completed about 2000 random live-migration
movements of ~20 domUs across 6 dom0s, with 10 running concurrently, without
anything bad happening. To generate at least some extra load, I was
continuously running puppet on them, while the puppet masters were also in
the domU mix.
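
Roughly sketched, the stress test looked like the following (host and domU
names are invented, the "where does this domU currently live" bookkeeping is
reduced to a hypothetical helper, and it assumes ssh-based "xl migrate"):

#!/bin/bash
# Randomly live migrate domUs between dom0s, keeping up to 10 migrations
# in flight. Names and bookkeeping are simplified for illustration.
DOM0S=(xen-a xen-b xen-c xen-d xen-e xen-f)
DOMUS=(web1 web2 db1 db2 puppet1 puppet2)   # ~20 domUs in the real test

for i in $(seq 1 2000); do
    domu=${DOMUS[RANDOM % ${#DOMUS[@]}]}
    dst=${DOM0S[RANDOM % ${#DOM0S[@]}]}
    src=$(./where-is "$domu")          # hypothetical helper: current dom0
    [ "$src" = "$dst" ] && continue
    ssh "root@$src" xl migrate "$domu" "$dst" &
    # Allow at most 10 migrations to run concurrently.
    while [ "$(jobs -r | wc -l)" -ge 10 ]; do wait -n; done
done
wait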

With 4.9 anywhere, it would only take a few minutes for everything to
explode.

To be continued...

Hans
Sarah Newman
2018-09-12 18:55:21 UTC
Post by Hans van Kranenburg
Post by Pim van den Berg
We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3 (Debian
Stretch) and 4.15.11-1 (Debian Buster).
[...]
For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
dom0 as well as domU) if you want to use live migration, or maybe even
in general together with Xen.
A few of the things I could cause to happen with recent Linux 4.9 in
1) blk-mq related Oops
Oops in the domU while resuming after live migrate (blkfront_resume ->
blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
blk_mq_insert_requests). A related fix might be
https://patchwork.kernel.org/patch/9462771/ but that's only present in
later kernels.
Apparently having this happen upsets the dom0 side of it, since any
subsequent domU that is live migrated to the same dom0, also using
blk-mq will immediately crash with the same Oops, after which it starts
raining general protection faults inside. But, at the same time, I can
still live migrate 3.16 kernels, but also 4.17 domU kernels on and off
that dom0.
Do you see any errors at all on the dom0?
You said you tested with both 4.9 and 4.15 kernels; does this depend only on a 4.9 kernel in the domU?
Post by Hans van Kranenburg
2) Dom0 crash on live migration with multiple active nics
I actually have to do more testing for specifically this, but at least
I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
tested a few months ago, Debian Jessie) by live migrating a domU that
has multiple network interfaces, actively routing traffic over them, to
it. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
- not rebooting.' *BOOM* everything gone.
Can you post a full backtrace? Did you ever test with anything other than 4.9 kernel + 4.4 hypervisor?
What does "actively routing traffic" mean in terms of packet frequency, and did you test when there was
no network traffic but the interface was up?
A quick test with a 4.9 kernel + Xen 4.8, but without terribly heavy network traffic, did not reproduce this.

--Sarah
Sarah Newman
2018-09-12 20:44:20 UTC
Post by Sarah Newman
Post by Hans van Kranenburg
Post by Pim van den Berg
We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3 (Debian
Stretch) and 4.15.11-1 (Debian Buster).
[...]
For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
dom0 as well as domU) if you want to use live migration, or maybe even
in general together with Xen.
A few of the things I could cause to happen with recent Linux 4.9 in
1) blk-mq related Oops
Oops in the domU while resuming after live migrate (blkfront_resume ->
blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
blk_mq_insert_requests). A related fix might be
https://patchwork.kernel.org/patch/9462771/ but that's only present in
later kernels.
Apparently having this happen upsets the dom0 side of it, since any
subsequent domU that is live migrated to the same dom0, also using
blk-mq will immediately crash with the same Oops, after which is starts
raining general protection faults inside. But, at the same time, I can
still live migrate 3.16 kernels, but also 4.17 domU kernels on and off
that dom0.
Do you see any errors at all on the dom0?
Nope.
What is your storage stack?
Post by Sarah Newman
You said you tested with both 4.9 and 4.15 kernels, does this depend only on a 4.9 kernel in the domU?
I don't know for sure (about 4.15 and if it has the mentioned patch or
not). We (exploratory style) tested a few combinations of things some
time ago, when 4.15 was in stretch-backports. At the end of the day the
results were so unpredictable that we put doing testing in a more
structured way on the todo-list (6-dimensional matrix of possibilities
D: ). What I did recently is again just randomly trying things for a few
hours, and then I started to see the pattern that whenever 4.9 was in
the mix anywhere, bad things happened. Doing the reverse, eliminating
4.9 in dom0 as well as domU resulted in not being able to reproduce
anything bad any more.
So, very pragmatic. :)
So to rephrase, you don't know if you saw failures with a 4.15 domU and a 4.9 dom0?

The mentioned patch is d1b1cea1e58477dad88ff769f54c0d2dfa56d923 and was added in 4.10. I assume you think it should be added to 4.9? Why do you think
it is related?
Post by Sarah Newman
Post by Hans van Kranenburg
2) Dom0 crash on live migration with multiple active nics
I actually have to do more testing for specifically this, but at least
I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
tested a few months ago, Debian Jessie) by live migrating a domU that
has multiple network interfaces, actively routing traffic over them, to
it. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
- not rebooting.' *BOOM* everything gone.
Can you post a full backtrace? Did you ever test with anything other than 4.9 kernel + 4.4 hypervisor?
Did not re-test yet.
Ah, I found my notes. It's a bit different. When just doing live
migrate, it would upset the bnx2x driver or network card itself and I
would lose network connectivity to the machine (and all other domUs).
See attached bnx2x-crash.txt for console output while the poor thing is
drowning and gasping for air.
When disabling SR-IOV (which I do not use, but which was listed
somewhere as a workaround for a similar problem, related to HP Shared
Memory blah, so why not try it to see what happens) in the BIOS for the
10G card and then trying the same, the dom0 crashed immediately when the
live migrated domU was resumed. See dom0-crash.txt. No trace or anything;
it just disappears.
This shared memory is an HP only thing, right? I think I saw some recommendations to the reverse, to disable shared memory and enable SR-IOV.
Post by Sarah Newman
What does "actively routing traffic" mean in terms of packet frequency, and did you test when there was
no network traffic but the interface was up?
A linux domU doing NAT with 1 external and 6 internal interfaces, having
a conntrack table with ~20k entries of active traffic flows. However,
not doing many pps and not using much bandwidth (between 0 and 100 Mbit/s).
Without any traffic it doesn't explode immediately. I think I could live
migrate the inactive router of a stateful (conntrackd) pair.
Post by Sarah Newman
A quick test with a 4.9 kernel + xen 4.8 but not terribly heavy network traffic did not duplicate this.
I'll get around to reproducing this (or not being able to with Xen 4.11+
Linux 4.17+ with maybe newer bnx2x).
Currently the network infra related domUs are still on Jessie (Xen 4.4
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
And while speaking of that, we've not seen this happen again with 4.17+
in the dom0, and same openvswitch and Xen 4.11 version.
Have you ever rebuilt your kernel with options such as DEBUG_PAGEALLOC? I found some errors almost immediately with one of our network drivers after
doing so.
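
For reference, that roughly means something like this (the exact option names
depend on the kernel version):

# Build the dom0/domU kernel with:
#   CONFIG_DEBUG_PAGEALLOC=y
# and either set CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y or boot with:
#   debug_pagealloc=on
# Use-after-free style bugs in drivers then fault immediately instead of
# silently corrupting memory. Check whether a running kernel has it:
grep DEBUG_PAGEALLOC /boot/config-$(uname -r)
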
Hans van Kranenburg
2018-09-12 22:12:08 UTC
Hi,

(my previous reply was eaten by the list, maybe because it was too big with
the attachments, or maybe because it was posted from the wrong email address,
but the text is included below:)
Post by Sarah Newman
Post by Sarah Newman
Post by Hans van Kranenburg
Post by Pim van den Berg
We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3 (Debian
Stretch) and 4.15.11-1 (Debian Buster).
[...]
For Debian users, it seems best to avoid the Debian 4.9 LTS Linux (for
dom0 as well as domU) if you want to use live migration, or maybe even
in general together with Xen.
A few of the things I could cause to happen with recent Linux 4.9 in
1) blk-mq related Oops
Oops in the domU while resuming after live migrate (blkfront_resume ->
blk_mq_update_nr_hw_queues -> blk_mq_queue_reinit ->
blk_mq_insert_requests). A related fix might be
https://patchwork.kernel.org/patch/9462771/ but that's only present in
later kernels.
Apparently having this happen upsets the dom0 side of it, since any
subsequent domU that is live migrated to the same dom0, also using
blk-mq will immediately crash with the same Oops, after which it starts
raining general protection faults inside. But, at the same time, I can
still live migrate 3.16 kernels, but also 4.17 domU kernels on and off
that dom0.
Do you see any errors at all on the dom0?
Nope.
What is your storage stack?
iSCSI ----> dm_multipath -> dm_crypt --,
iSCSI --'                               \---> LVM
                                        /
iSCSI ----> dm_multipath -> dm_crypt --'
iSCSI --'

An LVM logical volume is the block device for e.g. a domU xvda.
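
So each guest disk ends up in the domU config as a plain phy:-backed block
device, roughly like this (VG and LV names are made up):

# Fragment of an xl domU config: one LVM logical volume from the stack
# above, exported to the guest as xvda.
disk = [ 'phy:/dev/vg_guests/testvm-disk,xvda,w' ]
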
Post by Sarah Newman
Post by Sarah Newman
You said you tested with both 4.9 and 4.15 kernels, does this depend only on a 4.9 kernel in the domU?
I don't know for sure (about 4.15 and if it has the mentioned patch or
not). We (exploratory style) tested a few combinations of things some
time ago, when 4.15 was in stretch-backports. At the end of the day the
results were so unpredictable that we put doing testing in a more
structured way on the todo-list (6-dimensional matrix of possibilities
D: ). What I did recently is again just randomly trying things for a few
hours, and then I started to see the pattern that whenever 4.9 was in
the mix anywhere, bad things happened. Doing the reverse, eliminating
4.9 in dom0 as well as domU resulted in not being able to reproduce
anything bad any more.
So, very pragmatic. :)
So to rephrase you don't know if you saw failures with a 4.15 domU and a 4.9 dom0?
Correct, I don't have notes about that, so I can't say for sure.
Post by Sarah Newman
The mentioned patch is d1b1cea1e58477dad88ff769f54c0d2dfa56d923 and was added in 4.10. I assume you think it should be added to 4.9? Why do you think
it is related?
I'm not an expert here. What happens feels like some sort of race
condition or wrong order of doing things, where a function runs before
something it depends on is there yet.

I do not think the mentioned patch is the fix; it is not a good match
for the behaviour shown here. I meant that the actual fix is probably of a
similar kind, related to doing IO while onlining/offlining a CPU and setting
up queues etc., just like what this one is about...
Post by Sarah Newman
Post by Sarah Newman
Post by Hans van Kranenburg
2) Dom0 crash on live migration with multiple active nics
I actually have to do more testing for specifically this, but at least
I'm able to reliably crash a 4.9 Linux dom0 running on Xen 4.4 (last
tested a few months ago, Debian Jessie) by live migrating a domU that
has multiple network interfaces, actively routing traffic over them, to
it. *poof*, hypervisor reporting '(XEN) Domain 0 crashed: 'noreboot' set
- not rebooting.' *BOOM* everything gone.
Can you post a full backtrace? Did you ever test with anything other than 4.9 kernel + 4.4 hypervisor?
Did not re-test yet.
Ah, I found my notes. It's a bit different. When just doing live
migrate, it would upset the bnx2x driver or network card itself and I
would lose network connectivity to the machine (and all other domUs).
See attached bnx2x-crash.txt for console output while the poor thing is
drowning and gasping for air.
When disabling SR-IOV (which I do not use, but which was listed
somewhere as a workaround for a similar problem, related to HP Shared
Memory blah, so why not try it to see what happens) in the BIOS for the
10G card and then trying the same, the dom0 crashed immediately when the
live migrated domU was resumed. See dom0-crash.txt. No trace or anything;
it just disappears.
This shared memory is an HP only thing, right?
I think so yes.
Post by Sarah Newman
I think I saw some recommendations to the reverse, to disable shared memory and enable SR-IOV.
Post by Sarah Newman
What does "actively routing traffic" mean in terms of packet frequency, and did you test when there was
no network traffic but the interface was up?
A linux domU doing NAT with 1 external and 6 internal interfaces, having
a conntrack table with ~20k entries of active traffic flows. However,
not doing many pps and not using much bandwidth (between 0 and 100 Mbit/s).
Without any traffic it doesn't explode immediately. I think I could live
migrate the inactive router of a stateful (conntrackd) pair.
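
To make that a bit more concrete, the guest is essentially a plain Linux NAT
box, roughly along these lines, plus a way to watch the conntrack table size
(interface names are placeholders):

# Forwarding + NAT on the routing domU (eth0 = external, eth1..eth6 = internal).
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# Number of flows currently tracked (the ~20k mentioned above):
cat /proc/sys/net/netfilter/nf_conntrack_count
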
Post by Sarah Newman
A quick test with a 4.9 kernel + xen 4.8 but not terribly heavy network traffic did not duplicate this.
I'll get around to reproducing this (or not being able to with Xen 4.11+
Linux 4.17+ with maybe newer bnx2x).
Currently the network infra related domUs are still on Jessie (Xen 4.4
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
And while speaking of that, we've not seen this happen again with 4.17+
in the dom0, and same openvswitch and Xen 4.11 version.
Have you ever rebuilt your kernel with options such as DEBUG_PAGEALLOC? I found some errors almost immediately with one of our network drivers after
doing so.
No, thanks for the hint.

Right now the top of the todo list is to reinstall some HP DL360 gen8 as
well as gen9 machines with the latest BIOS + Stretch/Linux 4.17+ dom0 + Xen
4.11, and then start testing different scenarios to see if it's as stable as
the same setup on the g7 and whether I can still reproduce things like the above.

Hans
