Pim van den Berg
2018-04-13 07:38:50 UTC
Hi all,
We (at Mendix) are upgrading our dom0s to Xen 4.10 (PV) running on Debian
Stretch (Linux 4.9), but we are running into an issue regarding live-migration.
We are experiencing domU crashes during live migration and in the seconds after
the live migration has completed. This doesn't happen every time, but we can
reproduce the issue within 1 to at most 10 live migrations between 2 dom0s.
We've reproduced this so far with domUs running Linux 4.9.82-1+deb9u3 (Debian
Stretch) and 4.15.11-1 (Debian Buster).
Attached are the kernel traces and oopses for each crash, as logged by the domUs
and retrieved via "xen console" in the seconds after the live migration
completed. In some cases the domU keeps running or remains visible via "xen
list"; in other cases the domU disappears from "xen list" after a short time.
In most cases the logging on our dom0s looks fine:
Apr 12 16:58:20 altair socat[738]: migration target: Ready to receive domain.
Apr 12 16:58:20 altair socat[738]: Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1250)
Apr 12 16:58:20 altair socat[738]: Savefile contains xl domain config in JSON format
Apr 12 16:58:20 altair socat[738]: Parsing config from <saved>
Apr 12 16:58:20 altair socat[738]: libxl: info: libxl_create.c:109:libxl__domain_build_info_setdefault: qemu-xen is unavailable, using q
Apr 12 16:58:20 altair socat[738]: xc: info: Found x86 PV domain from Xen 4.10
Apr 12 16:58:20 altair socat[738]: xc: info: Restoring domain
Apr 12 16:58:28 altair socat[738]: xc: info: Restore successful
Apr 12 16:58:28 altair socat[738]: xc: info: XenStore: mfn 0xce734b, dom 0, evt 1
Apr 12 16:58:28 altair socat[738]: xc: info: Console: mfn 0xce734c, dom 0, evt 2
... but 1 second later the domU hits a kernel panic (see attachment oops-1.txt).
There are also cases where the dom0 logs a failure, after which the domU disappears:
Apr 12 14:17:55 altair socat[738]: migration target: Ready to receive domain.
Apr 12 14:17:55 altair socat[738]: Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1250)
Apr 12 14:17:55 altair socat[738]: Savefile contains xl domain config in JSON format
Apr 12 14:17:55 altair socat[738]: Parsing config from <saved>
Apr 12 14:17:55 altair socat[738]: libxl: info: libxl_create.c:109:libxl__domain_build_info_setdefault: qemu-xen is unavailable, using qemu-xen-traditional instead: No such file or directory
Apr 12 14:17:55 altair socat[738]: xc: info: Found x86 PV domain from Xen 4.10
Apr 12 14:17:55 altair socat[738]: xc: info: Restoring domain
Apr 12 14:18:00 altair socat[738]: libxl-save-helper: xc_sr_restore_x86_pv.c:7: pfn_to_mfn: Assertion `pfn <= ctx->x86_pv.max_pfn' failed.
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_utils.c:510:libxl_read_exactly: file/stream truncated reading ipc msg header from domain 7 save/restore helper stdout pipe
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_exec.c:129:libxl_report_child_exitstatus: domain 7 save/restore helper [18962] died due to fatal signal Aborted
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_create.c:1264:domcreate_rebuild_done: Domain 7:cannot (re-)build domain: -3
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_domain.c:1000:libxl__destroy_domid: Domain 7:Non-existant domain
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_domain.c:959:domain_destroy_callback: Domain 7:Unable to destroy guest
Apr 12 14:18:00 altair socat[738]: libxl: error: libxl_domain.c:886:domain_destroy_cb: Domain 7:Destruction of domain failed
Apr 12 14:18:00 altair socat[738]: migration target: Domain creation failed (code -3).
Apr 12 14:18:00 altair socat[18950]: E write(5, 0x559e0ffc85c0, 8192): Broken pipe
And in this case the domU was running on the destination dom0, but it crashed
immediately (see attachment oops-2.txt).
Apr 12 14:44:24 rho socat[725]: migration target: Ready to receive domain.
Apr 12 14:44:24 rho socat[725]: Loading new save file <incoming migration stream> (new xl fmt info 0x3/0x0/1250)
Apr 12 14:44:24 rho socat[725]: Savefile contains xl domain config in JSON format
Apr 12 14:44:24 rho socat[725]: Parsing config from <saved>
Apr 12 14:44:24 rho socat[725]: libxl: info: libxl_create.c:109:libxl__domain_build_info_setdefault: qemu-xen is unavailable, using qemu-xen-traditional instead: No such file or directory
Apr 12 14:44:24 rho socat[725]: xc: info: Found x86 PV domain from Xen 4.10
Apr 12 14:44:24 rho socat[725]: xc: info: Restoring domain
Apr 12 14:45:31 rho socat[725]: xc: error: Failed to read Record Header from stream (0 = Success): Internal error
Apr 12 14:45:31 rho socat[725]: xc: error: Restore failed (0 = Success): Internal error
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_stream_read.c:850:libxl__xc_domain_restore_done: restoring domain: Success
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_create.c:1264:domcreate_rebuild_done: Domain 11:cannot (re-)build domain: -3
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_domain.c:1000:libxl__destroy_domid: Domain 11:Non-existant domain
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_domain.c:959:domain_destroy_callback: Domain 11:Unable to destroy guest
Apr 12 14:45:31 rho socat[725]: libxl: error: libxl_domain.c:886:domain_destroy_cb: Domain 11:Destruction of domain failed
Apr 12 14:45:31 rho socat[725]: migration target: Domain creation failed (code -3).
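To make it easier to tally which outcome each migration attempt produced, the three log patterns above could be sorted automatically. The following is only a minimal sketch: the matched strings are copied from the excerpts above, while the function name and the category labels are hypothetical and chosen for illustration.

```python
import re

# Patterns taken from the dom0 log excerpts above.
# The labels on the left are hypothetical names, not Xen terminology.
# Failure patterns are listed first so they take priority over success.
SIGNATURES = [
    ("assertion_failed", re.compile(r"Assertion `.*' failed")),
    ("stream_read_failed", re.compile(r"Failed to read Record Header from stream")),
    ("restore_ok", re.compile(r"xc: info: Restore successful")),
]


def classify_migration(lines):
    """Return the first matching label for one attempt's log lines, else "unknown"."""
    for label, pattern in SIGNATURES:
        if any(pattern.search(line) for line in lines):
            return label
    return "unknown"
```

Note that "restore_ok" only means the toolstack reported success; as described above, the domU can still panic a second later, so the guest console output has to be checked separately.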
We have been running Xen 4.4 on Debian Jessie (Linux 3.16.51-3+deb8u1) on the
same hardware flawlessly for years.
Has anyone had similar experiences with Xen 4.10? How can we help debug
and find the cause of these issues?
Thanks!
--
Pim van den Berg