[Xen-users] Debugging sudden hangs

Eric Duncan

2018-08-19 13:09:55 UTC

Hi list,
We recently updated our system and started experiencing random
hangs. It happens, on average, once every 1.5 days (sometimes taking 2
days to occur, other times happening multiple times a day, somewhat
proportional to IO load).
Before troubling the developers too much, I'd like to collect more
information. However, the problem is that the hangs occur without any
log output, even with xen and the kernel booted with
"loglvl=all guest_loglvl=all" and "loglevel=10 debug initcall_debug"
respectively.
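
For reference, the two quoted strings above are boot options: the first pair is read by the Xen hypervisor, the second set by the Linux kernel. A minimal sketch of where they could be set, assuming GRUB 2 is the boot loader (the usual Debian setup; the post does not say which boot loader is in use):

```shell
# /etc/default/grub (fragment). Assumption: GRUB 2 boots Xen on this host.
# After editing, run update-grub and reboot for the options to take effect.

# Xen hypervisor options: log all hypervisor and guest messages.
GRUB_CMDLINE_XEN_DEFAULT="loglvl=all guest_loglvl=all"

# Linux kernel options, copied from the post above.
GRUB_CMDLINE_LINUX_DEFAULT="loglevel=10 debug initcall_debug"
```
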
When the hang occurs, all domUs and dom0 just stop responding to
key presses, networking and there is no IO activity. Nothing gets
generated in the console/logs (no symptoms either, no logs out of the
ordinary). Even hitting ctrl+a multiple times in the console does
nothing (indicating xen is dead too). On the video console, we just
have a blinking cursor after the last console log (though my
understanding is that the cursor blink might be generated by the video
card rather than any indication that at least something is still
running). If the hardware WDT is on, the watchdog eventually bites and
reboots the system.
Although I believe it isn't related (since dom0 stalls too, and
we're looking at a completely stalled system rather than just domUs
having issues with disk IO), I added "gnttab_max_frames=256" to the
xen boot arguments anyway. Didn't seem to change anything.
Then, grasping at straws, I turned off HWPM in the BIOS, which we had
to do on another machine hosting VMware ESX. That didn't seem to change
anything either.
At this point, I'd like to know what is the best way to approach
this? Can I enable further levels of debugging so that I can even
begin to look towards a certain culprit? Is there a good way to
determine if it may be the hardware?
I've tried running the same kernel without xen and just simulating
heavy IO on the disk array without issues, which makes me lean towards
xen being part of the equation. But then again, doing random file
read/writes isn't a good simulation of the type of workload the domUs
put on the server.
OS: Debian Buster
Kernel: 4.17.0-1-amd64
Xen: 4.8.4-pre (Debian 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9)
CPU: Xeon E5-2699 v4
RAM: Samsung 96GB ECC Registered
MB: Supermicro X10SRi-F
In case it is relevant, since it might be IO related...
Net: Chelsio T520-CR (2 x XGB links, shared to domU using VF)
RAID: LSI SAS3224 with 10 SAS3 drives
Warm regards,
Liwei
_______________________________________________
Xen-users mailing list
https://lists.xenproject.org/mailman/listinfo/xen-users

In my experience, as a non-Xen user on a nearly identical motherboard
(X10SRA), I would suspect the motherboard.

I've purchased 4 of these boards and run various Windows and Linux
kernels. They all have different CPUs (some Retail, some Engineering
Samples), different ECC ram and different storage setups (some using
onboard SATA, some using LSI cards, etc).

They all, every single one of them, experience random hard-lockups just
like you describe: becomes completely unresponsive, screen freezes, etc.

I don't run Xen on any of them. I've swapped all sorts of hardware, tried
several beta BIOS versions from support, RMA'd 3 of them... They all
continued to lock up.

This went on for about two years until I had enough. I swapped all boards
out for the X10DLA, using the exact same components, and I have had zero
issues since.

Again, this is just one user's experience - and I just happened to be on
the Xen mailing list and saw this.

Konrad Eisele

2018-08-20 06:17:20 UTC

Overheated systems also exhibit this kind of behavior. I experienced
it once with an EPYC motherboard that was crammed into a 1U case.
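
One way to check the overheating theory is to log temperatures periodically, so that the last reading before a freeze survives on disk. A minimal sketch, assuming lm-sensors (`sensors`) and/or `ipmitool` are installed; `log_temps` and the log path in the usage comment are hypothetical names, not anything from this thread:

```shell
# log_temps FILE: append a timestamped temperature snapshot to FILE.
# Tools that are not installed are skipped silently.
log_temps() {
    out="$1"
    {
        date -u '+%Y-%m-%dT%H:%M:%SZ'
        if command -v sensors >/dev/null 2>&1; then
            sensors || true
        fi
        if command -v ipmitool >/dev/null 2>&1; then
            ipmitool sdr type Temperature || true
        fi
    } >> "$out" 2>&1
    # Flush to disk so the snapshot is not lost if the machine hangs next.
    sync
}

# Usage (hypothetical path), one snapshot per minute:
#   while true; do log_temps /var/log/temps.log; sleep 60; done
```

If the last snapshots before each hang show no upward trend, overheating becomes a less likely explanation.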

Roger Pau Monné

2018-08-20 09:03:06 UTC

You should add iommu=debug to the command line.

Post by Liwei

When the hang occurs, all domUs and dom0 just stop responding to
key presses, networking and there is no IO activity. Nothing gets
generated in the console/logs (no symptoms either, no logs out of the
ordinary). Even hitting ctrl+a multiple times in the console does
nothing (indicating xen is dead too). On the video console, we just
have a blinking cursor after the last console log (though my
understanding is that the cursor blink might be generated by the video
card rather than any indication that at least something is still
running). If the hardware WDT is on, the watchdog eventually bites and
reboots the system.

It would be interesting to get the crash trace printed by the watchdog.
And to use a debug build of the hypervisor, that might trigger some
assertions inside of Xen that could lead to the cause of the issue.

Roger.
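
The suggestions above could be combined on the Xen command line roughly as follows. This is a sketch under several assumptions: GRUB 2 is the boot loader; the watchdog Roger refers to is Xen's own NMI watchdog (enabled with `watchdog`) rather than the hardware WDT; and a serial port, or IPMI serial-over-LAN, is reachable as com1 at 115200 baud so that output can be captured from another machine. The `XEN_*` helper variables are hypothetical names used only for readability.

```shell
# /etc/default/grub (fragment). All values below are assumptions to adapt.

# Logging options already in use in the original post.
XEN_LOGGING="loglvl=all guest_loglvl=all"

# Extra IOMMU diagnostics plus Xen's NMI watchdog.
XEN_DEBUG="iommu=debug watchdog"

# Serial console, written synchronously so the last lines before a hang
# are not lost in a buffer. The port and baud rate are assumed.
XEN_CONSOLE="com1=115200,8n1 console=com1,vga sync_console"

# noreboot keeps Xen from rebooting by itself after a crash, so the trace
# stays on the console. The hardware WDT may still reset the machine.
GRUB_CMDLINE_XEN_DEFAULT="$XEN_LOGGING $XEN_DEBUG $XEN_CONSOLE noreboot"
```

A debug build of the hypervisor is a separate step (rebuilding Xen with debugging enabled) and is not covered by these options.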

Liwei

2018-11-07 17:31:22 UTC

Post by Roger Pau Monné

You should add iommu=debug to the command line.

Hi Roger, list, I've been trying to find time to look into this but
other work has been keeping me away ever since I found an ugly (and
definitely unsafe) workaround.

Downgrading all the way to 4.2.5 actually fixed the problem, or maybe
it just stopped exercising the offending hardware (if it is a hardware issue).
It might be possible that newer versions of xen will work, but we've
been okay with this for now since the server is not world-facing.

However, 4.2.5 is obviously way behind on fixes for the headline (and
other) security issues of the past few years. I'll get around to
isolating the
cause of the sudden hangs in December, probably with your suggestions
and a bisect run.

Just sending this email to let the list know of a dangerous workaround
in case anyone has to use it.
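
A bisect run like the one mentioned above could be started along these lines. This is only a sketch: `start_bisect` is a hypothetical helper, and the repository path and tag names in the usage comment are assumptions (the thread does not say which source tree or revisions would be compared).

```shell
# start_bisect REPO BAD GOOD: begin a git bisect in REPO between a revision
# known to hang (BAD) and one known to be stable (GOOD).
start_bisect() {
    git -C "$1" bisect start "$2" "$3"
}

# Usage (assumed clone location and tag names):
#   start_bisect ./xen RELEASE-4.8.3 RELEASE-4.2.5
#
# Then build and boot the revision git checks out, let it soak, and report:
#   git -C ./xen bisect good    # survived the soak period
#   git -C ./xen bisect bad     # hung
#   git -C ./xen bisect skip    # does not build or boot
```

Since the hangs show up about once every 1.5 days on average, each revision would need to run for noticeably longer than that before it can be marked good with any confidence, so a full bisect could take weeks.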