Discussion:
[Xen-users] Dom0 crashes without logging lately on Debian Stretch with Xen 4.8
Roalt Zijlstra | webpower
2018-10-29 11:57:24 UTC
Hi there,

Ever since the Meltdown and Spectre kernel updates, and possibly also the
Xen 4.8 updates, we have been experiencing Dom0 crashes out of the blue:
sometimes after 1 day, sometimes after a few days or even 14 days,
completely at random.

We have two Dell P730 servers and two Dell P720 servers with this
behaviour. One thing to note is that we updated these machines to the
latest available firmware, because that is the most secure way. Then we
installed Debian Stretch with Xen 4.8 support.

We have done several installs; 4 servers seem to crash pretty quickly and
others don't. In the end we think we can trace it back to the
xen-4.8.4-pre version being stable and the xen-4.8.5-pre version being
unstable. This was more or less independent of the kernel we were using
(4.14 or 4.9.0-8-amd64). This is of course all Debian package numbering.

As a last resort, on one server we updated all DomU kernels of the Jessie
guests on this Dom0 to 4.9.0 from backports instead of the 3.16 kernel.
For now that seems to work, but the crashes are random, so it could happen
again at any time. The idea is that the older kernels are completely
unaware of Spectre and Meltdown and might cause trouble in Xen's kernel
support. I am not sure whether this is true at all, but we are pretty lost
as to what the actual cause is.

We also tested with CentOS and had these crashes there too with certain
combinations of kernel and Xen. The most recent updates seem to be more
stable, though. The most frustrating part is that there are absolutely no
logs to be found. No kernel oops or anything; the server just resets and
boots again.

Are there others experiencing problems like this? Do you see more frequent
server/kernel crashes on production servers?

Best regards,
Roalt Zijlstra
Jean-Louis Dupond
2018-10-30 11:12:39 UTC
Hi Roalt,

We are running Xen 4.6 on a 4.9.x kernel and CentOS 6, and are having the
same issues, but not as frequently as you describe: only about once a
month.

The systems (Dell R630) also crash/reset without any message, so
nothing is logged, unfortunately :(

The crashes were not observed on Xen 4.4.

We configured the servers to print kernel logs to SOL (Serial Over LAN
via iDRAC), and we log those.
But no servers have crashed since then, so we don't know yet whether this
will give us any more details.
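
A minimal sketch of the logging side, assuming the SOL session's output is
available as a stream of text lines. Nothing here is taken from the setup
described above, and the function name `stamp_lines` is hypothetical. Since
the host resets without warning, a timestamp on the last captured line may
help pin down when it went down:

```python
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator


def _utc_now() -> datetime:
    """Default clock: the current time in UTC."""
    return datetime.now(timezone.utc)


def stamp_lines(lines: Iterable[str],
                now: Callable[[], datetime] = _utc_now) -> Iterator[str]:
    """Prefix each console line with the UTC time at which it was read.

    `lines` is any iterable of text lines (for example sys.stdin when the
    serial console output is piped in). Trailing CR/LF is stripped because
    serial consoles usually emit "\r\n".
    """
    for line in lines:
        stamp = now().strftime("%Y-%m-%dT%H:%M:%SZ")
        yield f"{stamp} {line.rstrip()}"
```

One possible use is to pipe the console session into a small script that
loops over `stamp_lines(sys.stdin)` and prints each result with
`flush=True`, appending the output to a file on another machine.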


Thanks
Jean-Louis

Roalt Zijlstra | webpower
2018-10-30 15:39:05 UTC
Hi Jean-Louis,

Thanks for sharing this info.

Looking at my set of servers, we also have CentOS 6 servers which
used to crash after 3 or 4 weeks.
However, with the 4.9.112-32.el6.x86_64 kernel and Xen 4.8.4-1.el6 things
look more stable, with 65+ days of uptime on six servers running that
particular setup.

I still think that the 4.8.4-pre Xen package for Debian is the stable
version, so if you are thinking of upgrading to 4.8.5-pre, I would not
recommend it yet.

Best regards,

Roalt Zijlstra
Teamleader Infra & Deliverability



Volker Janzen
2018-11-01 12:42:55 UTC
Hi,

I had these crash problems with the Xen version in Debian stretch, too. After 3 to 7 days the Xen server rebooted without a log entry or anything else to observe. The problems started when Debian applied the first patches. Some updates made it better, the latest made it worse again. I checked hard drives and RAM and closely monitored metrics to find what might be the cause.

My solution, once I no longer suspected a hardware fault: build upstream Xen 4.11 for Debian stretch. I am currently running this setup with my own build of kernel 4.19. The machines are now running stably again.


Volker


John Naggets
2018-11-02 16:23:00 UTC
I was wondering if any of you have reported this bug/issue/problem back to
the Debian community? For example, on their bugs.debian.org website?

Volker Janzen
2018-11-02 18:53:38 UTC
Hi John,

The problem is that I cannot provide any metrics or logfiles showing an error. I can only tell that dom0 is rebooting for a reason that is not logged. I have no physical access to the server. I have received one other report about this kind of issue.

My assumption that the backported patches are the cause is based on the current 16-day uptime; until 16 days ago the server rebooted every 3-5 days. From my point of view, that won't make a useful bug report.
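
As a rough sanity check on that reasoning, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the crashes behaved like a Poisson process with a mean interval of 3 to 5 days; that model is not something measured in this thread, and the function name is hypothetical. Under that assumption, the chance of seeing 16 quiet days in a row if nothing had changed is small:

```python
import math


def chance_of_quiet_period(mean_days_between_crashes: float,
                           quiet_days: float) -> float:
    """Probability of zero crashes in `quiet_days`, assuming crashes follow
    a Poisson process with the given mean interval (an assumed model)."""
    return math.exp(-quiet_days / mean_days_between_crashes)


# Mean intervals matching the "every 3-5 days" observation, 16-day window.
for mean in (3, 4, 5):
    print(f"mean {mean} d: {chance_of_quiet_period(mean, 16):.3f}")
```

With these inputs the probability comes out at roughly 0.5% to 4%, which is suggestive but still only one observation on one machine.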

The other thing is that my two servers are now running upstream Xen and kernel, and I probably won't go back to the old versions in Debian stretch. One server had always been running upstream versions and never had a problem, which is why I updated the other one, too.


Best regards
Volker


John Naggets
2018-11-05 09:33:59 UTC
Hi,

Thanks for your feedback. I was wondering because I have just upgraded a
Debian 9 server to the latest kernel with the latest Xen packages from the
official Debian repo. The only difference is that I have an older IBM
server, already ~7 years old, patched with the latest BIOS/UEFI, and
so far so good: no crash. The uptime is 6 days for now. Here are the details
of my kernel and Xen packages.

ii  xen-hypervisor-4.8-amd64   4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10  amd64  Xen Hypervisor on AMD64
ii  linux-image-4.9.0-8-amd64  4.9.110-3+deb9u6                          amd64  Linux 4.9 for 64-bit PCs
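
To compare a host against the versions discussed in this thread, package
entries like those above can be parsed mechanically. A minimal sketch,
assuming the usual `dpkg -l` column order (status, name, version,
architecture, description) with each entry on a single line;
`parse_dpkg_line` and `upstream_base` are hypothetical helper names, and the
version splitting is a simplification, not full Debian version parsing
(epochs, for instance, are not handled):

```python
from typing import NamedTuple, Optional


class Package(NamedTuple):
    name: str
    version: str
    arch: str


def parse_dpkg_line(line: str) -> Optional[Package]:
    """Parse one installed-package line: "ii name version arch description".

    Returns None for lines that are not fully installed ("ii") entries.
    """
    fields = line.split(None, 4)
    if len(fields) < 4 or fields[0] != "ii":
        return None
    return Package(fields[1], fields[2], fields[3])


def upstream_base(version: str) -> str:
    """Return the part of a version string before the first '+', '-' or '~'."""
    for i, ch in enumerate(version):
        if ch in "+-~":
            return version[:i]
    return version
```

For the Xen entry above, `upstream_base` yields `4.8.4`, which is the part
that matters when telling the 4.8.4-based and 4.8.5-based packages apart.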

Regards,
J.


Roalt Zijlstra | webpower
2018-11-05 14:03:51 UTC
Hi John,

It could very well be that it is also restricted to some CPUs, but I am
inclined to believe that the DomU kernels in use can influence stability.
We did have a pretty busy SSL offloader running on a 3.16 kernel, which
might have caused the crashes.

Just for reference, we have the following two CPUs causing us trouble, but
I am not sure if it matters.
Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz

Roalt


John Naggets
2018-11-05 17:24:12 UTC
It could be as you mention... Are your DomUs PV? I am using
paravirtualization exclusively, and on this specific server I have the
following CPU:

Intel(R) Xeon(R) CPU E5645 @ 2.40GHz

Do you have the intel-microcode Debian package from the non-free repo
installed on your servers? I currently don't...

J.


On Mon, Nov 5, 2018 at 3:04 PM Roalt Zijlstra | webpower <
***@webpower.nl> wrote:

> Hi John,
>
> It could very well be that it is also restricted to some CPUs, but I am
> inclinded to believe that the used DomU kernels can influence stability.
> We did have a pretty busy SSL offloader running on a 3.16 kernel, which
> might have caused the crashes.
>
> Just for reference, we have the following two CPUs causing us trouble, but
> I am not sure if it matters.
> Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
> Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
>
> Roalt
>
>
> Op ma 5 nov. 2018 om 10:45 schreef John Naggets <***@gmail.com
> >:
>
>> Hi,
>>
>> Thanks for your feedback. I was wondering because I have just upgraded a
>> Debian 9 server to the latest kernel with the latest Xen packages from the
>> official Debian repo. The only difference is that I have an older IBM
>> server which is already ~7 years old patched with the latest BIOS/UEFI and
>> so far so good no crash. The uptime is 6 days for now. Here are the details
>> about my kernel and xen packages.
>>
>> ii xen-hypervisor-4.8-amd64
>> 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10 amd64 Xen Hypervisor on
>> AMD64
>> ii linux-image-4.9.0-8-amd64
>> 4.9.110-3+deb9u6 amd64 Linux 4.9 for 64-bit
>> PCs
>>
>> Regards,
>> J.
>>
>>
Roalt Zijlstra | webpower
2018-11-06 08:37:33 UTC
Permalink
Hi John,

Yes, we are using PV only and we only run Debian Linux on the servers. We
still have some DomU Jessie servers running with the stock kernel. We did
update our Dells to the latest firmware, so that includes a more recent
Intel microcode. But on Debian we have not enabled the intel-firmware yet:
we had so much instability and so many parameters that could be the
culprit that we did not want to add another.
If your server is very busy, I think the chance of a crash is higher.
We have seen crashes on our active MySQL databases, whereas the slave MySQL
database server did not crash as quickly. However, after using the slave
MySQL database as the primary database for a while (because we were
debugging the crashed master database), the slave could very well crash
too.

We have tested downgrading the Dell firmware (which also means using an
older Intel microcode), but that did not help. So having the latest
firmware is okay.
We are now testing a few scenarios:

- one server with an older kernel (4.9.0-4-amd64), with DomU 3.16
kernel, which has been running for 16 days now
- one server with the updated kernel (4.9.0-8-amd64), with DomU 3.16
kernel, which surprisingly has been running for 28 days now
- one server with the updated kernel (4.9.0-8-amd64), and all DomUs on
the backported 4.9 kernel.

It all doesn't really make much sense. We do expect that the older kernel
will keep on running and that the 4.9 DomUs will help to keep the servers
alive.
We have tested with 4.14 and 4.16 kernels (from backports), but that did
not make a difference in stability.
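
When comparing scenarios like these across several hosts, it may help to
record the exact Xen and kernel package versions on each dom0. The
following is only a minimal Python sketch, assuming input in the column
layout of the `dpkg -l` listing posted earlier in this thread; the names
`Package`, `parse_dpkg_line` and `dom0_versions` are hypothetical and not
part of any existing tool.

```python
from typing import Dict, NamedTuple, Optional


class Package(NamedTuple):
    name: str
    version: str
    arch: str


def parse_dpkg_line(line: str) -> Optional[Package]:
    """Parse one row of `dpkg -l` output.

    Returns None for header rows and for packages whose status
    column is not "ii" (installed).
    """
    # Columns: status, name, version, architecture, description.
    fields = line.split(None, 4)
    if len(fields) < 4 or fields[0] != "ii":
        return None
    return Package(name=fields[1], version=fields[2], arch=fields[3])


def dom0_versions(dpkg_output: str) -> Dict[str, str]:
    """Map Xen hypervisor and kernel image package names to versions."""
    wanted = ("xen-hypervisor-", "linux-image-")
    found = {}
    for line in dpkg_output.splitlines():
        pkg = parse_dpkg_line(line)
        if pkg is not None and pkg.name.startswith(wanted):
            found[pkg.name] = pkg.version
    return found


# Sample rows copied from the package listing posted in this thread.
sample = """\
ii  xen-hypervisor-4.8-amd64   4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10 amd64 Xen Hypervisor on AMD64
ii  linux-image-4.9.0-8-amd64  4.9.110-3+deb9u6                         amd64 Linux 4.9 for 64-bit PCs
"""

if __name__ == "__main__":
    for name, version in dom0_versions(sample).items():
        print(f"{name}: {version}")
```

In practice the text passed to `dom0_versions` would be the saved output
of `dpkg -l` from each dom0, so that the stable and unstable hosts can be
compared package by package.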

Best regards,

Roalt Zijlstra
Teamleader Infra & Deliverability

***@webpower.nl
+31 342 423 262
roalt.zijlstra
https://www.webpower-group.com



Michael
2018-11-06 09:08:02 UTC
Permalink
Hello,


I had the same issues.
In my case I tried Ubuntu 18.04 with Xen 4.9, and kernel version 4.15.9
was the only one that would start the DomU.

Tested on AMD Ryzen 1800X and Intel 8700.

In my case I got random system freezes, with uptimes between 7 and 30 days.

Older and newer kernels won't run.
This problem is still present; I am going to switch all services to Docker...

Regards,
Michael





Roalt Zijlstra | webpower
2018-11-06 11:40:24 UTC
Permalink
Hi Michael,

I am not sure about the status of Ubuntu and Xen.
My advice would be to downgrade your Xen version to the previous version
and see if that is more stable. For Debian that worked; it is less secure,
but crashing servers are not what you want. Maybe an updated Xen will have
stability fixes.


Best regards,

Roalt Zijlstra
Teamleader Infra & Deliverability

***@webpower.nl
+31 342 423 262
roalt.zijlstra
https://www.webpower-group.com



John Naggets
2018-11-06 17:10:38 UTC
Permalink
Thanks to both of you for your detailed information. Since neither of you
has the intel-microcode package installed, that can't be the issue. I do
not use that package myself either. So what is left? Well, it looks like I
am running on older hardware, at least 5 years old, and who knows whether
that has some kind of influence. It might be interesting to get in touch
with the hardware manufacturer (Dell?) and ask them if they have other
customers with this issue. The only problem here is that as soon as you
mention Debian they will stop listening to you :( If I remember correctly,
they only take support cases for supported commercial Linux distributions,
which basically boils down to RHEL and SLES... Maybe the Dell forums would
be a better alternative. I would definitely recommend filing a bug report
with Debian and maybe even Xen... If you have some kind of stack trace,
that would also be interesting to see.

J.

On Tue, Nov 6, 2018 at 9:37 AM Roalt Zijlstra | webpower <
***@webpower.nl> wrote:

> Hi John,
>
> Yes, we are using PV only and we only run Debian Linux on the servers. We
> still have some DomU Jessie servers running with the stock kernel. We did
> update our Dells to the latest firmware, so that does include more recent
> Intel microcode. But on Debian we have not enabled the intel-firmware yet:
> we had so much instability and so many parameters that could be the
> culprit that we did not want to add another.
> If your server is very busy, I think the chance of a crash is higher.
> We have seen crashes on our active MySQL databases, whereas the slave MySQL
> database server did not crash that quickly. However, after using the slave
> MySQL database as primary database for a while (because we were debugging
> the crashed master database), the slave could very well crash too.
>
> We have tested downgrading the Dell firmware (which also means using
> an older Intel microcode), but that did not help. So having the latest
> firmware is okay.
> We are now testing a few scenarios:
>
> - one server with an older kernel (4.9.0-4-amd64), with DomU 3.16
> kernel, which has been running for 16 days now
> - one server with the updated kernel (4.9.0-8-amd64), with DomU 3.16
> kernel, which surprisingly has been running for 28 days now
> - one server with the updated kernel (4.9.0-8-amd64), and all DomUs
> on the backported 4.9 kernel.
>
> It all doesn't really make much sense. We do expect that the
> older kernel will keep on running and that the 4.9 DomUs will help to keep
> the servers alive.
> We have tested with 4.14 and 4.16 kernels (from backports), but that did
> not make a difference in stability.
>
> Best regards,
>
> Roalt Zijlstra
> Teamleader Infra & Deliverability
>
>
> On Mon, 5 Nov 2018 at 18:24, John Naggets <***@gmail.com> wrote:
>
>> It could be as you mention... are your domUs PV? I am using
>> paravirtualization exclusively, and this specific server has the
>> following CPU:
>>
>> Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
>>
>> Do you have the intel-microcode Debian package from the non-free repo
>> installed on your servers? I currently don't...
>>
>> J.
>>
>>
>> On Mon, Nov 5, 2018 at 3:04 PM Roalt Zijlstra | webpower <
>> ***@webpower.nl> wrote:
>>
>>> Hi John,
>>>
>>> It could very well be that it is also restricted to some CPUs, but I am
>>> inclined to believe that the DomU kernels in use can influence stability.
>>> We did have a pretty busy SSL offloader running on a 3.16 kernel, which
>>> might have caused the crashes.
>>>
>>> Just for reference, we have the following two CPUs causing us trouble,
>>> but I am not sure if it matters.
>>> Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>>> Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
>>>
>>> Roalt
>>>
>>>
>>> On Mon, 5 Nov 2018 at 10:45, John Naggets <***@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for your feedback. I was wondering because I have just upgraded
>>>> a Debian 9 server to the latest kernel with the latest Xen packages from
>>>> the official Debian repo. The only difference is that I have an older IBM
>>>> server, already ~7 years old, patched with the latest BIOS/UEFI, and
>>>> so far so good: no crash. The uptime is 6 days for now. Here are the
>>>> details about my kernel and Xen packages.
>>>>
>>>> ii  xen-hypervisor-4.8-amd64   4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10  amd64  Xen Hypervisor on AMD64
>>>> ii  linux-image-4.9.0-8-amd64  4.9.110-3+deb9u6                          amd64  Linux 4.9 for 64-bit PCs
>>>>
>>>> Regards,
>>>> J.
>>>>
>>>>
>>>> On Fri, Nov 2, 2018 at 7:57 PM Volker Janzen <***@janzen.onl> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> the problem is that I cannot provide any metrics or logfiles showing
>>>>> an error. I can only tell that dom0 is rebooting for a reason that is not
>>>>> logged. I have no physical access to the server. I got one other report
>>>>> about this kind of issue.
>>>>>
>>>>> My assumption that the backported patches are the cause is based on the
>>>>> current 16-day uptime. Before that, the server rebooted every 3-5 days. It
>>>>> won’t be a useful bug report from my point of view.
>>>>>
>>>>> The other thing is that my two servers are now running upstream Xen
>>>>> and kernel, and I might not go back to the old versions of both in Debian
>>>>> stretch. The other server has always run upstream versions and never had
>>>>> a problem; that’s why I updated the other, too.
>>>>>
>>>>>
>>>>> Best regards
>>>>> Volker
>>>>>
>>>>>
>>>>> On 02.11.2018 at 17:23, John Naggets <***@gmail.com> wrote:
>>>>>
>>>>> I was wondering if any of you guys reported this bug/issue/problem
>>>>> back to the Debian community? For example on their bugs.debian.org website?
>>>>>
>>>>> On Thu, Nov 1, 2018 at 1:47 PM Volker Janzen <***@janzen.onl>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I had these crash problems with the Xen version in Debian stretch,
>>>>>> too. After 3 to 7 days the Xen server rebooted without a log entry or
>>>>>> anything else to observe. The problems started when the first patches were
>>>>>> applied by Debian. Some updates made it better, the last made it worse
>>>>>> again. I checked hard drives and RAM and closely monitored metrics to
>>>>>> find what might be the cause.
>>>>>>
>>>>>> My solution after no longer suspecting a hardware fault: build
>>>>>> upstream Xen 4.11 for Debian stretch. I am currently running this setup
>>>>>> with my own build of kernel 4.19. The machines are now running stably again.
>>>>>>
>>>>>>
>>>>>> Volker
>>>>>>
>>>>>>
johnny Strom
2018-11-07 06:57:29 UTC
Hi all.

We also use Xen on Debian Stretch; here is the info.


Server 1: Dell T330, 4 CPUs, about 2.5 years old, with Intel(R) Xeon(R) CPU
E3-1220 v5 @ 3.00GHz

Latest Xen package from Debian, intel-microcode 3.20180807a.1~deb9u1, with
kernel 4.9.110-3+deb9u6. DomUs are a mix of Stretch and Jessie with
kernels 3.16.59-1 and 4.9.110-3+deb9u6.

This one is stable.


Server 2: Dell R740, 6 months old, with Intel(R) Xeon(R) Gold 6132 CPU @
2.60GHz

Latest Xen package from Debian, intel-microcode 3.20180807a.1~deb9u1, with
kernel 4.9.110-3+deb9u4. DomUs are a mix of Stretch and Jessie with
kernels 3.16.59-1 and 4.9.110-3+deb9u6.

This one is stable.


Server 3: Lenovo RD650, about 4 years old, with Intel(R) Xeon(R) CPU
E5-2650 v3 @ 2.30GHz, with kernel 4.9.110-3+deb9u4

Latest Xen package from Debian, intel-microcode 3.20180703.2~deb9u1. DomUs
are a mix of Stretch and Jessie with kernels 3.16.59-1 and
4.9.110-3+deb9u6, and CentOS kernel 4.10.

This one is stable.


On all Xen Dom0 servers we have set
GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=2048M,max:2048M" and the sched-credit
weight of dom0 to 512.


xl sched-credit
Name                                ID Weight  Cap
Domain-0                             0    512    0


Best regards Johnny
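
For readers who want to try the dom0 settings described above, here is a
minimal sketch. It assumes a Debian-style /etc/default/grub and the Xen
credit scheduler. DRY_RUN, run() and xen_cmdline() are hypothetical names
made up for this illustration; by default the script only prints the
commands and executes nothing unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: reproduces the dom0 settings quoted in the message above.
# DRY_RUN, run() and xen_cmdline() are illustrative names, not part of Xen.
DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

xen_cmdline() {
    # $1 = fixed dom0 memory size, e.g. 2048M
    printf 'dom0_mem=%s,max:%s' "$1" "$1"
}

# Line to place in /etc/default/grub (assumed Debian location):
echo "GRUB_CMDLINE_XEN_DEFAULT=\"$(xen_cmdline 2048M)\""

run update-grub                          # regenerate the GRUB config; takes effect after a reboot
run xl sched-credit -d Domain-0 -w 512   # set the dom0 scheduler weight
run xl sched-credit                      # list weights, as in the output above
```

Pinning dom0 memory and raising its scheduler weight are the settings this
poster reports; whether they have anything to do with the stability of
those three servers is not established in the thread.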



Volker Janzen
2018-11-05 19:52:33 UTC
Hi John,

I have an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz server. I am not sure how old it is, perhaps 2.5 to 3 years. The server has some load, especially disk IO. I’m not sure if the BIOS has been updated. I once followed a post on this list to install the microcode update with the intel-microcode package, but I still see the processor bugs listed in /proc/cpuinfo. I need to verify this on another node first. I am using PV with pygrub for all domUs.

It’s sad that there is no log entry showing why Xen rebooted. :-(

As I said, it’s hard to tell what caused the issue. The uptime may have varied. It is also possible that some Debian kernel updates were more stable than others. The problems have been going on since January, with more or less frequent crashes. The only thing I can say for now is that the upstream packages have never been affected since I started using them. The only thing that is not pretty is that xen.gz is not generated with a version in my upstream build. I have not yet checked or understood why.


Regards
Volker
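
A note on the /proc/cpuinfo check mentioned above, with a small sketch. As
far as I know, the "bugs" line lists issues the kernel considers the CPU
model affected by, so it would normally stay there after a microcode
update; the loaded revision is on the "microcode" line, and the mitigation
status is reported under /sys/devices/system/cpu/vulnerabilities on kernels
that carry that reporting. The cpu_field helper is a hypothetical name made
up for this example, and what a PV dom0 reports here may differ from bare
metal.

```shell
#!/bin/sh
# Sketch only: cpu_field() is an illustrative helper, not a standard tool.
cpu_field() {
    # $1 = field name (e.g. microcode, bugs); $2 = file, default /proc/cpuinfo
    awk -F':' -v key="$1" '
        { name = $1; gsub(/[ \t]+$/, "", name) }              # strip padding after the name
        name == key { sub(/^[^:]*:[ \t]*/, ""); print; exit } # print value of first match
    ' "${2:-/proc/cpuinfo}"
}

if [ -r /proc/cpuinfo ]; then
    echo "microcode revision: $(cpu_field microcode)"
    echo "known CPU bugs:     $(cpu_field bugs)"
fi

# Mitigation status as reported by the kernel, one file per issue.
for f in /sys/devices/system/cpu/vulnerabilities/*; do
    if [ -r "$f" ]; then
        echo "$(basename "$f"): $(cat "$f")"
    fi
done
```

Comparing the microcode revision before and after installing the package
(and rebooting) is one way to tell whether the update was actually applied.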


Roalt Zijlstra | webpower
2018-11-02 19:19:21 UTC
Hi Volker,

Actually, I have not yet. I might do so, even though the log reporting is very
sparse. We also noticed this with the CentOS packages and we are still figuring
out what is going on. If I get some time I will file a bug. But I am sure the
cause is the kernel and Xen updates for Spectre/Meltdown.

Best regards,

Roalt Zijlstra
Teamleader Infra & Deliverability



Andreas Pflug
2018-11-13 15:19:29 UTC
On 29.10.18 at 12:57, Roalt Zijlstra | webpower wrote:
> We also tested with CentOS and we also had these crashes there with
> certain combinations of kernel/Xen. The most recent updates seem to be
> more stable though. The most frustrating part is that there are
> absolutely no logs to be found. No kernel oops or anything; the server
> just resets and boots again.

Have you tried netconsole logging to a different server? That might
catch the one interesting line of kernel logging that doesn't make
it to disk before the reboot.

Regards,

Andreas
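
A minimal sketch of the netconsole suggestion above. All addresses, the
interface name, the ports and the MAC address are placeholders to be
replaced with real values, and netconsole_param is a hypothetical helper
made up for this example. The script only prints the commands to run; it
does not load the module itself.

```shell
#!/bin/sh
# Sketch only: builds the netconsole= module parameter, whose format is
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# netconsole_param() is an illustrative helper, not a standard tool.
netconsole_param() {
    # $1 src port, $2 src IP, $3 device, $4 target port, $5 target IP, $6 target MAC
    printf '%s@%s/%s,%s@%s/%s' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values (192.0.2.0/24 is a documentation-only range).
PARAM="$(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.20 00:11:22:33:44:55)"

# To run as root on the dom0 that crashes:
echo "modprobe netconsole netconsole=$PARAM"

# To run on the receiving server (flags vary between netcat variants):
echo "nc -u -l -p 6666 | tee netconsole.log"
```

Note that netconsole forwards dom0 kernel messages only. Output from the
Xen hypervisor itself goes to the hypervisor console, so a serial console
may be needed to see a hypervisor-level panic.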