Discussion:
Watchdog support in XEN?
Andreas Bach Aaen (AH/TED)
2005-07-26 19:53:54 UTC
Many physical PCs have a hardware watchdog. This is a good way of getting up
and running again if a bug somehow causes a deadlock.
If you run Linux in a DomU you have to use a software watchdog. The kernel
provides one. But could there be scenarios where the DomU locks up without
the software watchdog going off?
Interrupts, for example?

So my question is: does Xen provide a "hardware" watchdog that the user
domains can use?
It might be safer for the counter and trigger to reside in the Xen domain
than in the user domains.

Regards,
--
Andreas Bach Aaen System Developer, M. Sc.
Ericsson Danmark A/S tel: +45 89 38 51 00
Skanderborgvej 232 fax: +45 89 38 51 01
8260 Viby J Denmark ***@ericsson.com
Ian Pratt
2005-07-26 19:57:34 UTC
Post by Andreas Bach Aaen (AH/TED)
Many physical PCs have a hardware watchdog. This is a good
way of getting up and running again if you somehow have a bug
that causes a deadlock.
If you run Linux in a DomU you have to use a software
watchdog. The kernel provides one. But could there be
scenarios where the DomU locks up without the software
watchdog going off?
Interrupts, for example?
So my question is: does Xen provide a "hardware" watchdog that
the user domains can use?
It might be safer for the counter and trigger to reside in the
Xen domain than in the user domains.
It doesn't today, but could easily be added. A better approach might be
to do some more sophisticated higher-level liveness monitoring in
domain0, then use the tools to reboot the domain.

Ian
Mark Williamson
2005-07-26 20:13:22 UTC
Post by Andreas Bach Aaen (AH/TED)
So my question is: does Xen provide a "hardware" watchdog that the user
domains can use?
It might be safer for the counter and trigger to reside in the Xen domain
than in the user domains.
A sensible and straightforward way to do this would be to wait for the
XenStore code to be fully merged, then set up an attribute in the XenStore
which is written to periodically by the domU to say that it's live. A daemon
in dom0 can watch this and restart the domain if the attribute isn't updated
for a while. The hypervisor won't need to know about this.
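
A minimal sketch of how such a scheme might look, in Python for brevity. The dictionary stands in for the XenStore, and `HeartbeatMonitor`, `domu_heartbeat` and the key name are hypothetical names chosen for illustration, not an existing Xen interface. The clock is passed in explicitly so the logic can be exercised without waiting.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


def domu_heartbeat(store: Dict[str, int], key: str) -> None:
    """DomU side: bump the liveness value (the domU is the only writer)."""
    store[key] = store.get(key, 0) + 1


@dataclass
class HeartbeatMonitor:
    """Dom0 side: restart a domain whose value has not changed for `timeout`."""
    store: Dict[str, int]             # stand-in for the XenStore (assumption)
    timeout: float                    # seconds of silence that are tolerated
    restart: Callable[[str], None]    # hypothetical "restart this domain" hook
    _seen: Dict[str, Tuple[int, float]] = field(default_factory=dict)

    def poll(self, now: float) -> None:
        for key, value in self.store.items():
            last = self._seen.get(key)
            if last is None or last[0] != value:
                # First sighting, or the domU wrote a new value: note the time.
                self._seen[key] = (value, now)
            elif now - last[1] >= self.timeout:
                # Value unchanged for too long: treat the domU as hung.
                self.restart(key)
                self._seen[key] = (value, now)  # restart the grace period
```

The monitor only reads the value and records when it last saw a change, so the hypervisor is not involved and there is a single writer per key.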

Cheers,
Mark
Andreas Bach Aaen (AH/TED)
2005-07-26 20:21:42 UTC
Post by Mark Williamson
Post by Andreas Bach Aaen (AH/TED)
So my question is: does Xen provide a "hardware" watchdog that the user
domains can use?
It might be safer for the counter and trigger to reside in the Xen domain
than in the user domains.
A sensible and straightforward way to do this would be to wait for the
XenStore code to be fully merged, then set up an attribute in the XenStore
which is written to periodically by the domU to say that it's live. A
daemon in dom0 can watch this and restart the domain if the attribute isn't
updated for a while. The hypervisor won't need to know about this.
This seems like a good idea. I expect that writes to the XenStore are atomic,
so the dom0 daemon could simply increment a timer that the domU needs to
reset once in a while. This could be built into the userspace watchdog
daemon, which would automatically detect that it is inside a virtual machine
and not directly on real hardware. What is the cleanest way to detect whether
you are running in a domU or not?

Regards,
--
Andreas Bach Aaen System Developer, M. Sc.
Ericsson Danmark A/S tel: +45 89 38 51 00
Skanderborgvej 232 fax: +45 89 38 51 01
8260 Viby J Denmark ***@ericsson.com
Mark Williamson
2005-07-26 20:48:35 UTC
Post by Andreas Bach Aaen (AH/TED)
This seems like a good idea. I expect that writes to the XenStore are atomic,
so the dom0 daemon could simply increment a timer that the domU needs to
reset once in a while.
Yes, something like that should be fine, although I think the intention is to
have only one writer to each portion of the store. How about having the domU
increment or toggle the value, with the dom0 daemon noticing when it hasn't
been updated in a while?
Post by Andreas Bach Aaen (AH/TED)
This could be built into the userspace watchdog daemon, which would
automatically detect that it is inside a virtual machine and not directly on
real hardware. What is the cleanest way to detect whether you are running in
a domU or not?
There's a flag in the Xen startinfo that tells a domain if it's dom0 or not
(and if it's a driver domain, etc.).
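
From userspace, one possible check is sketched below in Python. It does not read the start-info flag itself. Instead it assumes a Linux kernel that exposes `/sys/hypervisor/type` and, with xenfs mounted, `/proc/xen/capabilities` (which contains `control_d` in dom0). Neither file is mentioned in this thread, so treat both paths as assumptions about the running system; the function names are hypothetical.

```python
from pathlib import Path


def classify_xen_domain(hypervisor_type: str, capabilities: str) -> str:
    """Classify from the contents of the two files (passed in as strings)."""
    if hypervisor_type.strip() != "xen":
        return "not-xen"
    # Assumption: dom0 advertises "control_d" in /proc/xen/capabilities.
    if "control_d" in capabilities:
        return "dom0"
    return "domU"


def read_or_empty(path: str) -> str:
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def detect() -> str:
    """Best-effort detection using the assumed paths on the running system."""
    return classify_xen_domain(
        read_or_empty("/sys/hypervisor/type"),
        read_or_empty("/proc/xen/capabilities"),
    )
```

Keeping the classification separate from the file reads makes it easy to check the logic on a machine that is not running Xen at all.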

The most straightforward thing to do is probably to write a kernel driver
using the Linux watchdog API and then have that talk to the xenstore.
This'll allow you to use the standard daemon without any changes.

The kernel API for watchdogs is fairly simple so it should be straightforward
once the XenStore / XenBus stuff is all up and running.
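
For reference, the userspace side of the Linux watchdog API is small. The Python sketch below shows roughly what a standard daemon does with the device: open it, write to it periodically to keep the timer from expiring, and write the "magic close" character `V` before closing so the watchdog is disarmed (unless the driver is configured never to stop). The helper names are hypothetical, and the device path is a parameter so the code can be tried against an ordinary file.

```python
import os


def open_watchdog(path: str = "/dev/watchdog") -> int:
    """Open the watchdog device; on a real device this starts the timer."""
    return os.open(path, os.O_WRONLY)


def pet(fd: int) -> None:
    """Keep-alive: any write to the device resets the countdown."""
    os.write(fd, b"\0")


def close_watchdog(fd: int) -> None:
    """Write the magic character 'V' so that closing disarms the watchdog."""
    os.write(fd, b"V")
    os.close(fd)
```

A Xen-aware kernel driver behind this device would leave the daemon unchanged; only what happens on each write would differ.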

Cheers,
Mark
Eric S. Johansson
2005-07-27 19:23:17 UTC
Post by Mark Williamson
Post by Andreas Bach Aaen (AH/TED)
So my question is: does Xen provide a "hardware" watchdog that the user
domains can use?
It might be safer for the counter and trigger to reside in the Xen domain
than in the user domains.
A sensible and straightforward way to do this would be to wait for the
XenStore code to be fully merged, then set up an attribute in the XenStore
which is written to periodically by the domU to say that it's live. A daemon
in dom0 can watch this and restart the domain if the attribute isn't updated
for a while. The hypervisor won't need to know about this.
Forgive me if I'm misunderstanding something, but it seems to me that one
would want to use a hardware watchdog on dom0 so that if the system should
well and truly fail, an undeniable reset could be applied to the entire
system. But if dom0 is healthy, then yes, a software watchdog in dom0
paying attention to and deciding when to reset the various domUs should
be sufficient.

---eric
Mark Williamson
2005-07-27 19:58:20 UTC
Post by Eric S. Johansson
Forgive me if I'm misunderstanding something, but it seems to me that one
would want to use a hardware watchdog on dom0 so that if the system should
well and truly fail, an undeniable reset could be applied to the entire
system. But if dom0 is healthy, then yes, a software watchdog in dom0
paying attention to and deciding when to reset the various domUs should
be sufficient.
I think you'd ideally want both:
* Hardware watchdog in dom0 in case dom0 or Xen crashes
* Software watchdog for domU is provided by dom0 (which we can guarantee is up
because of the hardware watchdog) via the store

In the absence of a hardware watchdog, you could also implement a software
watchdog for dom0 in Xen itself, which is likely to be the most reliable
piece of code in the system and shouldn't lock up even if dom0 does.
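
The counter-and-trigger idea for such a watchdog can be sketched as a plain countdown, shown here in Python purely as an illustration. `CountdownWatchdog` and its callback are hypothetical names; nothing like this exists in Xen according to this thread. The supervisor calls `tick()` on its own timer, the supervised domain calls `kick()` while it is healthy, and the callback fires once if the count runs out.

```python
from typing import Callable


class CountdownWatchdog:
    """Countdown owned by the supervisor; the supervised side only kicks it."""

    def __init__(self, timeout_ticks: int, on_expire: Callable[[], None]):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.on_expire = on_expire    # e.g. "reset dom0" (hypothetical hook)
        self.fired = False

    def kick(self) -> None:
        """Called by the supervised domain to signal that it is alive."""
        self.remaining = self.timeout_ticks

    def tick(self) -> None:
        """Called by the supervisor on each timer tick."""
        if self.fired:
            return                    # already triggered; fire only once
        self.remaining -= 1
        if self.remaining <= 0:
            self.fired = True
            self.on_expire()
```

Because the counter and the trigger both live with the supervisor, a hung domain cannot prevent the expiry from being noticed.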

Cheers,
Mark
Andreas Bach Aaen (AH/TED)
2005-08-01 07:00:22 UTC
Post by Mark Williamson
* Hardware watchdog in dom0 in case dom0 or Xen crashes
* Software watchdog for domU is provided by dom0 (which we can guarantee is
up because of the hardware watchdog) via the store
In the absence of a hardware watchdog, you could also implement a software
watchdog for dom0 in Xen itself, which is likely to be the most reliable
piece of code in the system and shouldn't lock up even if dom0 does.
Correct, Mark. This is the solution that I would prefer. Unfortunately I won't
have time to program this. I hope that others do before I really need
the feature. For now it's just on my wish list.

Regards,
--
Andreas Bach Aaen System Developer, M. Sc.
Ericsson Danmark A/S tel: +45 89 38 51 00
Skanderborgvej 232 fax: +45 89 38 51 01
8260 Viby J Denmark ***@ericsson.com