Saturday 29 June 2013

Power Cut!

Forty-seven days of faultless running were brought to an abrupt halt yesterday by a power outage, when quite a large area of Liverpool was blacked out at just after 13:00.

The aftermath revealed a small flaw in the Polly-Pi system:  the Raspberry Pi has no real-time clock on board.  Normally this is no problem, as ntp is used to set and maintain the correct time.  But there's a snag after a power outage:  when the Polly-Pi server rebooted on restoration of power, the broadband modem and the local server running ntpd were both still starting up, so no time source was available.  Consequently, everything came up with the wrong clock setting.  The clock was showing 12:18, about forty minutes before the outage, while the correct time was 14:40.  I watched Polly come up and connect to all the PICNET and ETHPIC nodes and then relaxed, not noticing the clock was wrong.  A couple of minutes later the ntp service kicked in and the time suddenly jumped to 14:42.  That jump of well over two hours made all the communications protocol's timeouts expire at once, resulting in the loss of communication with all the nodes.  Once normal timing resumed, all the nodes came back on line, of course, and everything has been fine since then.

So, some kind of change is needed to improve this.  What to do?

1.  The obvious solution is to provide some RTC (Real Time Clock) hardware.  This would completely avoid the problem, with the late restoration of ntp service merely tweaking the time to compensate for inaccuracies in the RTC.  However, this seems like a significant effort for what should hopefully be a rare occurrence.  Perhaps I can do things more easily in software?

2.  How about just waiting a bit longer before starting?  Polly already has a 60 second pause at startup to allow everything in the house to stabilise after a mains failure.  If this pause had been a couple of minutes longer the problem wouldn't have arisen.

3.  Or, better, what about a more intelligent start-up pause?  Could I check for network connectivity and/or ntp status and wait until things are OK?  How long should I wait?  What if ntp never comes up (the local ntp server fails to reboot, and the internet is down)?  I need to 'give up' eventually and make do with the time I've got.
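
As a rough illustration of option 3, here is a minimal Python sketch of a start-up pause that polls until the clock looks right and gives up after a fixed limit.  It is only a sketch under stated assumptions, not Polly's actual code:  the host name "ntp.local", the function names, the 5 second tolerance and the 10 minute limit are all made up for the example.  The check sends a basic SNTP request to the local time server and compares the reply with the system clock.  The give-up deadline is measured with the monotonic clock, so it is not thrown off when ntp finally steps the wall-clock time.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def query_ntp_time(host, timeout=2.0):
    """Return the server's time as Unix seconds, or None if unavailable."""
    # 48-byte SNTP client request: LI=0, version 3, mode 3 (client).
    request = b"\x1b" + 47 * b"\0"
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(request, (host, 123))
            reply, _ = sock.recvfrom(48)
    except OSError:  # covers timeouts, DNS failures, unreachable network
        return None
    if len(reply) < 48:
        return None
    stratum = reply[1]
    if not 0 < stratum < 16:  # 0 or 16: the server itself is not synchronised
        return None
    # Transmit timestamp: 32-bit seconds and 32-bit fraction at bytes 40..47.
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_is_correct(host="ntp.local", tolerance=5.0):
    """True if the system clock agrees with the server to within `tolerance` s.

    "ntp.local" is a placeholder for the local ntp server's name.
    """
    server_time = query_ntp_time(host)
    return server_time is not None and abs(time.time() - server_time) <= tolerance


def wait_for_good_clock(check=clock_is_correct, max_wait=600.0,
                        poll_interval=10.0,
                        monotonic=time.monotonic, sleep=time.sleep):
    """Poll `check` until it succeeds; give up after roughly `max_wait` seconds.

    Returns True if the clock was confirmed, False if we gave up.
    """
    deadline = monotonic() + max_wait
    while True:
        if check():
            return True
        if monotonic() >= deadline:
            return False  # make do with the time we've got
        sleep(poll_interval)
```

The start-up code would call wait_for_good_clock() in place of the fixed pause and carry on either way, perhaps logging a warning when it returns False.  Passing the check, clock and sleep functions in as parameters is just a convenience so the loop can be tried out without a network.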

Decision:  I'm going to implement option 2 immediately, and consider option 3 later.

The new build is installed and running, and I also took the opportunity to apply a load of Debian updates.
