Saturday 29 June 2013

Power Cut!

Forty-seven days of faultless running were brought to an abrupt halt yesterday by a power outage, when quite a large area of Liverpool was blacked out at just after 13:00.

The aftermath revealed a small flaw in the Polly-Pi system:  The Raspberry Pi has no real time clock on board.  Normally this is no problem, and ntp is used to set and maintain the correct time.  But there's a snag after a power outage:  At the time the Polly-Pi server rebooted on restoration of power, the broadband modem and the local server running ntpd were both still in the process of starting up, so no time was available.  Consequently, everything came up with the wrong clock setting, the clock was actually showing 12:18, a time about forty minutes before the outage, while the correct time was actually 14:40.  I watched Polly come up and connect to all the PICNET and ETHPIC nodes and then relaxed, not noticing the clock was wrong.  A couple of minutes later ntp service kicked in and suddenly the time was 14:42.  This led to all the communications protocol's timeouts expiring, resulting in the loss of communication with all the nodes.  Once normal timing was resumed, all the nodes came back on line, of course, and everything has been fine since then.

So, some kind of change is needed to improve this.  What to do?

1.  The obvious solution is to provide some RTC (Real Time Clock) hardware.  This would completely avoid the problem, with the late restoration of ntp service merely tweaking the time to compensate for inaccuracies in the RTC.  However, this seems like a significant effort for what should hopefully be a rare occurrence - Perhaps I can do things easier in software?

2.  How about just waiting a bit longer before starting?  Polly already has a 60 second pause at startup to allow everything in the house to stabilise after a mains failure.  If this pause had been a couple of minutes longer the problem wouldn't have arisen.

3.  Or, better, what about a more intelligent start-up pause?  Could I check for network connectivity and/or ntp status and wait until things are OK?  How long should I wait?  What if ntp never comes up (Local ntp server fails to reboot, and internet is down.)  I need to 'give up' eventually and make do with the time I've got.

Decision:  I'm going to implement option 2 immediately, and consider option 3 later.

The new build is installed and running, and I also took the opportunity to perform a load of debian updates.

Tuesday 7 May 2013

Local syslog using rsyslogd

Phase one of the new logging system is to use the local syslog daemon.  This turned out to be pretty easy to set up, as glibc has the appropriate functions, so the line

syslog(LOG_MAKEPRI(LOG_LOCAL0, LOG_NOTICE), "Some message");

was all that was needed, and suddenly the log reports from Polly were appearing in the syslog file at /var/log/syslog

That was easy, wasn't it.  But I want to put the log messages in a separate file (or two) so now I need to learn how to configure rsyslogd in order to filter them out.  A brief study of the interweb and I'd added the command

local0.*    /var/log/polly.log

into rsyslogd's configs.  But it didn't work.  I fiddled.  I tweaked.  I learned how to validate the config by typing rsyslogd -N1  but no joy.  I tried other syntaxes (Just to confuse matters, rsyslogd accepts configuration commands in three totally different formats, and you are free to mix them in the same configuration file.)

if $syslogfacility-text == 'local0' then /var/log/polly.log

But my new log file remained steadfastly empty.

Eventually I spotted a log message, hidden in auth.log, complaining I was providing an "unknown facility/priority: 405".  Aha, does that tell me anything?

The facility/priority number in syslog is defined as priority plus 8 * facility, and that's what LOG_MAKEPRI is supposed to be doing for me.  Facility LOCAL0 is 16, and priority NOTICE is 5, so the combined value should be 133 (decimal).  A couple of printf lines soon revealed that LOG_MAKEPRI(LOG_LOCAL0, LOG_NOTICE) was returning 1029 (0x405).  It was soon obvious that the LOG_LOCAL0 macro had already got the multiplication by 8 included.  I changed the call to syslog to

syslog( (LOG_LOCAL0 | LOG_NOTICE), "message");

and my filter was suddenly working.

As far as I can see this is a bug in the glibc documentation.  Hmph!  Anyone know how I report that?  Before doing so, I downloaded the latest development source for glibc and they've already fixed it!  Personally I would have altered the documentation to match the code, but there you are.

Anyway, I was now able to create a proper rsyslogd configuration file, located in /etc/rsyslog.d/polly.conf containing

if $syslogfacility-text == 'local0' and ($msg startswith ' ETHPIC') then /var/log/polly-ethpic.log
if $syslogfacility-text == 'local0' and not ($msg startswith ' ETHPIC') then /var/log/polly.log

# Don't pass it on
local0.*    ~

This splits out the rather busy ETHPIC log entries into a different file.  The last line discards all LOCAL0 messages, so they don't get passed on to the later rule which puts everything into the messages file.

I am still weighing up the pros and cons of phase two of the syslog project.  This would involve reconfiguring rsyslogd to pass the log messages to another server, with the aim of reducing wear and tear on the Raspberry Pi's SD card.  Qustions which need to be considered include
  • Is SD card wear a problem?
  • What happens if the logging system is down?



Monday 6 May 2013

Result!

Well, my guess turned out to be correct, and the temporary version of the Polly server with no log files is responding instantly to all button presses and commands.  All I need to do now is create some kind of non-blocking log system.  That sounds like an awful lot of coding, but wait, someone's already been there:  Unix's syslog system already has the capability to log to a remote system and would appear to do all I need.  So all that remains is to find out how to drive it...

Saturday 4 May 2013

Up and Running

Apart from one intentional shutdown 25 days ago, PollyPi has been controlling my house without any problems for over a month, so I officially declare it to be fully operational!

What's next?

The first thing to look at is an intermittent problem where the response time of the system seems excessively slow.  The philosophy of putting all the control logic in the server and none in the remote nodes means when I press a button on a control box, a message has to go up and down before the button lights up.  This delay was never noticeable on the PC based system but it is sometimes obvious now.  The fact that it only occurs sometimes suggests it's not just a plain performance issue.  My gut feeling, based on no evidence whatsoever, is that the multiple writes to log files are clogging things up.  I can quickly confirm or disprove this by producing a test build which doesn't log anything.  Stay tuned ...

Saturday 6 April 2013

Progress At Last

I was finally able to devote a little time to the project, and I'm pleased to report that just a few minutes on the internet came up with a small tweak which some users reported had resolved their lockup problems.

So I added the magic incantation 
dwc_otg.speed=1
into the configuration file /boot/cmdline.txt and things are looking a lot better.

The up time is over four days at the time of writing, which is a lot better than it achieved before and I am currently letting Polly-Pi take over live control periodically, when a crash wouldn't be too inconvenient.

Now it's a bit early to be getting excited, all the previous incarnations of Polly have a total score of one crash in almost twenty years of continuous running, so four days hardly proves anything, but we'll see . . .

Update 09 April:  Still going strong. . .

Friday 1 February 2013

Serial Port Woes

The project to move the server to run on a Raspberry Pi has ground to a halt.  I need two serial ports, one for collecting CLI information from the phone line, and more importantly, one for PICNET.  Using USB-Serial adaptors I have been experiencing lockups and crashes, there would appear to be a problem somewhere in the USB driver stack.  I have tried various models of USB adaptor but so far they all seem to experience the same problem.

Unfortunately I have been busy with other activities so I haven't been able to investigate in greater depth yet.  So far I've determined that my server software continues to run after these lockups, recording in its log file that the PICNET network has gone down and also recording loss of communications to the display client and all the ETHPIC nodes.  (The ethernet port on the Pi hangs off the USB system.)   So it would appear that the USB system has locked up in some way.

As a quick test I set up a level shifter for the on-board serial port and used that for PICNET and it worked fine.  I added a USB Serial adaptor, but didn't tell the Polly server to use it, and still there were no problems and a test run of ten days was completed.  Finally, without rebooting or changing anything else I configured the CLI function in the server software to use the USB adaptor, and within a few hours everything was locked up again, despite the fact that there had been no phone calls so presumably no serial data to transfer.  This same configuration - "real" serial port for PICNET and USB adaptor for CLI has been used on the previous server PC for years without difficulties.

So I think it's time to trawl the internet to see if anyone else is having problems.