Tuesday, March 06, 2007

Happy Trails -- The Joy Of Logging

Logging is one of the most important infrastructure activates that a piece of software does. Good logging saves hours of support and troubleshooting time. For instance, every error that is log should indicate a single place in the source code where the error occured. You can do this with unique error code numbers. I often just vary the error text slightly, so errorFileNotFound says "can't find file %s" in one place and "can't find file: %s" in another. The ":" is all I need to know where to look.

I used to develop GUIs and logging wasn't necessary. In those old days of manual testing, you simply ran the app and could see immediately if the labels didn't line up, or the selection rectangle was wonky. Then I moved into telephony, where servers run in very programmer-unfriendly environments. Think ten minutes to reboot a PBX; after you found the PBX technician to do it. If that wasn't enough, random errors were common, such as "some callers never heard the final prompt".

The only solution is good logs. Good being defined as:
  • lots of detail. each log message should include date & time (down to msec), thread, and logging level. Log files should of course be ordinary text files.
  • configurable logging levels so you can increase/decrease/turn off logging without having to re-build, or even re-start the software.
  • rolling log files that have a quota; so the hard disk won't fill up
  • Errors marked in a very visible way (that is, easy to search for).
Log messages don't have to be pretty. Only programmers look at them.

The inventor of Forth, Charles H. Moore, declared that exactly two stacks were what was needed. I always found that remark suspect, having being taught that software should do something either zero, one, or an infinite number of times. However I now believe that two log files are needed. One is the trace log described above: heavy in detail and ugly to read. The second is an error log that contains only errors and startup, shutdown messages. The error log is several orders of magnitude smaller so you can see several days of activity at a glance. All error log messages also appear in the trace log so you can cross-correlate things.

Someone once remarked that they loved Star Trek because every episode some key piece of equipment broke but they always managed to fix it. Scotty would say "run the Level 3 diagnostics". We didn't know what Level 3 diagnostics were, but we know we want them for our software. Good software should have some idea of how sick or healthy it is. A basic health monitor would track: the number of errors, the number of sessions, the number of users, and other high-level stuff. This should be written to the error log once per hour. It will help quickly pinpoint when a problem occurred.

In fact counters are one type of trails, which are things that track software behaviour. My favourite is a Trail object that you give strings to. It builds up a trail of these strings, separated by ';', such as "RING;ANS;PLAY;DELAY;PLAY;DISC". A very short summary of what the software did on a phone call. Trails are good for logging, but even better for unit tests. Adding counters and trails into your classes are a great way to track what they did. You can add wrappers or fancy mock objects. But simple trails often do the job. For example, consider a class that had to send & receive complicate TCP/IP messages. There's a lot to unit test here, including partial or garbled messages, and broken connections. Not to mention race conditions and deadlock! But a simple m_msgCount++ in the send method is very helpful in unit tests because you can very quickly check how many messages were sent. Yes, the counter simply tracks a method call; it doesn't really track whether TCP/IP activity occured. But you have unit tested the lower-level i/o classes haven't you?

At a philosophical level, trails are useful because software exists in time. You can't see software in the way you can see a bridge or a house. All you can see is tiny portions of source code, or pixels on a screen, or other output. We've all experienced coming across a bug in the source code and wondering "how the hell did this ever work?". There are many paths through the code and the broken path may be skipped; except of course when it matters, like you're doing a demo. Basically software is invisible and trails help you to see it.

SpeakRight 0.0.2 is out

Here It's definitely starting to take shape. Simple directed dialog apps can now be done.

Next up: mixed initiative and more SROs.