Thursday, November 25, 2021

Thoughts on Logging



  1. What should go into a log? 
  2. Enough, but not too much.

Those two phrases are contradictory and context dependent. This is why logging has different levels. They may be expressed as numbers or words. Each more verbose level is inclusive of the less verbose levels. Sematext has a nice in-depth overview

I can’t count the number of times that initial log collection didn’t uncover the root cause of a problem. I’ve also disabled or crashed the target system by turning logging too high. Balance is the key.

At the one extreme, Fatal is “don’t log unless you’re dying” which is awfully presumptive that the application will be able to realize it’s dying. A more common option is Error (only log problems) or Info (log what you’re doing, not how).

At the other extreme, Trace level logging is preemptive print-debugging. Each important function in each routine, when did it start, what was the input, what was the outcome. If issuing a special build for a customer to troubleshoot is a common occurrence, or something that you wish you could do, then Trace logging is for you. If the application still performs acceptably in production with Trace on, you’re probably not logging enough. Consider putting a built-in timer into the application to shut Trace off after a set amount of time.

Each line of a log should start with the date and time. UTC time is best. Logging in the local time of the device is regrettable, but acceptable if you’ll never need to correlate the events from one device with another one in another time zone. Logging in local time will cost unexpected effort and cause problems, but sometimes it can’t be avoided.

The time format should be ISO8601. Unix epoch is not ideal because people can’t read it. The format of your favorite locale is not ideal because people will get confused. ISO8601 or GTFO.

Entities should have identifiers so they can be traced. This is pretty simple if you’re the only application of your type running on a single physical system. The thing writing the log is the entity that matters and the hostname is plenty of ID. It gets deep if you’re an elastically scaling micro service on multiple cloud providers. Is the thing writing the log what matters any more? You might even architecturally be able to say that you don’t care and it’s all going to wash out at the services level… as long as you never have to debug or do forensics.

Tasks (or threads, or forked processes, or containers) should have identifiers so they can be traced. This is a remarkably deep problem; don’t let perfect be the enemy of good. Again, your architecture might suggest this isn’t necessary, and that suggestion may be correct in many circumstances.

Each line of the log should include a single event with time, entities, and tasks all recorded up front. Some of that feels repetitive and like you would only need to emit it once per file. That would be fine if you never need to correlate with the logs from other systems in order to trace a transaction.

It may be tempting to log an entire transaction as a single event, but that presumes your application will stay alive long enough to see the transaction neatly complete. It is better to log the start and end of a transaction as separate events. 

It is also tempting to use Event IDs instead of language to describe what is happening, since “404” takes less space than “Page not found.” A human readable string should be included as well so that humans can learn your IDs by use. This also allows you to potentially translate your event ID into different languages for use by different humans. You should never use translated strings without event IDs though, this is annoying. Why? Because it makes things harder for organizations that operate in many languages. Microsoft did not keep this behavior when upgrading Windows, but customers and log analysis vendors have still needed to build separate Windows log parsers for every language they'll support.

Writing a log to a file is the obvious path, and there’s a whole world of tooling for rerouting that file output into searchable storage. The simplicity of a file or cloud bucket is a huge win; if you’re up at all, you can probably write a file; if you can’t write to a file, the reason is probably obvious and more widespread than a subtle problem in your program. However, there are also reasons to write to a structured data system. Log data is at least semistructured, and using a structured system allows indexing and search. There’s cost down this road though; either the commercial cost of a tool or service, or the performance cost of a database. Furthermore, the log consumer now has to get into the database, or your product has to provide an auditing API. In my opinion databases are rarely a good idea for log storage.

Some other thoughts on logging: