All Logs Go To Heaven
How to write logs that save time, reduce frustration, and help software thrive in the cloud.
Author’s Note: I originally wrote this as a confluence page at Vectra. They have kindly allowed me to repost this article online :-D.
Observability is a critical feature of any successful SaaS product. If we can observe what our code is doing, we can determine whether our software is
functioning as designed or if it is on the precipice of disaster. The most common and simplest form of observability is logging. Its simplicity, however, brings many drawbacks: logging too little can make issues seem impermeable, while logging too much can add unnecessary noise and reduce software performance. While there are tools available to help filter messages, poorly formatted or poorly worded log messages can perplex readers and undermine filters. And with so many log levels to choose from–Trace, Debug, Info, Warn, Error, Fatal, Severe, Critical–it can be hard to know how to correctly classify each message.
Fortunately, there is a simple way to combat these logging challenges and can be accomplished by anyone who reads or writes logs.
Do Not Be Proactive
Most areas of software design and development encourages us to plan for the future–to always be prepared to accommodate changing needs with minimal impact. That could be skewed to mean that logs should be added early on, because adding logging is adding code to the system. And while that is true, it must also be taken into account that a log message does not change functionality. Tests don’t validate logs nor would an end user need to interact with them.
Instead of logging for the future, log for the present. Add logs to investigate a problem, but replace them with other measures (like tests, traces, metrics), once the problem is resolved. A log used to report a potential problem is generally useful and should likely remain a fixture within a codebase, but a log used to track execution of a code may more often add noise.
In order to decide whether a log should be added consider this: if this message were invoked a million times, would it add value or would it add noise? For example, a message that read, “Could not connect to database” were invoked a million times, that could be cause for concern. However, something like, “Purchase successful,” invoked a million times wouldn’t indicate anything out of the ordinary and shouldn’t be logged.
While logs are cheap to add, too much logging can add noise, impact system performance, and be costly to store. So be sure to make every log count.
A Stack Trace Is Not a Log
Stack traces represent critical program failures that, if detected, should be fixed immediately. A solution to ailing software should be to fix the problem instead of expecting a restart.
Do Not Log Personally Identifying Information
Emails, usernames, passwords, home addresses, and phone numbers are but a few examples of things that should not be reported in logs. Legal authorities in many jurisdictions require this data to be protected, and reporting that information in logs undermines that protection.
Be Specific and Use Natural Language
Unlike errors, which may be captured and handled by software, log messages are consumed by humans. Humans, who sometimes may not be the stewards of the code that is misbehaving. Log messages should be descriptive and contain sufficient context:
Instead of
error returned from database
Write as
Failed to fetch order info from database.
Use a Structured Logger
Writing structured logs greatly improves the ability to search and filter messages, reduces the need for complex pattern matching schemes, and improves the ability to track the frequency of recurring messages over time.
Do Not Inject Variable Values into the Body of a Message
If you are writing structured logs, variable values should make up a log message’s attributes. One of the benefits of this is that you can write a simple match against a static string without having to use wildcards or regular expressions. Whereas injecting values inline undermines the purpose of having structured logs in the first place:
Instead of
logger.error(f”Error searching by attribute: {err}”)
Write as
logger.error(“Error searching by attribute.”, err=err)
Instead of
logger.info(f”Scanning {count} files”)
Write as
logger.info(“Scanning files.”, count=count)
Use Consistent Rules for Naming Structured Log Fields
Simple naming rules for structured log fields simplifies the ability to search for logs across a single context, like a tenant or a request id. However, inconsistent naming patterns complicate those efforts. Tenant ID, for example, could be written as tenantid , tenant_id , or any number of combinations.
Log field names should be written as lowercase and snake case. The reason for this is to minimize the ambiguity of searching for field names, especially in cases where acronyms and abbreviations–like id, which could be written like id , Id , or ID –would be used.
The field name err is reserved for error messages.
Use One Logger Per Request
Using a single logger per request allows you to propagate request context to all related messages, reducing the overhead of reinstantiating a new logger per package and repopulating it with relevant data. This is especially helpful when filtering logs by a particular request.
def set_logger_middleware(f):
@wraps(f)
def func_wrapper(*args, **kwargs):
g.log = get_logger(path=request.path, method=request.method)
return f(*args, **kwargs)
return func_wrapper
Full example:
https://gist.github.com/smousa/184329475ac1936c2d4b719d1c29e90d#file-context_logger-py-L10-L15
Use Middleware to Automatically Log Messages Sent to the External Caller
This allows code to be concise and keeps your logging consistent. Errors should be logged at the error level, while successful calls can be optionally logged at the info level.
def log_response_middleware(*args, **kwargs):
@wraps(f)
def func_wrapper(*args, **kwargs):
resp, status_code = f(*args, **kwargs)
if status_code < 200 or status_code >= 300:
g.log.error(“Could not handle request”, err=resp.err)
return resp, status_code
return func_wrapper
Full Example:
https://gist.github.com/smousa/184329475ac1936c2d4b719d1c29e90d#file-context_logger-py-L17-L24
Instead of Logs, Use…
- Tests. When logs are added to diagnose an issue; upon writing a fix, add tests to support the use case in which the problem occurred and delete the log messages. Tests are a powerful defense against regressive behavior, so if subsequent code changes break existing functionality, it is better to detect that problem before code is released with a test, rather than after the code is released with a log.
- Tracing. The downsides of logging code execution (using messages like “Entering function X”/”Exiting Function X”) is that they have to be reported on every invocation: they dilute the logs with largely useless messages, and they can impact software performance. Tracers like X-Ray, Cloud Trace, and Distributed Tracing allows you to set the frequency in which tracing occurs, tracks the duration of invocations, and allows additional metadata to be reported on a trace.
- Metrics. Analytical usage data to track batch sizes, performance, and usage can be tracked with metrics. Unlike logs, which typically tracks usage by log levels, metrics promotes alerting on patterns of a collection of different data and can help with efficient autoscaling of services.
- Health checks (Readiness/Liveliness). Unlike logging the state of a service, a health check should be used in conjunction with microservice startup to determine the health and availability of a service.
Apply Log Levels Consistently
- Fatal: Should only be used at service startup/teardown and never at the user request level. Fatal logs should address problems with configuration or service communication that severely impairs request handling.
- Warning: Should only be used at service startup/teardown and never at the user request level. Warning logs should address problems with configuration that indicates the service is running in a non-production state. Certain configuration issues include running a service with a test
mode enabled, authentication disabled, or in connection to an emulator. - Info: Can be used to communicate information relating to service startup and teardown. However, info logs should be used temporarily and sparingly during request handling to capture debugging information relating to a specific problem. Once issues are resolved, those logs should be replaced with other observability methods.
- Error: In addition to capturing messages returned to external caller, error logs should also report issues that would not necessarily be
reported externally. This may include retries or when a backup strategy is used to complete a request.
Additional Information
- https://nedbatchelder.com/text/log-style-guide.html
- https://dave.cheney.net/2015/11/05/lets-talk-about-logging\
- http://blog.afropolymath.co/intercept-route-requests-using-decorators-in-flask
- https://www.structlog.org/en/stable/api.html
- https://newrelic.com/blog/how-to-relic/python-structured-logging
Thanks for reading!