DevOps — Logging Standards and Best Practices

Tony

June 25, 2024 (Updated: November 22, 2024)

Logging is an art form, and log messages are one of the primary tools developers use to troubleshoot issues in production. However, the importance of logging standards is often overlooked by developers. Logs are like insurance; they might not seem necessary during normal operations, but they become invaluable when something goes wrong. A well-crafted log entry serves as our evidence to the outside world.

Logs Introduction

What Are Logs

Wikipedia defines a server log as one or more log files automatically created and maintained by a server, containing a list of the activities it performs.

A well-maintained log file provides developers with an accurate record of the system, aiding in pinpointing the details and root causes of system errors. In Java applications, log files are commonly used to record critical logical parameters and exceptions during the application's runtime, supported by log collection systems (like ELK, DTM) to build a system monitoring framework.

Why We Need Logs

There are at least three reasons:

Logging for Debugging: Utilizing logs to document variables or specific logic segments significantly aids in tracking the application's execution flow. This practice is particularly useful for understanding the sequence of operations and interactions within the application, providing a clear view of the execution path and helping to pinpoint where things may have gone awry.
Rapid Issue Identification: When applications encounter exceptions or malfunctions, logs are invaluable for swiftly locating the root cause of the problem, facilitating quicker resolution. Since debugging in a live production environment is often not feasible, and replicating a production environment in a test setting can be both time-consuming and labor-intensive, relying on the detailed information captured in logs becomes crucial for issue diagnosis.
Monitoring, Alerting, & User Activity Audits: Once formatted, logs can be integrated with monitoring systems to configure multidimensional monitoring views. This capability allows teams to keep a pulse on system health and performance or to record and analyze user actions for auditing purposes. By leveraging logs for data collection and analysis, organizations can construct a comprehensive dashboard for business intelligence.
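To make the "once formatted" point concrete, here is a minimal sketch of a formatter that emits each record as one JSON object per line, a shape most log collectors can parse. The class name JsonFormatter and the chosen field names are assumptions for illustration, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),   # default asctime-style timestamp
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),    # message with its arguments applied
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
json_logger = logging.getLogger("shop")
json_logger.addHandler(handler)
json_logger.setLevel(logging.INFO)

json_logger.info("order %s paid", "42")
```

A monitoring system can then filter or aggregate on the level and logger fields instead of parsing free text.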

When to Use Logs

During Code Initialization or Logic Entry Points: It is critical to log the startup parameters of a system or service. The initialization process of core modules or components often relies on key configurations, which, depending on the parameters, may result in different services being offered. It is essential to record INFO-level logs here, documenting the parameters and the state of the service upon startup completion.
Programming Language Exception Alerts: Exceptions caught in this category signal to developers that attention is needed; they represent highly valuable error reports. Logs should be recorded as appropriate, with the use of WARN or ERROR levels depending on the business context and the nature of the exception.
Deviations in Business Process Flow: Instances where the outcome in project code does not align with expectations also constitute a scenario for logging. In essence, all branching points in the process should be considered for logging. The decision to log such events depends on whether the developer deems the situation tolerable. Common scenarios that warrant logging include incorrect external parameters and data processing issues leading to return codes falling outside of acceptable ranges.
Key Actions in System/Business Core Logic: Business actions triggered by core entities within the system require extra attention and are crucial indicators of the system's operational health. It is recommended to log these actions at the INFO level.
Remote Calls to Third-party Services: In a microservices architecture, an important principle is that third-party services are never fully trusted. It is advisable to log both the request and the response parameters of every remote call to a third-party service. With that record on your own side, you can still troubleshoot across service boundaries when the third party's logs are missing or unavailable to you.
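As a rough sketch of the remote-call scenario, the wrapper below logs the request, the response, and the elapsed time around any callable. The names call_with_logging and fake_rate_lookup are hypothetical, and the stand-in function replaces what would be a real client call; sensitive parameters would need masking before being logged this way.

```python
import logging
import time

logger = logging.getLogger("remote_calls")


def call_with_logging(service_name, func, **params):
    """Call a third-party function, logging request, response, and duration."""
    logger.info("remote call start [service=%r, params=%r]", service_name, params)
    started = time.monotonic()
    try:
        response = func(**params)
    except Exception:
        elapsed_ms = (time.monotonic() - started) * 1000
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception(
            "remote call failed [service=%r, params=%r, elapsed_ms=%.1f]",
            service_name, params, elapsed_ms,
        )
        raise  # the caller still decides how to handle the failure
    elapsed_ms = (time.monotonic() - started) * 1000
    logger.info(
        "remote call done [service=%r, response=%r, elapsed_ms=%.1f]",
        service_name, response, elapsed_ms,
    )
    return response


# Hypothetical stand-in for a real third-party client call.
def fake_rate_lookup(currency):
    return {"currency": currency, "rate": 1.25}


rate = call_with_logging("rates", fake_rate_lookup, currency="EUR")
```

Re-raising after logging keeps the wrapper transparent: it records evidence without changing the business flow.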

Best Practices of Logging

Effective logging is crucial for developing, debugging, and maintaining software applications. It provides insights into application behavior, aids in troubleshooting issues, and supports monitoring system health.

Principles of Log Recording

Isolation: Logging should not interfere with the normal operation of the system.
Security: The logging mechanism itself must be free from logical errors or vulnerabilities that could introduce security risks.
Data Privacy: It is imperative to avoid logging confidential or sensitive information, such as user contact details, identification numbers, tokens, etc.
Monitorability and Analyzability: Logs should be accessible for monitoring purposes and analyzable by systems to understand system behavior.
Traceability and Diagnosability: Log output should be meaningful and readable, enabling developers to trace and diagnose issues in production environments effectively.
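The data privacy principle can be partly enforced in code. The sketch below is one possible approach, not a complete solution: a logging filter that masks the values of a few key=value pairs before a record is written. The class name MaskingFilter and the key list (token, password, phone) are assumptions; a real system needs its own list.

```python
import logging
import re

# Assumed key names for illustration only.
_SENSITIVE = re.compile(r"(token|password|phone)=(\S+)", re.IGNORECASE)


class MaskingFilter(logging.Filter):
    """Replace the values of sensitive key=value pairs before output."""

    def filter(self, record):
        # Render the message once, mask it, and drop the now-unused args.
        message = record.getMessage()
        record.msg = _SENSITIVE.sub(r"\1=***", message)
        record.args = None
        return True  # never drop the record, only rewrite it


handler = logging.StreamHandler()
handler.addFilter(MaskingFilter())
auth_logger = logging.getLogger("auth")
auth_logger.addHandler(handler)
auth_logger.setLevel(logging.INFO)

auth_logger.info("login token=%s user=%s", "abc123", "bob")
```

Pattern-based masking is a safety net; the more reliable habit is still not to pass sensitive values to the logger in the first place.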

Logging Level Standards

Four logging levels are commonly used in daily development work, each suited to a different scenario:

DEBUG

The DEBUG level is primarily for outputting debugging information and is mainly used during the development and testing phases. Logs at this level should be as detailed as possible, allowing developers to record various detailed information for debugging purposes, including parameter data, debugging details, and return values. This level of detail aids in analyzing issues or exceptions that arise during development and testing.

INFO

The INFO level is used to log critical system information, aimed at capturing key operational metrics while the system is functioning correctly. Developers can log system initialization configurations, changes in business state, or core processes within user business workflows to INFO logs. This facilitates routine maintenance work and context scenario reproduction when tracing errors. It's recommended to set the logging level to INFO in the testing environment upon project completion, then review the INFO-level logs to understand the application's usage and determine if these logs provide useful information for troubleshooting.

WARN

The WARN level outputs content of a warning nature, indicating foreseeable and planned events, such as when a method's input parameter is null or does not meet the conditions for running the method. Detailed information should be logged at the WARN level to facilitate post-analysis of the logs.

ERROR

The ERROR level is intended for unpredictable events, such as errors or exceptions, for instance, exceptions caught in catch blocks related to network communication, database connections, etc. If an exception has a minor impact on the system's overall process, WARN level logging may be used. When logging at the ERROR level, aim to output as much data as possible, including method input parameters and objects generated during the method execution. When logging errors or exceptions, include the exception object in the output.
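A small sketch of the WARN and ERROR guidance above, assuming a hypothetical load_order function whose fetch argument stands in for a database call. The foreseeable case (a missing input) is logged at WARN, and the unexpected connection failure is logged at ERROR with the exception object attached.

```python
import logging

logger = logging.getLogger("orders")


def load_order(order_id, fetch):
    """Load an order via fetch(order_id); fetch is a stand-in for a DB call."""
    if order_id is None:
        # Foreseeable condition: the input does not meet the method's requirements.
        logger.warning("load_order skipped [order_id=%r]", order_id)
        return None
    try:
        return fetch(order_id)
    except ConnectionError:
        # exc_info=True attaches the exception and its traceback to the record.
        logger.error("load_order failed [order_id=%r]", order_id, exc_info=True)
        raise
```

Passing exc_info=True (or calling logger.exception inside the except block) is what puts the stack trace into the log output alongside the input parameters.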

WARN vs ERROR

When a method or function exhibits abnormal logic execution, it should be logged at either the WARN or the ERROR level. To decide between the two, compare the exception against the typical cases for each level:

Typical WARN Level Exceptions

Incorrect user input parameters
Non-core component initialization failure
Backend task processing failures (if retries exist and are successful, WARN is not necessary)
Repeated data insertion rejected by an idempotency check

Typical ERROR Level Exceptions

Application startup failure
Core component initialization failure
Database connection failures
Continuous failure in accessing external systems required by core business processes
Out Of Memory (OOM) errors

Avoid Overusing ERROR Level Logs. In systems configured with alerting, WARN level logs typically do not trigger alerts, whereas ERROR level logs are monitored and may even trigger phone call alerts. An ERROR level log should therefore signal a serious issue that requires immediate attention.
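The retry case from the WARN list can be sketched as follows, under the assumption that individual failed attempts are only worth a DEBUG line and that a WARN is emitted once, after every attempt has failed. The function name run_with_retries and its return shape are hypothetical.

```python
import logging

logger = logging.getLogger("tasks")


def run_with_retries(task, attempts=3):
    """Run task(); log WARN only if every attempt fails. Returns (ok, result)."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return True, task()
        except Exception as exc:  # broad on purpose: any failure counts as a failed attempt
            last_error = exc
            logger.debug("attempt %d/%d failed: %r", attempt, attempts, exc)
    # Reached only when no attempt succeeded.
    logger.warning("task failed after %d attempts [last_error=%r]", attempts, last_error)
    return False, None
```

A task that succeeds on its second attempt produces no WARN entry at all, which keeps alert-free levels quiet for problems that resolved themselves.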

Common Log Format

Summary Logs

2022-12-15 08:15:10,543 [1a34567930876254987632109xxxxx 0.6 - /// - ] INFO 
[DataHandlerPool-2-thread-21] - [(serviceInterface,actionMethod,2ms,Y,SUCCESS)
(applicationName,192.168.1.100,1671093310543,Y)]

This log entry includes:

Invocation Time
Log Trace ID (traceId, rpcId)
Thread Name
Interface Name
Method Name
Invocation Duration
Successful Invocation (Y/N)
Error Code
System Context Information (Invoking System Name, Invoking System IP, Invocation Timestamp, Under Load Test (Y/N))
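One way to produce a line in roughly this shape with Python's logging module is sketched below. The helper summary_payload and the format string are assumptions, and the " - /// - " segment of the sample is left out because its meaning is not specified in this article.

```python
import logging

# Layout loosely modeled on the sample above (assumed, not an official format).
SUMMARY_FORMAT = (
    "%(asctime)s [%(trace_id)s %(rpc_id)s] %(levelname)s "
    "[%(threadName)s] - %(message)s"
)


def summary_payload(interface, method, duration_ms, success, code,
                    caller_app, caller_ip, call_ts_ms, load_test):
    """Build the bracketed summary body from the fields listed above."""
    def flag(value):
        return "Y" if value else "N"

    call = f"({interface},{method},{duration_ms}ms,{flag(success)},{code})"
    context = f"({caller_app},{caller_ip},{call_ts_ms},{flag(load_test)})"
    return f"[{call}{context}]"


handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(SUMMARY_FORMAT))
summary_logger = logging.getLogger("summary")
summary_logger.addHandler(handler)
summary_logger.setLevel(logging.INFO)
summary_logger.propagate = False  # avoid duplicate output via the root logger

summary_logger.info(
    summary_payload("serviceInterface", "actionMethod", 2, True, "SUCCESS",
                    "applicationName", "192.168.1.100", 1671093310543, True),
    # trace_id and rpc_id are supplied per call through `extra`.
    extra={"trace_id": "1a34567930876254987632109xxxxx", "rpc_id": "0.6"},
)
```

Building the payload in one function keeps the field order identical on every call, which is what makes the line parseable later.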

Detailed Logs

Detailed logs supplement the summary logs with the additional business parameters needed for troubleshooting. A detailed log entry appends those parameters to the summary fields, for example:

2023-01-10 09:30:10,456 [1c89076234908765432109876xxxxx 0.5 - /// - ] 
INFO [WebReqHandler-3-thread-27] - [(serviceAPI,executeTask,2ms,Y,SUCCESS)
(webService,192.168.2.50,1673339410456,Y)(paramA,paramB)(xxxx)]

Operational Execution Logs

Operational execution logs are generated during the system's operational process, typically without a specific format. They are logs printed by developers to track the execution logic of the code. The following points should be considered when writing logs:

Is this log essential? Consider if not logging this would hinder troubleshooting in the future, and whether logging it might lead to excessive log output frequency, resulting in an overload of logs in production.
Is the log format distinctive? When monitoring or parsing these logs later, would it be challenging to distinguish them from other logs, or is there inconsistency in the log format with each output?
Does the log state its purpose? Describe the current execution step and include its key parameters, so that whoever maintains the system later can understand the entry without reading the code.

Suggested format:

[scene_bind_feature][feature_exists]Message[tagSource='MIF_TAG',tagValue='123']

# Example
[workflow_init_process][process_already_initialized] Process already initiated [workflowType='ONBOARDING', workflowId='789']
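A tiny helper can keep this format consistent across a codebase. The function name format_event is hypothetical, and the spacing follows the example line rather than the template line.

```python
import logging

logger = logging.getLogger("workflow")


def format_event(scene, result, message, **params):
    """Render "[scene][result] message [key='value', ...]" per the suggested format."""
    rendered = ", ".join(f"{key}='{value}'" for key, value in params.items())
    return f"[{scene}][{result}] {message} [{rendered}]"


logger.info(format_event(
    "workflow_init_process", "process_already_initialized",
    "Process already initiated",
    workflowType="ONBOARDING", workflowId="789",
))
```

Because the scene and result tags always come first and in the same brackets, a parser or an alert rule can match on them without caring about the free-text message.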

Common Best Practices

Do not interrupt the workflow: It's crucial to ensure that logging statements do not throw exceptions that could disrupt the business process. For example:

import logging

# Log information about a shop without letting a logging failure escape
def log_shop_info(shop):
    try:
        # 'shop' may be None or may lack the expected keys
        logging.info("Shop ID: %s, Name: %s", shop["id"], shop["name"])
    except (TypeError, KeyError):
        # TypeError: 'shop' is None; KeyError: 'id' or 'name' is missing
        logging.warning("Could not log shop information: shop is None or incomplete.")

Avoid using System.out.println() for logging output.
Avoid direct use of APIs from logging systems (Log4j, Logback): Calling the Log4j or Logback APIs directly couples the system code tightly to one logging implementation, which leads to significant refactoring costs if the implementation needs to be switched later. Prefer a logging facade such as SLF4J.
For logs at the trace, debug, and info levels, check whether the level is enabled before building a log message that is expensive to compute. For example:

import logging

logging.basicConfig(level=logging.INFO)  # the current level is INFO
logger = logging.getLogger(__name__)

# Check whether DEBUG is enabled before doing expensive work for the message
if logger.isEnabledFor(logging.DEBUG):
    expensive_computation = sum(range(1000))  # Placeholder for an expensive computation
    logger.debug("Result of expensive computation: %s", expensive_computation)

# Direct logging without a level check (INFO is enabled)
logger.info("This is an info level log.")

# This debug log is discarded because the current level is INFO
logger.debug("This debug message will not be printed.")

Avoid logging messages that lack meaning (void of business context or unassociated with a log trace ID).
Avoid logging duplicate messages.
Refrain from logging sensitive information.
Ensure the size of a single log entry is not too large.
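The size rule can be enforced mechanically rather than by convention. Below is a minimal sketch of a logging filter that cuts oversized messages and notes how much was dropped; the class name TruncatingFilter and the 2000-character default are assumptions chosen for illustration.

```python
import logging

MAX_MESSAGE_CHARS = 2000  # assumed limit for illustration


class TruncatingFilter(logging.Filter):
    """Shorten messages longer than `limit` characters."""

    def __init__(self, limit=MAX_MESSAGE_CHARS):
        super().__init__()
        self.limit = limit

    def filter(self, record):
        message = record.getMessage()
        if len(message) > self.limit:
            head = message[: self.limit]
            dropped = len(message) - self.limit
            record.msg = f"{head}... [truncated {dropped} chars]"
            record.args = None  # the message is already fully rendered
        return True  # keep the record, possibly shortened


handler = logging.StreamHandler()
handler.addFilter(TruncatingFilter(limit=20))
bulk_logger = logging.getLogger("bulk")
bulk_logger.addHandler(handler)
bulk_logger.setLevel(logging.INFO)

bulk_logger.info("payload=%s", "x" * 100)
```

Attaching the filter to the handler applies the limit to every logger that writes through it, so individual call sites do not need to remember the rule.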

Conclusion

In conclusion, effective logging practices are essential for the smooth operation, maintenance, and security of software applications. By adhering to key principles such as avoiding the logging of sensitive information, refraining from duplicating log entries, and ensuring log entries are concise and do not exceed practical size limits like 100KB, developers can create a logging system that is efficient, manageable, and useful.

Such practices not only optimize application performance but also enhance the ability of teams to monitor, troubleshoot, and analyze system behavior effectively. Ultimately, mindful logging contributes to better resource management, improved system reliability, and heightened security compliance, marking it as a critical aspect of software development and operations.
