Master’s degree, Perm National Research Polytechnic University, Perm, Russia
MONITORING AND LOGGING IN DISTRIBUTED SYSTEMS: APPLICATION OF OPENTELEMETRY AND THE ELK STACK
ABSTRACT
This article examines the approach to monitoring and logging in distributed systems using OpenTelemetry and the ELK stack. Particular attention is given to the issues of tracing and log correlation in microservice architecture. The methods for ensuring log coherence through the use of unified identifiers and the possibilities of integrating OpenTelemetry for automating the tracing process are analyzed. The process of log aggregation and anomaly detection in distributed systems is studied. Additionally, it explores performance monitoring based on distributed context, discussing key metrics and performance indicators, as well as the configuration of dashboards and alerts in the ELK stack.
Keywords: monitoring, logging, distributed systems, microservice architecture, log tracing, log correlation, OpenTelemetry, ELK stack.
Introduction
Modern distributed systems built on microservice architecture offer high flexibility and scalability but at the same time present multiple challenges in ensuring stability and performance. The greatest difficulties stem from the need for robust monitoring and logging, which are critical for observing system activity, detecting faults, and tuning application performance. The many microservice applications within such systems generate large volumes of logs that must be properly correlated, aggregated, and analyzed so that the infrastructure as a whole can operate reliably.
Tracing user requests and correlating logs become particularly critical tasks, as they allow the tracking of request flows across multiple services, identification of bottlenecks, and elimination of error sources. Additionally, key aspects of observability include log aggregation, anomaly detection, and performance control. To address these complications, modern tools such as OpenTelemetry, which provides standardized telemetry collection mechanisms, and the ELK (Elasticsearch, Logstash, Kibana) stack, which ensures efficient log storage, analysis, and visualization, are employed.
The goal of this research is to analyze the issues related to log tracing and correlation in microservice architecture and to explore methods for log aggregation, anomaly detection, and performance tracking based on distributed context. The relevance of the study is driven by the growing popularity of microservice approaches and the need to ensure their reliability and high performance. The research methods employed include a comparative analysis of existing monitoring tools, an examination of architectural features of systems built on OpenTelemetry and the ELK stack, and a discussion of practical scenarios for their application.
Research methodology
Tracing and log correlation in microservice architecture
Microservice architecture is one of the key approaches to building modern distributed systems. Its primary advantage lies in the ability to decompose complex applications into independent services, thereby simplifying their development, deployment, and scaling. However, this architectural style also introduces significant challenges in monitoring and logging. In particular, effective request tracing and log correlation become critically important tasks for maintaining system stability and ensuring the prompt resolution of failures.
Request tracing in microservice architecture refers to the process of tracking the path of a request as it passes through various services involved in its processing. Each service may generate its own logs. However, without a linking mechanism, it becomes exceedingly difficult to understand how individual events relate to one another. This lack of coherence complicates problem diagnosis, as it becomes challenging to pinpoint the exact location of a failure within the processing chain [1].
One of the fundamental obstacles in tracing is the absence of common identifiers that would allow events occurring across different parts of the system to be linked. When such identifiers are missing, engineers are forced to manually correlate logs based on timestamps or message content. The complexity is further heightened by the distributed nature of the system, where request processing may occur in parallel or exhibit a high degree of asynchrony.
Log correlation aims to consolidate log entries from various services into a coherent sequence that represents the complete path of a request. This process is particularly crucial for incident diagnosis and system performance analysis. However, log correlation encounters significant challenges due to data heterogeneity. Each service may adopt its own logging formats and message structures, complicating the alignment and interpretation of log records.
In order to handle these issues, the concept of distributed context is employed. Distributed context ensures the propagation of unique request identifiers across all processing layers, enabling the accurate correlation of logs and metrics. OpenTelemetry, an open standard for collecting telemetry data, plays a pivotal role in implementing this concept. It provides tools for the automatic collection of tracing data and their transmission to monitoring and analysis systems. By leveraging identifiers such as Trace ID and Span ID, OpenTelemetry makes it possible to trace the entire request path, from the entry gateway to the final service, offering a comprehensive view of system behavior and facilitating efficient issue resolution [2].
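To make the mechanism concrete, the following plain-Python sketch illustrates the idea behind distributed context propagation (it does not use the OpenTelemetry SDK itself). A Trace ID generated at the entry gateway is carried in a request header modeled on the W3C `traceparent` format, so every downstream service can attach the same identifier to its logs while starting its own child span; the helper names are illustrative.

```python
import secrets

def new_trace_context():
    # 16-byte trace id and 8-byte span id, hex-encoded as in W3C Trace Context
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def inject(headers, ctx):
    # The gateway propagates the context in a 'traceparent'-style header
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"
    return headers

def extract(headers):
    # A downstream service restores the same trace id and opens a child span
    _, trace_id, parent_span_id, _ = headers["traceparent"].split("-")
    return {"trace_id": trace_id,
            "span_id": secrets.token_hex(8),
            "parent_span_id": parent_span_id}

ctx = new_trace_context()          # created at the entry gateway
headers = inject({}, ctx)          # attached to the outgoing request
child = extract(headers)           # recovered by the next service in the chain
```

Because `child["trace_id"]` equals `ctx["trace_id"]`, log records emitted by both services can be joined into one request path, while `parent_span_id` preserves the parent-child structure of the trace tree.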
The integration of OpenTelemetry begins with the implementation of client-side and server-side libraries within the application code. These libraries automatically collect data on operation execution times and transmit them to monitoring systems. This approach minimizes the need for manual intervention within the logging process, thereby decreasing the likelihood of errors. OpenTelemetry also offers a standardized API, allowing developers to tailor data collection to a particular application's needs without compromising compatibility with other systems.
Another critical aspect of effective monitoring is the standardization of logging formats. The use of universal formats, such as JSON, simplifies log aggregation and subsequent analysis. Standardized data formats enable faster identification of relationships between events across various services, thereby accelerating problem diagnosis. The integration of OpenTelemetry with systems like the ELK stack further enhances visualization and analysis capabilities for correlated logs. Kibana from the ELK stack provides robust tools for creating interactive dashboards, facilitating real-time tracking of key performance metrics and comprehensive system performance analysis.
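As a minimal sketch of such standardization, the snippet below formats Python `logging` records as single-line JSON documents carrying a `trace_id` field, so an ELK pipeline can parse and correlate them without custom grok rules. The field names and the sample trace identifier are illustrative, not a prescribed schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line so Logstash/Elasticsearch can parse it."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The shared trace_id lets Kibana group this line with records from other services
log.info("order accepted",
         extra={"service": "orders",
                "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Because every service emits the same four fields, a single Kibana query on `trace_id` reconstructs the full processing chain of one request.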
Data security and system resilience are of paramount importance in distributed systems [3]. Given that OpenTelemetry involves the collection of substantial volumes of data, it is essential to ensure the protection of this information during transmission and storage. The use of reliable data transfer protocols and a well-designed access control system mitigates the risks of data breaches and guarantees the stability of monitoring operations.
Log tracing and correlation in microservice architecture are foundational processes for maintaining the reliability of distributed systems. The adoption of OpenTelemetry and standardized logging practices ensures the necessary transparency of system operations, simplifies problem diagnosis, and accelerates issue resolution.
Results and discussion
Log aggregation and anomaly detection in distributed systems
Log aggregation and anomaly detection are fundamental processes for ensuring the stability and security of distributed systems. In a microservice architecture, where each application generates a high volume of logs, managing this data is a complex task: the logs must be gathered and processed appropriately to identify deviations and threats that could compromise the performance and security of the system.
Log aggregation is the structured collection of logs from multiple sources into a single repository for subsequent analysis. A leading tool for log aggregation is the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. Elasticsearch provides fast indexing and storage of data, which is essential in scenarios with high log production. Logstash handles the collection, filtering, and normalization of logs into a common format, preparing them for examination. Kibana supports advanced visualization of the data. The stack's broad adoption across industries attests to its flexibility and its ability to handle massive data volumes while delivering the rapid analytical results needed to keep a system functioning optimally (fig. 1).
Figure 1. Distribution of ELK customers by products and services for hosted search worldwide in 2024 [4]
The capacity to analyze logs in real time is a critical requirement in distributed systems. To meet it, streaming solutions such as Kafka are employed, which handle high volumes of data with minimal latency. Real-time processing enables quick identification of faults and rapid reaction to emerging problems. In addition, a fault-tolerant and scalable architecture is a critical component of the log collection infrastructure: the volume of logs grows in direct proportion to system load, and without fault tolerance and scalability, rising load leads to breakdowns of analytical tools and monitoring activities.
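The normalization step performed by Logstash can be sketched in plain Python as follows: heterogeneous log lines from different services are mapped onto one common schema before indexing. Both input formats and all field names here are invented for illustration; a real pipeline would express the same logic as Logstash filter (e.g., grok) rules.

```python
import re
from datetime import datetime, timezone

# Two hypothetical source formats arriving at the aggregator
SPACE_DELIMITED = re.compile(r'(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)')
KEY_VALUE = re.compile(r'time=(?P<ts>\S+) severity=(?P<level>\w+) msg="(?P<msg>[^"]*)"')

def normalize(line, service):
    """Map any recognized format onto one schema, as a Logstash filter stage would."""
    for pattern in (KEY_VALUE, SPACE_DELIMITED):
        m = pattern.match(line)
        if m:
            return {
                "@timestamp": m.group("ts"),
                "level": m.group("level").upper(),
                "service": service,
                "message": m.group("msg"),
            }
    # Unparsed lines are kept rather than dropped, flagged for later inspection
    return {"@timestamp": datetime.now(timezone.utc).isoformat(),
            "level": "UNKNOWN", "service": service, "message": line}

doc = normalize('time=2024-05-01T12:00:00Z severity=error msg="db timeout"', "billing")
```

After normalization, records from every service share the same `level` and `@timestamp` fields, which is what makes cross-service queries and dashboards in Kibana straightforward.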
Anomaly detection in logs involves identifying deviations from the standard behavior of a system. Such deviations may indicate service failures, security breaches, or performance degradation. Various approaches currently exist for detecting anomalies (table 1).
Table 1.
Anomaly detection methods and their characteristics [5, 6]
Method | Description | Suitable scenarios | Advantages | Limitations
Statistical methods | Analysis of deviations from normal values. | Predictable systems, stable metrics. | Simple implementation, fast results. | May not detect complex anomalies.
Heuristic rules | Search for known error patterns. | Systems with known error types. | High accuracy for known issues. | Requires regular updates of rules.
Machine learning | Automatic detection of atypical events. | High-load and complex systems. | Detects hidden relationships. | Requires large data volumes and computing resources.
Log clustering | Grouping similar events in logs. | Analysis of atypical events in logs. | Helps identify hidden patterns. | Complex interpretation of results.
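The simplest of these approaches, the statistical method, can be sketched in a few lines: a metric sample is flagged as anomalous when it lies more than a chosen number of standard deviations from the mean. The window of latencies and the threshold of two standard deviations below are purely illustrative (for small samples the maximum possible z-score is bounded, so a threshold of 3 would miss even obvious spikes).

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Response times in ms; the 900 ms spike stands out against a stable baseline
latencies = [120, 125, 118, 130, 122, 900, 121, 119]
print(zscore_anomalies(latencies))  # → [(5, 900)]
```

As the table indicates, such a detector is fast and trivial to deploy but assumes a stable baseline; seasonal load patterns or gradual drift require the machine-learning or clustering approaches instead.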
The incorporation of OpenTelemetry with the ELK stack substantially improves the capabilities of anomaly detection. OpenTelemetry guarantees the collection of telemetry data, including metrics and traces, which are subsequently transferred to the ELK stack for storage and analysis. This integration allows anomalies identified in logs to be correlated with specific events or changes within the system.
Effective visualization is one of the critical components of anomaly detection. Kibana allows users to build comprehensive dashboards for live monitoring of key performance measures, enabling engineers to respond rapidly to emerging issues and minimize system outages. In addition, the configuration of automated notifications supports a more preventive approach, stopping failures before they impact end users.
Effective log management and anomaly detection are essential elements of distributed system administration. The combination of the ELK stack with OpenTelemetry forms a powerful toolkit for data collection, analysis, and visualization, allowing quick detection of issues, improving the system's resistance to failures, and ensuring that applications function as expected.
Performance monitoring based on distributed context
Performance monitoring in distributed systems is essential for ensuring their reliability, scalability, and stability. In the context of microservice architecture, where applications consist of numerous interacting services, effective monitoring becomes a complex challenge [7]. The primary objective of performance monitoring is the timely identification of issues that impact system performance and their resolution before they negatively affect end users.
Throughout this process, distributed context serves as the key instrument for building a complete picture at every level. By presenting a unified view of the interrelationships between services, distributed context helps detect performance bottlenecks, enables root cause analysis, and supports reliable long-term operation of complex distributed systems.
Monitoring begins with identifying the key metrics to be tracked on a regular basis. These typically include server response time, overall system throughput, CPU, memory, disk, and network utilization, and error frequency. Taken together, these metrics characterize the state of the system at any given moment and reveal deviations that may signal a fault.
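Two of the metrics listed above can be computed from raw request samples as sketched below: the error rate as the share of 5xx responses, and tail latency as a nearest-rank 95th percentile. The sample data and function names are illustrative only.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    s = sorted(samples)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[k]

def error_rate(status_codes):
    """Fraction of responses that were server errors (HTTP 5xx)."""
    return sum(1 for c in status_codes if c >= 500) / len(status_codes)

latencies_ms = [110, 95, 120, 130, 105, 98, 117, 250, 102, 111]
statuses = [200, 200, 500, 200, 200, 200, 200, 503, 200, 200]

print(percentile(latencies_ms, 95))  # → 250
print(error_rate(statuses))          # → 0.2
```

The percentile is deliberately preferred over the mean for latency: a single slow request (250 ms here) is visible in the p95 but would barely move an average.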
An essential aspect of effective performance monitoring is the configuration of alert systems that notify engineers of critical issues. Through leveraging data collected through OpenTelemetry, it is possible to set up triggers in Kibana, which automatically send notifications when key performance metrics reach critical thresholds. Furthermore, performance monitoring optimization involves workload forecasting and infrastructure adaptation. By analyzing historical system performance data, it becomes feasible to predict peak load periods and preemptively scale resources. The analysis of distributed context plays a crucial role in identifying services most sensitive to load increases, ensuring that these components receive additional attention during scaling efforts. Moreover, distributed context-based monitoring significantly contributes to the resilience of distributed systems [8].
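The core of such an alerting rule, regardless of whether it runs in Kibana's alerting engine or elsewhere, is a comparison of current metric values against configured thresholds. The sketch below models that logic in plain Python; the rule names, thresholds, and metric keys are hypothetical, not Kibana configuration syntax.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    threshold: float
    description: str

def evaluate(rules, current_metrics):
    """Return a notification message for every rule whose threshold is breached."""
    alerts = []
    for rule in rules:
        value = current_metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append(
                f"ALERT [{rule.metric}] {value} > {rule.threshold}: {rule.description}")
    return alerts

rules = [
    AlertRule("p95_latency_ms", 500, "response time degradation"),
    AlertRule("error_rate", 0.05, "elevated 5xx error rate"),
]
# Only the latency rule fires for this snapshot of current metrics
print(evaluate(rules, {"p95_latency_ms": 620, "error_rate": 0.01}))
```

In production such rules are typically evaluated over a sliding window rather than a single snapshot, precisely so that momentary spikes do not generate noise.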
Distributed context-based performance monitoring plays a crucial role in governing modern distributed systems. Such monitoring provides deep insight into inter-service dependencies, makes it possible to identify and correct bottlenecks effectively, and enhances overall infrastructure resiliency. Combining OpenTelemetry with the ELK stack provides the utilities required to support efficient monitoring and maintain high efficiency of application processes.
Conclusion
With the increased complexity of distributed systems and the expanded use of microservice architecture, monitoring and logging have gained prominence as key factors in ensuring stability and reliability. Effective request tracing and log correlation make it possible to track request flows across multiple services, helping to locate bottlenecks and accelerate failure resolution. Log aggregation and anomaly detection help identify deviations in system behavior, preventing critical failures and ensuring data security. Performance monitoring based on distributed context provides a comprehensive understanding of system processes, simplifying metric analysis and workload forecasting.
The application of observation and logging tools based on distributed context provides high performance and fault tolerance in distributed systems. Proper configuration of tracing, aggregation, and log analysis processes reduces application downtime and optimizes resource utilization, enhances user experience, and reduces infrastructure maintenance costs. The adoption of OpenTelemetry and the ELK stack unlocks new possibilities for in-depth system analysis, making it a cornerstone for the successful operation of modern high-load applications.
References:
- Ashok S., Harsh V., Godfrey B., Mittal R., Parthasarathy S., Shwartz L. TraceWeaver: Distributed Request Tracing for Microservices Without Application Modification // Proceedings of the ACM SIGCOMM 2024 Conference. 2024. P. 828-842. DOI: 10.1145/3651890.3672254
- Talaver V., Vakaliuk T.A. Telemetry to solve dynamic analysis of a distributed system // Journal of Edge Computing. 2024. Vol. 3(1). P. 87-109. DOI: 10.55056/jec.728 EDN: FDFNRH
- Israfilov A. The role of cloud technologies in building resilient cybersecurity systems // International Journal of Humanities and Natural Sciences. 2024. Vol. 9-2(96). P. 105-109. DOI: 10.24412/2500-1000-2024-9-2-105-109 EDN: OSQQSV
- Elastic stack (ELK) / 6sense // URL: https://6sense.com/tech/hosted-search/elastic-stack-elk-market-share (accessed: 19.02.2025).
- Fu Y., Yan M., Xu Z., Xia X., Zhang X., Yang D. An empirical study of the impact of log parsers on the performance of log-based anomaly detection // Empirical Software Engineering. 2023. Vol. 28(1). № 6. DOI: 10.1007/s10664-022-10214-6 EDN: AAKUBP
- Le V.H., Zhang H. Log-based anomaly detection with deep learning: How far are we? // Proceedings of the 44th International Conference on Software Engineering. 2022. P. 1356-1367.
- Dudak A. Microservice architecture in frontend development // Norwegian Journal of development of the International Science. 2024. № 145. P. 99-102. DOI: 10.5281/zenodo.14236033 EDN: DFRMBD
- Malygin D. S. Monitoring the Availability of a Web Service in Distributed Infocommunication Systems // International Research Journal. 2024. №. 3(141). DOI: 10.23670/IRJ.2024.141.31 EDN: OUGUEQ