Simplifying DevOps Monitoring with OpenTelemetry
As part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories, Dotan Horovits of Logz.io provides an in-depth overview of OpenTelemetry and how it can help simplify DevOps monitoring.
“Microservices” is the new norm for building products these days. It bears many advantages in accelerating development, but it also makes systems more complex to monitor. These systems are often polyglot, leveraging multiple programming languages, each having its own way of being monitored. Furthermore, today’s systems make extensive use of third-party tools and frameworks to accelerate development and keep the engineering team focused on the business differentiators. These third parties could be open-source projects, proprietary tools, or cloud services. And while we didn’t write these frameworks, we still need to effectively monitor them, to gain end-to-end observability across our system.
Observability is the paradigm that enables us to monitor and understand our systems. The formal definition, taken from Control Theory, describes observability as “a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”
In this article, I’d like to look into the various types of observability data we need to collect and how to simplify the collection process using the OpenTelemetry open-source project, a fascinating project under the Cloud Native Computing Foundation.
Observability Signal Types
Every system emits telemetry signals that help us understand its internal state. The classic signals are logs, metrics, and traces, often referred to as “the three pillars of observability.” Logs and metrics have been around for decades, while Distributed Tracing is younger but gaining strong momentum. Other signal types, such as continuous profiling, are emerging but are much earlier in the adoption cycle. Observability data also stems from a large variety of sources, both in the application and in the infrastructure. A typical system may consist of a Python frontend, a Java backend, some databases, and a messaging service such as Kafka (or its cloud service equivalents), each emitting its own logs, metrics, and other telemetry data in need of ingestion.
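To make the three signal types concrete, here is a minimal Python sketch of what a log record, a metric data point, and a trace span might carry. The class and field names are illustrative assumptions for this article, not the OpenTelemetry data model or API. The point is only that spans emitted by different services share a trace ID, which is what lets a backend stitch a single request back together.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LogRecord:
    """A timestamped event, usually free text plus a severity."""
    timestamp: float
    severity: str
    body: str


@dataclass
class MetricPoint:
    """A numeric measurement sampled at a point in time."""
    name: str
    value: float
    timestamp: float
    attributes: dict = field(default_factory=dict)


@dataclass
class Span:
    """One timed operation within a distributed trace."""
    trace_id: str                  # shared by every span of the same request
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    service: str
    name: str
    start: float
    end: float


def services_in_trace(spans, trace_id):
    """Return the services one request passed through, ordered by span start time."""
    matching = sorted((s for s in spans if s.trace_id == trace_id), key=lambda s: s.start)
    seen = []
    for span in matching:
        if span.service not in seen:
            seen.append(span.service)
    return seen


# Hypothetical telemetry from a Python frontend and a Java backend.
log = LogRecord(timestamp=100.0, severity="ERROR", body="checkout failed")
metric = MetricPoint(name="http.requests", value=1.0, timestamp=100.0,
                     attributes={"service": "python-frontend"})
spans = [
    Span("t1", "b", "a", "java-backend", "charge card", start=100.2, end=100.7),
    Span("t1", "a", None, "python-frontend", "POST /checkout", start=100.0, end=100.9),
    Span("t2", "c", None, "python-frontend", "GET /health", start=101.0, end=101.1),
]

path = services_in_trace(spans, "t1")
print(path)
```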
Collecting such heterogeneous data in a unified manner has been a challenge for many years. Proprietary solutions provided by vendors failed to keep up with the ever-expanding ecosystem of third-party tools and the advancements in programming languages. Furthermore, each tool and each vendor had its own proprietary APIs and SDKs for instrumenting the application code, proprietary agents to collect and process the data, and a proprietary protocol for transmitting the telemetry to the analytics backend. That effectively created data silos that prevented full observability. We needed to take a different approach and look at observability as a data analytics problem.
Unified Data Collection with OpenTelemetry
The open-source community that brought us Kubernetes and the cloud native ecosystem has also delivered open-source tools and standards to monitor them. One important project under the Cloud Native Computing Foundation (CNCF) is OpenTelemetry, an observability framework that assists in generating and collecting telemetry data from cloud-native software.
Let’s take a closer look at what the OpenTelemetry framework provides us. For each programming language, OpenTelemetry offers a single API and SDK (i.e., client library) for instrumenting the application. It also provides a unified Collector that can collect telemetry from multiple sources, whether your application or infrastructure components, over various protocols. The Collector then processes the telemetry data and exports it to any observability analysis backend tool or downstream system over multiple protocols. It’s important to note that the OpenTelemetry project does not provide the analysis backend nor take any stake in it. Last but not least, OpenTelemetry offers a unified protocol, OTLP, for transmitting logs, metrics, and trace data.
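The Collector’s receive, process, and export flow can be pictured with a small model. The following is a conceptual Python sketch, not the Collector’s implementation or API: the function names, the dictionary-based telemetry items, and the two in-memory lists standing in for backends are all hypothetical.

```python
def apply_processors(item, processors):
    """Run one telemetry item through each processor; None means it was dropped."""
    for process in processors:
        item = process(item)
        if item is None:
            return None
    return item


def run_pipeline(receivers, processors, exporters):
    """Collect from every receiver, process each item, fan out to every exporter."""
    for receive in receivers:
        for item in receive():
            processed = apply_processors(item, processors)
            if processed is None:
                continue
            for export in exporters:
                export(processed)


# Hypothetical receivers: each returns the items gathered from one source.
def frontend_receiver():
    return [
        {"signal": "traces", "source": "python-frontend", "name": "POST /checkout"},
        {"signal": "logs", "source": "python-frontend", "severity": "DEBUG", "body": "cache hit"},
    ]


def kafka_receiver():
    return [{"signal": "metrics", "source": "kafka", "name": "consumer_lag", "value": 42}]


# Hypothetical processors: enrich every item, and drop noisy debug logs.
def add_environment(item):
    return {**item, "environment": "production"}


def drop_debug_logs(item):
    if item["signal"] == "logs" and item.get("severity") == "DEBUG":
        return None
    return item


# Two in-memory lists stand in for analysis backends.
backend_a, backend_b = [], []

run_pipeline(
    receivers=[frontend_receiver, kafka_receiver],
    processors=[add_environment, drop_debug_logs],
    exporters=[backend_a.append, backend_b.append],
)

for item in backend_a:
    print(item["signal"], item["source"], item["environment"])
```

Because the sources and the backends only meet inside the pipeline, sending the same data to an additional backend means adding one exporter rather than re-instrumenting each source.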
OpenTelemetry isn’t just another open-source project. In fact, it’s the second most active project in the entire CNCF, after Kubernetes. All the monitoring and observability vendors, as well as the major cloud providers, have started to align with OpenTelemetry, even at the cost of sunsetting lucrative proprietary agents. This is a positive sign that OpenTelemetry is set to become the new de facto standard for telemetry collection. Gartner’s recent Hype Cycle for Emerging Technologies (2022) even listed OpenTelemetry in the Innovation Trigger phase, estimating that it will reach the plateau of the hype cycle within 2-5 years.
OpenTelemetry is a relatively young project, but it is already generally available for use with Distributed Tracing telemetry data and is in the Release Candidate stage (the step before general availability) for use with Metrics data. The least evolved of the signals is Logs, which will not be generally available before 2023. OpenTelemetry’s roadmap also looks into additional signals beyond the “three pillars,” with Continuous Profiling being the first one.
Getting Started with OpenTelemetry
OpenTelemetry is not a single monolithic project but rather a collection of projects run by multiple working groups. This can make starting with OpenTelemetry quite confusing. When you start with OpenTelemetry, it’s important to know your tech stack.
Start with these four questions, which will direct you to the components relevant to your system:
Which programming languages?
Which programming languages do you use? That will determine the OpenTelemetry APIs and SDKs relevant for you, and potentially also auto-instrumentation agents. Go the extra mile and determine the programming frameworks you use with each language to see which integrations exist for them.
Which signal types and protocols?
Next, decide which observability signals you intend to ingest, as this determines the relevant Collector receivers you’d be using. Start by figuring out which signal types are of interest to you, among traces, metrics, and logs. Also, see which telemetry protocols you should support. This is especially important with brownfield projects, where existing and potentially legacy components already emit telemetry in specific protocols that you need to support.
Which infrastructure components?
Next, list the sources of the signals, that is, the components you monitor. Many infrastructure components emit telemetry in their own formats, and designated receivers exist to ingest them. This is true for open-source tools such as Kubernetes, Kafka, MySQL, and Apache HTTP Server (httpd). It is also true for cloud services such as AWS X-Ray and Google Cloud Pub/Sub, and even for existing telemetry collectors such as CollectD or StatsD.
Which backend analytics tools?
Lastly, you’d need to define which tool stack you intend to use for running analytics on your telemetry data. It may be an open-source tool, a proprietary one, or a cloud service. It could also be another downstream system that receives the data and processes it. This will help you determine the relevant Collector exporters you’d be using.
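The answers to these four questions map fairly directly onto a Collector configuration, which declares receivers, processors, and exporters and then wires them into one pipeline per signal type. The sketch below builds that layout as a Python dictionary. The helper and class names, the component settings, and the endpoint are hypothetical. The component names (otlp, mysql, statsd, batch) are meant to match existing Collector components, but check the Collector documentation for which ones your distribution ships and for the settings each one requires.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A receiver or exporter, with the signal types it can handle."""
    name: str
    signals: set
    settings: dict = field(default_factory=dict)


def build_collector_config(receivers, exporters, wanted_signals):
    """Wire each wanted signal to the receivers and exporters that support it."""
    pipelines = {}
    for signal in wanted_signals:
        pipelines[signal] = {
            "receivers": [r.name for r in receivers if signal in r.signals],
            "processors": ["batch"],
            "exporters": [e.name for e in exporters if signal in e.signals],
        }
    return {
        "receivers": {r.name: r.settings for r in receivers},
        "processors": {"batch": {}},
        "exporters": {e.name: e.settings for e in exporters},
        "service": {"pipelines": pipelines},
    }


ALL_SIGNALS = {"traces", "metrics", "logs"}

# Questions 1-3: instrumented applications send OTLP, while infrastructure
# components get their own receivers (their settings are omitted for brevity).
receivers = [
    Component("otlp", ALL_SIGNALS, {"protocols": {"grpc": {}, "http": {}}}),
    Component("mysql", {"metrics"}),
    Component("statsd", {"metrics"}),
]
# Question 4: a single backend that accepts OTLP (placeholder endpoint).
exporters = [Component("otlp", ALL_SIGNALS, {"endpoint": "backend.example.com:4317"})]

config = build_collector_config(receivers, exporters, wanted_signals=["traces", "metrics"])
for signal, pipeline in config["service"]["pipelines"].items():
    print(signal, pipeline["receivers"], "->", pipeline["exporters"])
```

In this sketch the traces pipeline is fed only by the OTLP receiver, while the metrics pipeline also picks up the MySQL and StatsD receivers, which mirrors how each answer narrows the set of components you need.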
Endnote
OpenTelemetry is a young but promising project that is set to become a de facto standard. It promises a unified way of generating and collecting observability data. Having this project under the wings of the Cloud Native Computing Foundation, alongside Kubernetes, Prometheus, and other leading projects in the space, will further facilitate the collaboration needed to ensure compatibility across the stack.