How to Get the Most Out of Prometheus Kubernetes Monitoring
This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, Pepperdata CEO Maneesh Dhir offers his take on how to get the most out of Prometheus Kubernetes monitoring right now.
The growth of container-based infrastructures in the cloud, like Kubernetes, has required developers and IT Ops to approach monitoring, logging, and debugging in new ways. Migrating to Kubernetes often introduces a number of challenges as enterprises grapple with the wide variety of new services, requests, and virtual network addresses that are often ephemeral and tend to generate a huge volume of metrics, especially in high-churn environments where new pods are being created and moved frequently.
The Emergence of Prometheus for Kubernetes Monitoring
Prometheus was initially developed by SoundCloud as an in-house monitoring solution. At its core, Prometheus is a time-series database, but its key advantage is its pull model: it scrapes metrics directly from the services it monitors. Today, Prometheus offers users an open-source, robust, simple, and scalable solution for extremely fast reporting of time-series data sets.
Capable of reliably handling billions of metrics generated by dynamic cloud environments, Prometheus has become the de facto standard monitoring technology for Kubernetes environments. Prometheus provides reliable, responsive, and flexible monitoring capabilities. This is crucial, as more than 77 percent of businesses have migrated 50 percent or more of their workloads to Kubernetes in 2021, according to a 2021 poll by big data optimization software vendor Pepperdata.
The growth of Prometheus has been driven not only by the growth of Kubernetes but also by the emergence of a DevOps culture. Before DevOps became mainstream, monitoring covered hosts, networks, and services. The advent of DevOps highlighted the need for monitoring to be democratized, made easily accessible, and expanded to cover more layers of the stack. Today, because developers actively participate in the CI/CD pipeline and perform operations debugging on their own, they need the ability to quickly integrate data from apps and other business-related metrics.
Strengths of Prometheus
Prometheus offers many benefits and advantages to DevOps over alternative monitoring systems:
- High Availability: Prometheus was engineered with high availability and reliability at its core. It was designed as the go-to tool to help developers understand why other components are failing. Prometheus nodes in a cluster are fully autonomous and do not depend on remote storage; as a result, Prometheus can maintain a high level of uptime.
- Responsiveness: Prometheus excels at real-time monitoring, giving developers a picture of what has happened and what is currently happening inside a system, down to the last few seconds. Prometheus can be used to power highly responsive alerting systems and give developers the metrics they need to debug errors in real time.
- Flexibility: Prometheus is packaged with a querying language called PromQL that allows developers to select and aggregate time series data in real-time. These query results can be integrated into alerting systems or advanced graphing systems so that developers can easily understand and visualize the results of their monitoring efforts.
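As a rough illustration of the PromQL point above, the sketch below builds the URL for an instant query against the Prometheus HTTP API (the /api/v1/query endpoint). The server address and the metric name are assumptions chosen for illustration; container_cpu_usage_seconds_total is only present if cAdvisor metrics are being scraped in your cluster.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for an instant query against the Prometheus HTTP API."""
    # urlencode percent-escapes the PromQL expression for use in a query string.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical server address and query, for illustration only.
PROMETHEUS_URL = "http://localhost:9090"
QUERY = "sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))"

url = build_instant_query_url(PROMETHEUS_URL, QUERY)
print(url)
```

The query selects a counter, converts it to a per-second rate over a five-minute window, and aggregates the result per namespace. The same expression could be pasted into the Prometheus UI or used as the basis of an alerting rule.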
Making the Most of Prometheus
Prometheus has its roots in microservices, which tend to be always on. Prometheus wasn’t designed for workloads where there is a constant enabling and disabling of nodes, for example, batch workloads like Spark. Such workloads comprise many of today’s enterprise applications. According to the 2021 Big Data on Kubernetes Report, Spark in particular is one of the first application types that people migrate to Kubernetes.
To get the most out of Prometheus in a batch workload environment, you will want to take all of the following factors into consideration. You might need to design processes to accommodate them or source additional solutions to bridge these gaps in your DevOps organization:
Lack of Long-Term Data
Prometheus was not designed to store data for long periods of time. The default retention window is just 15 days, making it difficult to recover data older than that in Prometheus. This can create challenges when looking back over history to uncover the root cause of past incidents, or for examining trends in incidents over time. In addition to the lack of long-term data, Prometheus data by default does not have any backup and is thus not fault-tolerant out of the box. Thanos is an open-source project that attempts to support long-term storage and scale for Prometheus. Thanos does this by dividing data into partitions. However, if one partition grows too large, Thanos requires expensive re-partitioning, which can be challenging and error-prone to implement.
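The retention window can be lengthened with the --storage.tsdb.retention.time flag, but local disk usage grows with it. The sketch below applies the rough sizing formula from the Prometheus storage documentation (retention time in seconds, times ingested samples per second, times bytes per sample). The ingestion rate of 100,000 samples per second and the figure of 2 bytes per sample are assumptions for illustration, not measurements from any real deployment.

```python
def estimate_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,  # assumed; real compression varies by workload
) -> float:
    """Rough local-storage estimate: retention seconds * ingest rate * sample size."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Compare the default 15-day window with longer, hypothetical windows.
for days in (15, 90, 365):
    gib = estimate_disk_bytes(days, samples_per_second=100_000) / 2**30
    print(f"{days:>3} days: ~{gib:.0f} GiB")
```

The estimate scales linearly with retention, which is one reason teams that need months of history tend to look at remote or object storage rather than simply raising the retention flag.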
Basic Spark Metrics
Another strength of Prometheus is the huge variety of metrics it offers out of the box. With regard to Spark metrics in particular, however, Prometheus relies on the relatively limited set of Spark core metrics, which lack insight into the performance of Spark stages, and understanding stage performance is crucial when troubleshooting Spark applications.
Challenges in Joining Metrics
By joining the many metrics Prometheus provides, developers can create interesting and sophisticated alerts. However, the process of joining metrics can be time-consuming. One solution would be to pre-populate joined metrics in advance to accelerate the process. However, rules for pre-populating joined metrics in Prometheus must be added manually and can be challenging for those who are new to or inexperienced with this.
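In Prometheus, pre-populated joins of this kind are usually expressed as recording rules. The sketch below assembles one such rule as a Python dictionary whose structure mirrors the rule file format (groups, name, rules, record, expr) and prints it as JSON; in practice the rule would be written as YAML and could be validated with promtool check rules. The group name, the recorded series name, and the choice of metrics (container_cpu_usage_seconds_total from cAdvisor and kube_pod_info from kube-state-metrics) are assumptions for illustration.

```python
import json


def recording_rule_group(group_name: str, record: str, expr: str) -> dict:
    """Build a structure mirroring a Prometheus rule file with one recording rule."""
    return {
        "groups": [
            {
                "name": group_name,
                "rules": [{"record": record, "expr": expr}],
            }
        ]
    }


# Join per-pod CPU usage with pod metadata to attach the node label.
# Both metric names are assumptions about what the cluster exports.
JOIN_EXPR = (
    "sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))"
    " * on (namespace, pod) group_left (node) kube_pod_info"
)

rules = recording_rule_group(
    group_name="joined_metrics",  # hypothetical group name
    record="namespace_pod_node:container_cpu_usage_seconds:rate5m",  # hypothetical series name
    expr=JOIN_EXPR,
)
print(json.dumps(rules, indent=2))
```

Once a rule like this is loaded, Prometheus evaluates the join on a schedule and stores the result as a new series, so dashboards and alerts can query the precomputed name instead of repeating the join each time.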
Basic User Interface
Prometheus was designed to power visualization systems, so its own user interface is designed primarily for advanced developers who know exactly what they are looking for, not necessarily operations teams troubleshooting an unknown issue. The Prometheus UI also lacks some business-oriented features like role-based access control (RBAC) to create permissions around which users can access which data. Prometheus does recommend third-party UI solutions such as Grafana, which does support RBAC.
In fact, approximately 75 percent of Prometheus users also use Grafana, and 67 percent of Grafana users use Prometheus. However, Grafana can be challenging to set up because all the metrics and RBAC permissions must be configured first, meaning Grafana can’t be used right out of the box. The lack of readily available enterprise-grade features is a key reason that some enterprises have been hesitant to roll out Prometheus in a widespread way inside their organizations.
To summarize, although Prometheus functions as a great out-of-the-box platform for smaller deployments, offering reliable, responsive, and flexible monitoring capabilities, running it at scale can result in a number of uniquely difficult challenges, particularly for large-scale batch workloads on Kubernetes.
Prometheus Plus Optimization
Let’s assume your DevOps team implements a monitoring system like Prometheus. Once your deployment is stable and you start to monitor your spend, you may discover your cloud bill increasing tangibly month over month. Many of Pepperdata’s customers, which are some of the largest enterprises in the world, come to us having been through a similar journey.
That’s when you realize you need to move to the next level, which is cost optimization. Now there’s a fork in the road for your business.
For some, the preferred next step is to start implementing recommendations for cost savings and efficiency. Many monitoring and observability systems produce such recommendations, but they typically require skilled and highly knowledgeable resources to implement them and then tune them once implemented.
Many other enterprises run into constraints: limited staffing, a lack of in-depth knowledge across the vast repertoire of new and emerging big data technologies, and the sheer number of hours in the week. These constraints make implementing and tuning such recommendations challenging. That’s where they turn to autonomous optimization.
By implementing autonomous optimization, you can realize greater cost savings and accelerate your business goals without increasing your people pool or other investments.
Choose the Best Solution for Your Environment
The ecosystem for Kubernetes tools is vast and expanding, and it can be confusing. Choosing the right Kubernetes monitoring tools is crucial to optimizing the performance of your stack and maximizing ROI out of your new investment. Without the proper solutions on your stack, DevOps may be missing out on resiliency, better performance, simplified operations, and lower costs.
Prometheus provides highly reliable and detailed metrics focused on near-real-time performance. Its strengths lie in its extremely high availability, responsiveness, and flexibility, and, as a result, it is considered the de facto monitoring system for Kubernetes. Challenges with data retention, joining metrics, and enterprise-grade visibility can emerge when addressing the unique needs of batch workloads like Spark. Thus, many enterprises choose to augment Prometheus with autonomous optimization for more robust support of batch workloads on Kubernetes.