In this post, we will delve into setting up CloudWatch alerts for specific scenarios while managing a production AWS EKS cluster, highlighting key insights and practical solutions.
Co-authored by Saurabh Hirani
Reviewed and edited by Spandan Ghosh
Our story begins with a team tasked with setting up CloudWatch alerts for a production AWS EKS (Elastic Kubernetes Service) cluster. The primary goal was to ensure that the cluster nodes were always in a "ready" state, capable of accepting workloads.
The Initial Challenge
The team began by leveraging AWS Container Insights, which collects, aggregates, and summarizes metrics and logs from containerized applications and microservices. Since the customer did not have Prometheus set up and preferred to use AWS CloudWatch metrics for all alerts, we focused on AWS Container Insights metrics.
We concentrated on the node_status_condition_ready metric described here, which indicates whether a node is ready to accept workloads. If a node is ready, the value of this metric is 1; otherwise, it is 0.
Our initial approach was influenced by PromQL, which we had used previously. This led to some unexpected challenges as we transitioned to CloudWatch.
Static Nodes in Staging
In the staging environment, the team had a fixed set of 3 EKS nodes. This made it easier to test our alerting strategy. We created individual alarms for each node, triggering an alert if the node_status_condition_ready metric dropped below 1.
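For illustration, one such per-node alarm could be created with boto3 roughly as follows. The cluster name, node name, instance ID, and the exact dimension set are placeholders and should match the dimensions your Container Insights metrics are actually published with:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# One alarm per node: fire when the node's readiness metric stays below 1.
# All dimension values below are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="eks-staging-node-1-not-ready",
    Namespace="ContainerInsights",
    MetricName="node_status_condition_ready",
    Dimensions=[
        {"Name": "ClusterName", "Value": "staging-eks"},
        {"Name": "NodeName", "Value": "ip-10-0-1-23.ec2.internal"},
        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
    ],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
)
```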
Alarm Scenario: If 2 out of 3 nodes became not ready, 2 separate alarms would be triggered, reflecting the issue.
While this approach worked well for a small, static setup, it was clear that it wouldn't scale well for larger, dynamic environments.
Dynamic Nodes in Load Testing
Before moving to production, we tested our strategy on a load-testing cluster with auto-scaling enabled. Here, the previous approach of setting individual alarms for each node became impractical due to the dynamic nature of the nodes.
This is where CloudWatch Metric Insights proved valuable. We used a query to retrieve the readiness status for all nodes dynamically:
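A query along the following lines (the cluster name is a placeholder) returns the readiness metric as one time series per node:

```sql
SELECT AVG(node_status_condition_ready)
FROM "ContainerInsights"
WHERE ClusterName = 'prod-eks'
GROUP BY NodeName
```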
We were particularly interested in identifying nodes with a readiness value of 0, so we used:
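One way to do this (a sketch, again with a placeholder cluster name) is to use MIN as the statistic, so that a node which was not ready at any point in the period shows a value of 0:

```sql
SELECT MIN(node_status_condition_ready)
FROM "ContainerInsights"
WHERE ClusterName = 'prod-eks'
GROUP BY NodeName
```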
However, this produced a multi-time series output, which needed to be aggregated into a single time series for CloudWatch alarms.
When configuring alarms, CloudWatch offers aggregation functions such as AVG, MIN, MAX, and SUM.
These can be used to consolidate the multi-time series data into a single time series for the alarm.
We used the MIN(MIN) approach to create a single time series indicating whether at least one node was not in the "Ready" state.
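As a sketch, with the grouped MIN query above registered as q1 in the alarm's metric list, a metric math expression collapses it into one series:

```
MIN(q1)
```

This collapses to 0 whenever at least one node reports 0, so the alarm condition is simply "less than 1".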
Alarm scenario: If at least one node is not ready, the alarm will trigger.
Pros: No need to specify a static list of nodes, making it suitable for dynamic environments.
Cons: Once the alarm triggers, the operator must investigate further in the CloudWatch dashboard to determine how many nodes are impacted. This adds a level of indirection: the criticality of the situation can only be assessed after receiving the alert, which can lead to unnecessary wake-up calls for issues that are not severe enough to warrant immediate action.
Attempt at Improving Auto-Scaled Nodes Alerting
To create a more balanced alerting strategy that considered the proportion of failing nodes, we needed to answer the question: "One node out of how many?"
By calculating the count, or better yet the percentage, of failing nodes, we could tailor our alerting response more effectively. This approach allowed us to differentiate between scenarios like 1 out of 2 nodes failing versus 1 out of 100, leading to more appropriate responses and reducing alert fatigue.
After consulting with AWS, we found a more refined solution: using Metric Insights to calculate the percentage of faulty nodes.
This is done using the AVG(SUM) approach.
Metrics Insights query:
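A sketch of such a query (placeholder cluster name; note that the FROM clause does not yet use SCHEMA, which becomes important below):

```sql
SELECT SUM(node_status_condition_ready)
FROM "ContainerInsights"
WHERE ClusterName = 'prod-eks'
GROUP BY NodeName
```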
Then apply the AVG(SUM) approach:
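With the grouped query above registered as q1, a metric math expression produces a single series that can be alarmed on:

```
AVG(q1)
```

This yields the fraction of nodes reporting ready; 100 * (1 - AVG(q1)) expresses the same thing as the percentage of faulty nodes.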
This seemed logical enough, but when we tested it we ran into an interesting limitation.
The Aggregation Conundrum
When we delved deeper, we encountered an unexpected issue.
Metrics showing SUM = 8
This was perplexing.
How can the sum of metrics for 4 nodes be 8 when the maximum sum should be 4?
Metrics showing AVG = 1.6
How can the average value be 1.6 when the maximum possible average for 4 nodes is 1?
The root cause lay in CloudWatch Metrics Insights' behavior.
When aggregating metrics, CloudWatch includes a group called "Other," which represents the overall sum of all values. This "Other" group is factored into the SUM, causing unexpectedly high values and skewed averages.
Workaround:
To correct this, we realized that we should be specific about the SCHEMA used in the query. To exclude the Other group, we changed the FROM clause of our query, as sketched below.
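With the same placeholder cluster name as before, and assuming the metric's full dimension set is ClusterName, InstanceId, NodeName, the change looks roughly like this.

Before (matches every metric with this name in the namespace, which produces the extra "Other" group):

```sql
SELECT SUM(node_status_condition_ready)
FROM "ContainerInsights"
WHERE ClusterName = 'prod-eks'
GROUP BY NodeName
```

After (restricted to the exact dimension set, so only the per-node series match):

```sql
SELECT SUM(node_status_condition_ready)
FROM SCHEMA("ContainerInsights", ClusterName, InstanceId, NodeName)
WHERE ClusterName = 'prod-eks'
GROUP BY NodeName
```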
This gave us the right aggregation value when we applied AVG on it: we got the correct value of 1 instead of the invalid 1.6 seen earlier.
Handling Non-Boolean Metrics
The above scenario had the advantage of working with boolean values, i.e., the value of the metric was either 1 or 0. If you need to alert based on non-boolean metrics, such as CPU utilization, the approach becomes more complex. For example, to raise an alarm if more than 50% of nodes have CPU utilization greater than 80%, you cannot directly use the SUM and AVG method:
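For example, a query like the following (placeholder cluster name, using the Container Insights node_cpu_utilization metric) returns per-node CPU values, not a per-node 0/1 indicator:

```sql
SELECT SUM(node_cpu_utilization)
FROM SCHEMA("ContainerInsights", ClusterName, InstanceId, NodeName)
WHERE ClusterName = 'prod-eks'
GROUP BY NodeName
```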
This query would sum up CPU utilization values rather than count the number of nodes exceeding the threshold.
Final Solution
Given the complexities and limitations encountered, we arrived at a multi-faceted final solution tailored to different scenarios.
1. For static nodes - use per-node alarms
In environments with a small, static number of nodes, setting individual alarms for each node is feasible. This approach works well when the number of nodes is fixed and manageable.
Alarm Scenario: If 2 out of 3 nodes become not ready, two separate alarms will be triggered, reflecting the issue.
2. For auto-scaled nodes - use Metrics Insights Query with MIN aggregator to alert for at least 1 node failing
For dynamic environments where nodes are scaled in and out, using CloudWatch Metric Insights with a GROUP BY clause is more practical.
3. For auto-scaled nodes - use Metrics Insights Query with AVG(SUM) aggregator
4. Handling non-boolean metrics
This is not possible, given that the AVG(SUM) aggregation works on metric values and not on metric counts.
5. If we were using PromQL
Using Prometheus would simplify such queries significantly. For instance, to calculate the percentage of nodes that are not ready:
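For example, assuming kube-state-metrics' kube_node_status_condition metric is available, something along these lines:

```
100 *
  sum(kube_node_status_condition{condition="Ready", status="false"})
/
  count(kube_node_status_condition{condition="Ready", status="true"})
```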
A similar query would work for counting nodes with CPU utilization greater than a specific threshold:
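Again as a sketch, this time assuming node_exporter's node_cpu_seconds_total metric:

```
count(
  (100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
)
```

Dividing by the total node count turns this into a percentage, just like the readiness query above.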
6. Custom Metric Solution in CloudWatch
To achieve similar functionality in CloudWatch, you would need to create a custom solution using a Lambda function:
Query CPU Utilization: The Lambda function would query the CPU utilization metrics.
Emit Custom Metric: For each node, emit a custom metric cpu_util_gt_80 with a value of 1 if the CPU utilization is greater than 80%, otherwise 0.
Use AVG(SUM): Use the AVG(SUM) approach on the custom metric to calculate the percentage of nodes exceeding the threshold (see the sketch below).
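A minimal sketch of such a Lambda, assuming the Container Insights node_cpu_utilization metric, a placeholder cluster name, and a hypothetical custom namespace Custom/EKS:

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

CLUSTER_NAME = "prod-eks"          # placeholder
CUSTOM_NAMESPACE = "Custom/EKS"    # hypothetical custom namespace
CPU_THRESHOLD = 80.0


def handler(event, context):
    now = datetime.datetime.utcnow()

    # 1. Fetch per-node CPU utilization for the last 5 minutes using a
    #    Metrics Insights expression (one time series per node).
    response = cloudwatch.get_metric_data(
        MetricDataQueries=[
            {
                "Id": "q1",
                "Expression": (
                    "SELECT AVG(node_cpu_utilization) "
                    'FROM SCHEMA("ContainerInsights", ClusterName, InstanceId, NodeName) '
                    f"WHERE ClusterName = '{CLUSTER_NAME}' "
                    "GROUP BY NodeName"
                ),
                "Period": 60,
            }
        ],
        StartTime=now - datetime.timedelta(minutes=5),
        EndTime=now,
        ScanBy="TimestampDescending",
    )

    # 2. For each node, emit cpu_util_gt_80 = 1 if its latest datapoint is
    #    above the threshold, otherwise 0.
    metric_data = []
    for series in response["MetricDataResults"]:
        if not series["Values"]:
            continue
        node_name = series["Label"]          # label carries the GROUP BY value
        latest_value = series["Values"][0]   # newest first due to ScanBy above
        metric_data.append(
            {
                "MetricName": "cpu_util_gt_80",
                "Dimensions": [{"Name": "NodeName", "Value": node_name}],
                "Value": 1 if latest_value > CPU_THRESHOLD else 0,
            }
        )

    if metric_data:
        cloudwatch.put_metric_data(Namespace=CUSTOM_NAMESPACE, MetricData=metric_data)
```

An AVG(SUM) alarm on the cpu_util_gt_80 custom metric then behaves like the readiness alarm described earlier.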
This approach adds complexity and incurs additional costs but provides the necessary flexibility for advanced alerting scenarios.
Conclusion and Next Steps
While CloudWatch Container Insights provides powerful tools for monitoring and alerting, it has limitations, especially in dynamic environments. By understanding these limitations and leveraging the right tools and workarounds, you can create effective alerting strategies.
If you're facing similar challenges or need expert guidance on setting up and managing your AWS EKS clusters with CloudWatch, we are here to help. Our team specializes in providing comprehensive solutions for AWS monitoring, logging, and alerting.
Contact us today to learn more about how One2N can help you streamline your AWS operations and ensure your applications are always running smoothly.
By leveraging our expertise, you can focus on what matters most – delivering high-quality applications and services to your users.