Observability is at the core of any modern Site Reliability Engineering (SRE) practice. When teams depend on people manually scanning dashboards to spot issues, they risk missing critical incidents. This is the story of how we turned a messy, manual, and reactive monitoring setup into an automated, organized, and scalable alerting framework built primarily on the Elasticsearch-Logstash-Kibana (ELK) stack and guided by GitOps principles.
Trial by fire - an unusual pattern spotted, at a scale too large to ignore!
The existing monitoring setup at our client’s infrastructure was fragmented across multiple tools and platforms. Logs and metrics were scattered between ELK, AppDynamics, and Prometheus, backed by human-led monitoring that required operators to manually inspect dashboards every 30 minutes and report spikes, drops, and flatline conditions. This revealed an alarming incident rate, driven by blind spots in alerting, the human inability to watch multiple dimensions across multiple dashboards at once, and the overwhelming cognitive load placed on operations teams while incidents are unfolding.
For the client we were working with, roughly 80% of the critical data, both application and system logs, flowed into Elasticsearch. This made Elasticsearch the clear priority for building an automated alerting system that could deliver the punch required.
The scale we were looking at was pretty staggering:
20+ Elasticsearch nodes managing over 100 terabytes of log data
4 TB of daily ingestion and rotation
400+ indices constantly churning out real-time data
1200+ Kibana dashboards
2000+ visualization panels
With dashboards being manually inspected and no structured automation in place, the risk of missing critical issues was so high that teams had to operate in shifts, watching attentively and reporting diligently. Relying solely on human monitoring to detect issues at this scale was unsustainable. Without intervention, business-impacting incidents would continue to slip through the cracks.
This is where One2N stepped in. Remember, systems at scale are our jam; let’s continue.
Understanding the Problem - Identifying Gaps in Alerting
To solve this effectively, we first needed to understand the problem from both a technical and an operational perspective. We asked some fundamental questions before attempting technical ones (like every sane engineer should):
What should be monitored?
What signals indicate real issues?
Who needs to be alerted, and how?
How can we reduce noise and focus on actionable insights?
What criteria or decision matrices are used to evaluate anomalies?
What became clear was that alerting wasn’t just about the tools; it was about how people interacted with the systems and with each other. This was a team that had never been exposed to technical troubleshooting and had relied only on spotting anomalies, so naturally they did not approach troubleshooting from that lens. While ELK was heavily relied upon for post-incident debugging and reporting issues, there was no structured mechanism to ensure that critical issues were proactively detected and responded to in a way that felt uniform across multiple occurrences (yes, we’re heading towards a runbook).
To streamline this, we first had to build an extremely clear picture of what needed to be monitored, how the operations team did it, and how our automation could do it. We introduced a deliberately simple, almost boring method that captured “what to alert on” in a shared Excel sheet. It may seem unsophisticated, but it served as an effective collaborative contract with the team, a baseline agreement between engineering and operations that truly made the difference.
For every alert use case the team had in mind, this is how we processed it:
| Field | Value |
|---|---|
| Sr.No | |
| Alert / Report Description | |
| Dashboard link | |
| Thresholds and Severity | |
| Specs / Filters to Produce Results | |
| Alert Deployed? | |
| Reviewed by Ops Team | |
| Final Sign Off by Ops Team | |
| Review Comments (if any) | |
Review of the alert stage - Meant that the Ops team, upon receiving the configured alert, would conduct a detailed validation of the data calculation and condition evaluation, and conclude whether they had indeed received a valuable alert.
Final Sign Off Stage - Meant that, after reviewing the alert for a week or so, they trusted it enough to stop looking at the dashboards (like a NOC team would).
Choosing the Right Alerting Mechanism - Kibana Alerts vs. Watchers
Before committing to a solution, we explored two built-in options available within the ELK ecosystem: Kibana Alerts and Elasticsearch Watchers. We also looked at some open source options, the most prominent being ElastAlert (repo-link), a simple framework for alerting on anomalies, spikes, or other patterns of interest in Elasticsearch data, originally built by Yelp and later carried forward by Jason Ertel as ElastAlert2.
However, we soon realised that, as an enterprise user, implementing something native to ELK and covered by Enterprise Support would be a greater win than relying on a community-supported library (ElastAlert2).
The two options we evaluated, along with our findings, are shared below:
Kibana Alerts: (built into Kibana)
Kibana Alerts provided a straightforward UI- and API-driven approach to setting up alerts. They were easy to configure and could be triggered on simple threshold-based conditions. However, we quickly found limitations:
They could only operate on predefined conditions and rules, and lacked flexibility.
They struggled with historical comparisons and with tracking time-boxed data over time (for example, now-1h versus now-1d-1h).
The conditions they could track were binary or simple step changes; they couldn't handle multi-step logic sequences such as: run a high-level query → aggregate over time buckets → apply conditional checks on the results → trigger an action.
They lacked support for relative percentage-based calculations (e.g., a 20% drop compared to yesterday).
They excelled at incident management though, with built-in features like snoozing, recovery alerts, and alert state tracking.
So while they were great for basic cases and day-to-day triaging, we knew they wouldn’t hold up for the complex business logic we needed to model.
Kibana Alerts UI

Setting an Alert in Kibana -

Link to Kibana Docs - https://www.elastic.co/docs/explore-analyze/alerts-cases/alerts
Watchers: (built into Elasticsearch)
Elasticsearch Watchers, on the other hand, offered greater control over chaining the logic that leads to an alert. For example, they:
Support a range of input types, from simple static payloads to chaining multiple sources such as HTTP calls or EQL searches, all of which can feed into the evaluation logic.
Allow complex condition evaluation using the Painless scripting language (ironically named, because writing some of those scripts definitely wasn’t), helping bridge the gap between raw data and deciding whether to act on it.
Handle relative drop conditions and comparisons with prior time buckets, which Kibana Alerts didn’t support.
That said, from an Incident Lifecycle Management point of view, Watchers don’t plug into it as well as Kibana Alerts do. There is no snoozing and no built-in recovery notification, so post-processing of alerts (when you keep receiving the same alert over and over) falls to the end user. We could engineer this ourselves, but it is extra effort.
To explain Watchers in simpler terms, they are dumb crons: powerful background jobs that query, evaluate, and act, but they don’t carry state or support workflows beyond their configured actions. While they can be acknowledged, no new alerts fire until the original condition resets, which is tricky to configure when you report on multi-dimensional items. For example, an alert on stocks falling below 5% in value might produce Alert Body A (stocks 1, 2, 3) in one hourly run and Alert Body B (stocks 4, 5, 6) in the next. For the first alert to recover, exactly stocks 1, 2, and 3 must rise back above 5%, which will not necessarily be the case.
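For reference, acknowledging a watch goes through the Watcher API; a minimal sketch, with a hypothetical watch ID:

PUT _watcher/watch/stocks-below-5-percent/_ack

Acknowledging throttles the watch’s actions until its condition evaluates to false again, which is exactly the reset behaviour described above.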
Watcher UI -

Watcher Firing History

Link to Watcher Docs - https://www.elastic.co/docs/explore-analyze/alerts-cases/watcher
After evaluating both options, we decided that Watchers were the best fit for our needs because they gave us the flexibility to create alerts that were deeply integrated with the way our business operated.
When to use what offering?
If we needed an alert when users dropped below 1000 in the last hour - use Kibana Alerts.
If we needed an alert when users from the US, Germany, and India dropped below 1000 in the last hour compared to yesterday - use Elasticsearch Watchers, because Kibana Alerts does not understand what “compared to yesterday” means, or how to calculate such a value (see the sketch after this list).
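To make this concrete, below is a minimal sketch of what such a Watcher could look like: a chain input fetches the current hour and the same hour yesterday, and a Painless condition compares the two. The watch ID, the user-logins-* index, the geo.country field, and the webhook endpoint are hypothetical placeholders, not our client’s actual setup. Depending on your Elasticsearch version, hits.total may be an object (hits.total.value) rather than a plain number; here we follow the convention used in the payload example later in this post.

PUT _watcher/watch/users-drop-vs-yesterday
{
  "trigger": { "schedule": { "interval": "1h" } },
  "input": {
    "chain": {
      "inputs": [
        {
          "current": {
            "search": {
              "request": {
                "indices": [ "user-logins-*" ],
                "body": {
                  "size": 0,
                  "query": {
                    "bool": {
                      "filter": [
                        { "terms": { "geo.country": [ "US", "DE", "IN" ] } },
                        { "range": { "@timestamp": { "gte": "now-1h", "lt": "now" } } }
                      ]
                    }
                  }
                }
              }
            }
          }
        },
        {
          "yesterday": {
            "search": {
              "request": {
                "indices": [ "user-logins-*" ],
                "body": {
                  "size": 0,
                  "query": {
                    "bool": {
                      "filter": [
                        { "terms": { "geo.country": [ "US", "DE", "IN" ] } },
                        { "range": { "@timestamp": { "gte": "now-1d-1h", "lt": "now-1d" } } }
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      ]
    }
  },
  "condition": {
    "script": {
      "lang": "painless",
      "source": "return ctx.payload.current.hits.total < 1000 && ctx.payload.current.hits.total < 0.8 * ctx.payload.yesterday.hits.total"
    }
  },
  "actions": {
    "notify_ops": {
      "webhook": {
        "method": "POST",
        "host": "alert-gateway.internal",
        "port": 8080,
        "path": "/notify",
        "body": "Logins last hour: {{ctx.payload.current.hits.total}}, same hour yesterday: {{ctx.payload.yesterday.hits.total}}"
      }
    }
  }
}

The 0.8 multiplier encodes a 20% relative drop, exactly the kind of comparison Kibana Alerts could not express for us.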
How We Used Elasticsearch Watchers to Automate Alerting
After scoping our future alerting requirements, we aligned on Elasticsearch Watchers as the foundation for building alert conditions. This gave us the flexibility to define logic around business-specific metrics, compare against historical trends, and schedule alerts based on real-world usage patterns. Alerts weren’t just triggered; they were tied to signals that actually mattered to the business.
To make this framework operationally insightful, we designed every alert to log its payload into a dedicated Elasticsearch index. Each log carried a unique watcher_id, which acted as a traceable fingerprint for that alert. This served as the system’s single source of truth, allowing us to track how often alerts fired, identify noisy patterns, and tune thresholds over time.
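In Watcher terms, one way to do this is with an index action plus a small action-level transform. Here is a minimal sketch, assuming a hypothetical alerts-history index (the field names are our own choice, not anything Watcher mandates), that can sit alongside a watcher’s notification actions:

"actions": {
  "log_alert_event": {
    "transform": {
      "script": "return [ 'watcher_id': ctx.watch_id, 'fired_at': ctx.execution_time, 'error_count': ctx.payload.hits.total ]"
    },
    "index": {
      "index": "alerts-history"
    }
  }
}

The index action writes the (transformed) payload as a document, so every firing becomes a searchable record tagged with its watcher_id.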
What began as a simple alerting pipeline gradually matured into an intelligent feedback loop. By analyzing the alert history, we were able to detect trends, reduce false positives, and refine conditions with each iteration. The system not only scaled but also improved continuously.
Built-in Alert Insights - We were able to query our own alerts data over (24h, 1d, 1w) and so on!

Garbage In, Treasure Out - Seems Not!
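As an illustration of the kind of question this let us answer, here is a sketch of a query against that index, reusing the hypothetical alerts-history index and field names from the earlier sketch (watcher_id would need to be mapped as a keyword for the terms aggregation to work):

GET alerts-history/_search
{
  "size": 0,
  "query": {
    "range": { "fired_at": { "gte": "now-24h" } }
  },
  "aggs": {
    "noisiest_watchers": {
      "terms": { "field": "watcher_id", "size": 10 }
    }
  }
}

Ranking watchers by firing count over a day or a week is how we spotted noisy alerts and tuned their thresholds.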
Example of An Elasticsearch Watcher Payload Body:
{ "metadata" : { "color" : "red" }, "trigger" : { "schedule" : { "interval" : "5m" } }, "input" : { "search" : { "request" : { "indices" : "log-events", "body" : { "size" : 0, "query" : { "match" : { "status" : "error" } } } } } }, "condition" : { "compare" : { "ctx.payload.hits.total" : { "gt" : 5 }} }, "transform" : { "search" : { "request" : { "indices" : "log-events", "body" : { "query" : { "match" : { "status" : "error" } } } } } }, "actions" : { "my_webhook" : { "webhook" : { "method" : "POST", "host" : "mylisteninghost", "port" : 9200, "path" : "/{{watch_id}}", "body" : "Encountered {{ctx.payload.hits.total}} errors" } }, "email_administrator" : { "email" : { "to" : "sys.admino@host.domain", "subject" : "Encountered {{ctx.payload.hits.total}} errors", "body" : "Too many error in the system, see attached data", "attachments" : { "attached_data" : { "data" : { "format" : "json" } } }, "priority" : "high"
Leveraging GitOps for Scalable Alert Management
To ensure alerts were managed systematically, we built a GitLab repository and kicked off a CI pipeline with a few initial features, some of which are listed below. The pipeline:
Deploys any new alert rule logic added under the watcher folder structure.
Enables create, read, update, and delete (CRUD) operations on Watchers, selectively and en masse (yes, we had that itch) - see the API sketch below the repository structure.
Enables or disables all alert rules at once.
Maintains version control, allowing teams to track changes and roll back if needed.
Requires approval flows to ensure human oversight before deploying new alert conditions.
Integrates pipeline observability and notifications so that operators, managers, and business leaders can each consume them at the required level of context.
Supports multi-platform observability across environments, business verticals, and metric types by adopting a clean, extensible folder structure.
Repository Structure -
├── <environment>/                          # Environment layer (e.g., production/, staging/, dev/)
├── <business-vertical>/                    # Platform/product vertical
│   ├── business-metrics/                   # Business-facing alerts (e.g., revenue drops, game errors)
│   │   ├── <business-metric>-alert.json
│   └── infrastructure/                     # Infra-focused alerts (e.g., CPU, memory, container errors)
│       ├── <infra-metric>
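Under the hood, these pipeline operations map onto the Watcher REST API. A minimal sketch of the calls involved, with a hypothetical watch ID:

# Create or update a watcher from its JSON definition in the repo
PUT _watcher/watch/payment-errors-spike
{ ... watcher definition from the corresponding .json file ... }

# Read it back, or remove it
GET _watcher/watch/payment-errors-spike
DELETE _watcher/watch/payment-errors-spike

# Disable or re-enable a single rule; the pipeline loops over all watchers for mass operations
PUT _watcher/watch/payment-errors-spike/_deactivate
PUT _watcher/watch/payment-errors-spike/_activate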
Key Principles:
By treating alerts as code, operations teams could submit pull requests to update alert logic, and changes would be reviewed before deployment. This created a structured workflow where alerting became a collaborative process rather than a one-time configuration task.
Toggle-able & Composable:
Each directory level (e.g., /<environment>/<product>/kibana/watcher) is independently manageable, enabling selective deployments across products, environments, and metric domains.
Decoupled Thresholds:
Each alert rule (watcher) contains its own threshold logic, allowing isolated adjustments without affecting unrelated alerts.
GitOps Workflow:
Pull Request: Validates alert format; optionally triggers dry-run deploys.
Merge: Auto-validates, deploys to Elasticsearch/Kibana, and notifies stakeholders.
Five Key Takeaways from This Journey
Observability is Not Just a Tooling Problem
It's easy to assume that adding more dashboards, more alerts, or another monitoring tool will solve the issue. The reality is that observability is about how teams interact with monitoring systems. If human operators are still manually inspecting dashboards, even the best tools won’t fix the problem. The key is to build automation that integrates seamlessly into existing workflows and reduces operational toil.
Understanding the business context is as important as the tech that powers it
Engineers often jump straight into solutions, but the first step should always be understanding how the business functions. We only realized the value of Elasticsearch Watchers after analyzing how teams monitored incidents and what metrics truly mattered. Without this context, we might have built an overly complex system that didn't serve the right purpose.
Your Alerts Need to Be Accountable
Sending alerts isn’t enough; teams need visibility into how and why alerts fired. We created a self-improving alerting system by writing alert events into a dedicated Elasticsearch index. But this worked only because we knew what to log and why.
As business use-cases evolve, they surface new metrics and anomalies worth tracking. These in turn expose gaps in existing logs. Every meaningful alert is a reflection of complete, context-rich logging. Logging, alerting, and business logic move in lockstep. This feedback loop is what transforms alerting from a reactive signal into a proactive intelligence layer.
GitOps Brings Structure, Operators Bring Discipline and Sanity
Automating alert management through GitOps improved consistency and control, but its biggest value was in keeping humans in the loop. With version control, teams could track changes, roll back faulty alerts, and introduce approval flows to prevent misconfigured conditions. Automation works best when it provides flexibility while ensuring safety.
Extracting the Right Information is a Skill, Not Just a Process
Whether dealing with logs, alerts, or operational teams, the ability to extract relevant insights is what makes engineers effective. Most of the critical information for this project did not exist in documentation; it came from asking the right questions, observing pain points, and analyzing historical data.
The best engineers are not just problem-solvers; they are problem-finders!
At One2N, we do so much more. Apply to work with us at https://one2n.in/careers
Observability is at the core of any modern Site Reliability Engineering (SRE) practice. When teams depend on people manually scanning dashboards to spot issues, they risk missing critical incidents. This is the story of how we turned a messy, manual, and reactive monitoring setup into an automated, organized, and scalable alerting framework built primarily on the Elasticsearch-Logstash-Kibana (ELK) stack and guided by GitOps principles.
Trial by fire - an unusual pattern spotted for a scale too large!
The existing monitoring system at our client’s infrastructure was fragmented across multiple tools and platforms. Logs and metrics scattered between ELK, AppDynamics, and Prometheus and human-led monitoring efforts, that required operators to manually inspect dashboards every 30 minutes and report for spikes, drops and flatline conditions. This showed us a trending pattern of alarming incident rates due to blind spots in alerting, human inability to monitor multiple dimensions across multiple dashboards at once and the overwhelming cognitive load on operations teams that occurs while incidents take place.
For the client we were working with, rougly about 80% of the critical data both application and system logs flowed into Elasticsearch. This made Elasticsearch the clear priority for building an automated alerting system that could deliver the punch required.
The scale we were looking at was pretty staggering:
20+ Elasticsearch nodes managing over 100 terabytes of log data,
4 TB of daily ingestion and rotation,
400+ indices constantly churning out real-time data
1200+ Kibana dashboards
2000+ visualization panels
With dashboards being manually inspected and no structured automation in place, the risk of missing critical issues was extremely high that teams were required to operate in shifts, covering with attention and reporting with responsibility. Relying solely on human monitoring to detect issues at this scale was unsustainable. Without intervention, business-impacting incidents would continue to slip through the cracks.
This is where One2N stepped in - Remember systems at scale is our jam, let’s continue.
Understanding the Problem - Identifying Gaps in Alerting
To solve this issue effectively, we first needed to understand the problem from both a technical and operational perspective. We asked some fundamental questions before attempting technical ones (like every sane engineer should)
What should be monitored?
What signals indicate real issues?
Who needs to be alerted, and how?
How can we reduce noise and focus on actionable insights?
What criteria or decision matrices are used to evaluate anomalies?
What became clear was that alerting wasn’t just about the tools, it was about how people interacted with systems and with each other. This was a team that hadn’t been previously exposed to technical troubleshooting, and rather only relied on spotting, thus it was natural to understand that they did not approach troubleshooting from that lens. While ELK was heavily relied upon for post incident debugging and reporting issues, there was no structured mechanism to ensure that critical issues were proactively detected and responded to in a manner that felt uniform across multiple occurrences (yes, we’re heading towards a runbook).
To streamline this, we had to first have an extremely clear picture of what needed to be monitored, how did the operations team do it, and how could our automation do it. We introduced a deliberately simple, almost boring method that captured “what to alert on” in a shared Excel sheet. It may seem unsophisticated, but it served as an effective collaborative contract that we could set with the team, a baseline agreement between engineering and operations that truly made the difference.
For every alert use case the team had in mind , this is how it was processed by us
Field | Value |
---|---|
Sr.No | |
Alert / Report Description | |
Dashboard link | |
Thresholds and Severity | |
Specs / Filters to Produce Results | |
Alert Deployed? | |
Reviewed by Ops Team | |
Final Sign Off by Ops Team | |
Review Comments (if any) |
Review of the alert stage - Meant that Ops team upon receiving the set alert, would conduct a detailed validation of the data calculation, condition evaluation and arrive at the conclusion that if they indeed received a valuable alert.
Final Sign Off Stage - Meant that, upon reviewing this alert for a week or so , they trust it enough to stop looking at dashboards (like a NOC team)
Choosing the Right Alerting Mechanism - Kibana Alerts vs. Watchers
Before committing to proposing a solution to fix this, we explored two built-in options available within the ELK ecosystem: Kibana Alerts and Elasticsearch Watchers. Furthermore, we landed on some open source solutions too, the most prominent one of them being ElastAlert (repo-link) a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch, designed by Yelp initially, and improvised by Jason Ertel.
However we soon realised that implementing something native (within ELK) and supported by Enterprise Support would be a greater win than relying on a community supported library (ElastAlert2) as an enterprise user.
The 2 options we evaluated are shared with our findings below:
Kibana Alerts: (built into Kibana)
Provided a straightforward UI and API driven approach to setting up alerts. They were easy to configure and could be triggered based on simple threshold-based conditions. However, we quickly found limitations:
Kibana Alerts could only operate on predefined conditions and pre-defined rules and lacked flexibility.
They struggled with historical data comparisons or tracking time boxed data over time. (now-1h , now-1d-1h) for example.
The conditions they could track were binary, or stepping changes in nature, or they couldn't handle multi-step logic sequences like: (run a high-level query → then aggregate over time buckets → then apply conditional checks on the results → triggering an action) and so on.
They lacked support for relative percentage-based calculations (e.g., a 20% drop compared to yesterday).
They excelled at incident management though, with built-in features like snoozing, recovery alerts, and alert state tracking.
So while they were great for basic cases and day-to-day triaging, we knew they wouldn’t hold up for the complex business logic we needed to model.
Kibana Alerts UI

Setting an Alert in Kibana -

Link to Kibana Docs - https://www.elastic.co/docs/explore-analyze/alerts-cases/alerts
Watchers: (built into Elasticsearch)
Elasticsearch Watchers, on the other hand, offered greater control over chaining the logic that leads to your alerting logic. For example, they
Support a range of input types, from simple static payloads to chaining multiple sources like HTTP calls or EQL searches all of which can feed into the evaluation logic.
Allow complex condition evaluation using the Painless scripting language (ironically named, because writing some of those scripts definitely wasn’t), helping bridge the gap between raw data and deciding whether to act on it.
Handling of relative drop conditions and comparisons with prior time buckets, which Kibana Alerts didn’t support.
That said, When compared from an Incident Lifecycle Management POV - Watchers don’t plug themselves into it as much as Kibana Alerts does. No snoozing, no built-in recovery notifications for Watchers can be tricky to handle because post-processing of alerts (when you continue to receive the same alert multiple times) now falls to end user. We can engineer this, but its effort.
Watchers to be explained in simpler terms, are dumb crons - think of them like powerful background jobs that query, evaluate, and act but they don’t carry state or support workflows beyond their configured action. While they can be acknowledged, no new alerts are fired until the original condition resets, which can be tricky to configure if you have multi-dimensional items to report on (For example - stocks falling below 5% value) , can be 2 different lists Alert Body A (Stocks - 1,2,3) and Alert Body B (Stocks - 4,5,6) obtained if run at two consecutive 1 hour intervals. But for the first (alert 1) to recover, we need exactly (Stock - 1,2,3) have to raise above 5% which necessarily might not be the case.
Watcher UI -

Watcher Firing History

Link to Watcher Docs - https://www.elastic.co/docs/explore-analyze/alerts-cases/watcher
After evaluating both options, we decided that Watchers were the best fit for our needs because they gave us the flexibility to create alerts that were deeply integrated with the way our business operated.
When to use what offering?
If we needed an alert when users dropped below 1000 in the last hour - User Kibana Alerts
If we needed an alert when users from the US, Germany, and India dropped below 1000 in the last hour compared to yesterday - Use Elasticseach Watchers - because Kibana Alerts does not understand what “compared to yesterday” means, or how to calculate such a value.
How We Used Elasticsearch Watchers to Automate Alerting
After scoping our future alerting requirements, we aligned on Elasticsearch Watchers as the foundation for building alert conditions. This gave us the flexibility to define logic around business-specific metrics, compare against historical trends, and schedule alerts based on real-world usage patterns. Alerts weren’t just triggered; they were tied to signals that actually mattered to the business.
To make this framework operationally insightful, we designed every alert to log its payload into a dedicated Elasticsearch index. Each log carried a unique watcher_id, which acted as a traceable fingerprint for that alert. This served as the system’s single source of truth, allowing us to track how often alerts fired, identify noisy patterns, and tune thresholds over time.
What began as a simple alerting pipeline gradually matured into an intelligent feedback loop. By analyzing the alert history, we were able to detect trends, reduce false positives, and refine conditions with each iteration. The system not only scaled but also improved continuously.
Built-in Alert Insights - We were able to query our own alerts data over (24h, 1d, 1w) and so on!

Garbage In , Treasure Out - Seems Not!
Example of An Elasticsearch Watcher Payload Body:
{ "metadata" : { "color" : "red" }, "trigger" : { "schedule" : { "interval" : "5m" } }, "input" : { "search" : { "request" : { "indices" : "log-events", "body" : { "size" : 0, "query" : { "match" : { "status" : "error" } } } } } }, "condition" : { "compare" : { "ctx.payload.hits.total" : { "gt" : 5 }} }, "transform" : { "search" : { "request" : { "indices" : "log-events", "body" : { "query" : { "match" : { "status" : "error" } } } } } }, "actions" : { "my_webhook" : { "webhook" : { "method" : "POST", "host" : "mylisteninghost", "port" : 9200, "path" : "/{{watch_id}}", "body" : "Encountered {{ctx.payload.hits.total}} errors" } }, "email_administrator" : { "email" : { "to" : "sys.admino@host.domain", "subject" : "Encountered {{ctx.payload.hits.total}} errors", "body" : "Too many error in the system, see attached data", "attachments" : { "attached_data" : { "data" : { "format" : "json" } } }, "priority" : "high"
Leveraging GitOps for Scalable Alert Management
To ensure alerts were managed systematically, we built a Gitlab repository with a few initial features - we kicked off, some of which are mentioned below, the pipeline
Deploys any new alert rule logic added under the watcher folder structure.
Enables create, read, update, and delete (CRUD) operations on Watchers (selectively and en-masse! - yes we had that itch)
Mass Enable , or Disable all alert rules at once
Maintains version control, allowing teams to track changes and rollback if needed.
Requires approval flows to ensure human oversight before deploying new alert conditions.
Pipeline observability and notifications were integrated for operators , managers and business leaders accordingly to consume at required context levels.
Supports multi-platform observability across environments, business verticals, and metric types by adopting a clean, extensible folder structure
Repository Structure -
├── <business-vertical>/ <environment>/ # Environment layer (e.g., production/, staging/, dev/) ├── <business-vertical>/ # Platform/product vertical │ ├── business-metrics/ # Business-facing alerts (e.g., revenue drops, game errors) │ │ ├── <business-metric>-alert.json │ └── infrastructure/ # Infra-focused alerts (e.g., CPU, memory, container errors) │ ├── <infra-metric>
Key Principles:
By treating alerts as code, operations teams could submit pull requests to update alert logic, and changes would be reviewed before deployment. This created a structured workflow where alerting became a collaborative process rather than a one-time configuration task.
Toggle-able & Composable:
Each directory level (e.g.,
/<environment>/<product>/kibana/watcher
) is independently manageable, enabling selective deployments across products, environments, and metric domains.Decoupled Thresholds:
Each alert rule (watcher) contains its own threshold logic, allowing isolated adjustments without affecting unrelated alerts.
GitOps Workflow:
Pull Request: Validates alert format; optionally triggers dry-run deploys.
Merge: Auto-validates, deploys to Elasticsearch/Kibana, and notifies stakeholders.
Five Key Takeaways from This Journey
Observability is Not Just a Tooling Problem
It's easy to assume that adding more dashboards, more alerts, or another monitoring tool will solve the issue. The reality is that observability is about how teams interact with monitoring systems. If human operators are still manually inspecting dashboards, even the best tools won’t fix the problem. The key is to build automation that integrates seamlessly into existing workflows and reduces operational toil.
Understanding the business context is as important as the tech that powers it
Engineers often jump straight into solutions, but the first step should always be understanding how the business functions. We only realized the value of Elasticsearch Watchers after analyzing how teams monitored incidents and what metrics truly mattered. Without this context, we might have built an overly complex system that didn't serve the right purpose.
Your Alerts Need to Be Accountable
Sending alerts isn’t enough; teams need visibility into how and why alerts fired. We created a self-improving alerting system by writing alert events into a dedicated Elasticsearch index. But this worked only because we knew what to log and why.
As business use-cases evolve, they surface new metrics and anomalies worth tracking. These in turn expose gaps in existing logs. Every meaningful alert is a reflection of complete, context-rich logging. Logging, alerting, and business logic move in lockstep. This feedback loop is what transforms alerting from a reactive signal into a proactive intelligence layer.
GitOps Brings Structure, Operators bring Discipline and Sanity
Automating alert management through GitOps improved consistency and control, but its biggest value was in keeping humans in the loop. With version control, teams could track changes, roll back faulty alerts, and introduce approval flows to prevent misconfigured conditions. Automation works best when it provides flexibility while ensuring safety.
Extracting the Right Information is a Skill, Not Just a Process
Whether dealing with logs, alerts, or operational teams, the ability to extract relevant insights is what makes engineers effective. Most of the critical information for this project did not exist in documentation, it came from asking the right questions, observing pain points, and analyzing historical data.
The best engineers are not just problem-solvers; they are problem-finders!
At One2N, we do so much more, apply to work with us at https://one2n.in/careers
Observability is at the core of any modern Site Reliability Engineering (SRE) practice. When teams depend on people manually scanning dashboards to spot issues, they risk missing critical incidents. This is the story of how we turned a messy, manual, and reactive monitoring setup into an automated, organized, and scalable alerting framework built primarily on the Elasticsearch-Logstash-Kibana (ELK) stack and guided by GitOps principles.
Trial by fire - an unusual pattern spotted for a scale too large!
The existing monitoring system at our client’s infrastructure was fragmented across multiple tools and platforms. Logs and metrics scattered between ELK, AppDynamics, and Prometheus and human-led monitoring efforts, that required operators to manually inspect dashboards every 30 minutes and report for spikes, drops and flatline conditions. This showed us a trending pattern of alarming incident rates due to blind spots in alerting, human inability to monitor multiple dimensions across multiple dashboards at once and the overwhelming cognitive load on operations teams that occurs while incidents take place.
For the client we were working with, rougly about 80% of the critical data both application and system logs flowed into Elasticsearch. This made Elasticsearch the clear priority for building an automated alerting system that could deliver the punch required.
The scale we were looking at was pretty staggering:
20+ Elasticsearch nodes managing over 100 terabytes of log data,
4 TB of daily ingestion and rotation,
400+ indices constantly churning out real-time data
1200+ Kibana dashboards
2000+ visualization panels
With dashboards being manually inspected and no structured automation in place, the risk of missing critical issues was extremely high that teams were required to operate in shifts, covering with attention and reporting with responsibility. Relying solely on human monitoring to detect issues at this scale was unsustainable. Without intervention, business-impacting incidents would continue to slip through the cracks.
This is where One2N stepped in - Remember systems at scale is our jam, let’s continue.
Understanding the Problem - Identifying Gaps in Alerting
To solve this issue effectively, we first needed to understand the problem from both a technical and operational perspective. We asked some fundamental questions before attempting technical ones (like every sane engineer should)
What should be monitored?
What signals indicate real issues?
Who needs to be alerted, and how?
How can we reduce noise and focus on actionable insights?
What criteria or decision matrices are used to evaluate anomalies?
What became clear was that alerting wasn’t just about the tools, it was about how people interacted with systems and with each other. This was a team that hadn’t been previously exposed to technical troubleshooting, and rather only relied on spotting, thus it was natural to understand that they did not approach troubleshooting from that lens. While ELK was heavily relied upon for post incident debugging and reporting issues, there was no structured mechanism to ensure that critical issues were proactively detected and responded to in a manner that felt uniform across multiple occurrences (yes, we’re heading towards a runbook).
To streamline this, we had to first have an extremely clear picture of what needed to be monitored, how did the operations team do it, and how could our automation do it. We introduced a deliberately simple, almost boring method that captured “what to alert on” in a shared Excel sheet. It may seem unsophisticated, but it served as an effective collaborative contract that we could set with the team, a baseline agreement between engineering and operations that truly made the difference.
For every alert use case the team had in mind , this is how it was processed by us
Field | Value |
---|---|
Sr.No | |
Alert / Report Description | |
Dashboard link | |
Thresholds and Severity | |
Specs / Filters to Produce Results | |
Alert Deployed? | |
Reviewed by Ops Team | |
Final Sign Off by Ops Team | |
Review Comments (if any) |
Review of the alert stage - Meant that Ops team upon receiving the set alert, would conduct a detailed validation of the data calculation, condition evaluation and arrive at the conclusion that if they indeed received a valuable alert.
Final Sign Off Stage - Meant that, upon reviewing this alert for a week or so , they trust it enough to stop looking at dashboards (like a NOC team)
Choosing the Right Alerting Mechanism - Kibana Alerts vs. Watchers
Before committing to proposing a solution to fix this, we explored two built-in options available within the ELK ecosystem: Kibana Alerts and Elasticsearch Watchers. Furthermore, we landed on some open source solutions too, the most prominent one of them being ElastAlert (repo-link) a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch, designed by Yelp initially, and improvised by Jason Ertel.
However we soon realised that implementing something native (within ELK) and supported by Enterprise Support would be a greater win than relying on a community supported library (ElastAlert2) as an enterprise user.
The 2 options we evaluated are shared with our findings below:
Kibana Alerts: (built into Kibana)
Provided a straightforward UI and API driven approach to setting up alerts. They were easy to configure and could be triggered based on simple threshold-based conditions. However, we quickly found limitations:
Kibana Alerts could only operate on predefined conditions and pre-defined rules and lacked flexibility.
They struggled with historical data comparisons or tracking time boxed data over time. (now-1h , now-1d-1h) for example.
The conditions they could track were binary, or stepping changes in nature, or they couldn't handle multi-step logic sequences like: (run a high-level query → then aggregate over time buckets → then apply conditional checks on the results → triggering an action) and so on.
They lacked support for relative percentage-based calculations (e.g., a 20% drop compared to yesterday).
They excelled at incident management though, with built-in features like snoozing, recovery alerts, and alert state tracking.
So while they were great for basic cases and day-to-day triaging, we knew they wouldn’t hold up for the complex business logic we needed to model.
Kibana Alerts UI

Setting an Alert in Kibana -

Link to Kibana Docs - https://www.elastic.co/docs/explore-analyze/alerts-cases/alerts
Watchers: (built into Elasticsearch)
Elasticsearch Watchers, on the other hand, offered greater control over chaining the logic that leads to your alerting logic. For example, they
Support a range of input types, from simple static payloads to chaining multiple sources like HTTP calls or EQL searches all of which can feed into the evaluation logic.
Allow complex condition evaluation using the Painless scripting language (ironically named, because writing some of those scripts definitely wasn’t), helping bridge the gap between raw data and deciding whether to act on it.
Handling of relative drop conditions and comparisons with prior time buckets, which Kibana Alerts didn’t support.
That said, When compared from an Incident Lifecycle Management POV - Watchers don’t plug themselves into it as much as Kibana Alerts does. No snoozing, no built-in recovery notifications for Watchers can be tricky to handle because post-processing of alerts (when you continue to receive the same alert multiple times) now falls to end user. We can engineer this, but its effort.
Watchers to be explained in simpler terms, are dumb crons - think of them like powerful background jobs that query, evaluate, and act but they don’t carry state or support workflows beyond their configured action. While they can be acknowledged, no new alerts are fired until the original condition resets, which can be tricky to configure if you have multi-dimensional items to report on (For example - stocks falling below 5% value) , can be 2 different lists Alert Body A (Stocks - 1,2,3) and Alert Body B (Stocks - 4,5,6) obtained if run at two consecutive 1 hour intervals. But for the first (alert 1) to recover, we need exactly (Stock - 1,2,3) have to raise above 5% which necessarily might not be the case.
Watcher UI -

Watcher Firing History

Link to Watcher Docs - https://www.elastic.co/docs/explore-analyze/alerts-cases/watcher
After evaluating both options, we decided that Watchers were the best fit for our needs because they gave us the flexibility to create alerts that were deeply integrated with the way our business operated.
When to use what offering?
If we needed an alert when users dropped below 1000 in the last hour - User Kibana Alerts
If we needed an alert when users from the US, Germany, and India dropped below 1000 in the last hour compared to yesterday - Use Elasticseach Watchers - because Kibana Alerts does not understand what “compared to yesterday” means, or how to calculate such a value.
How We Used Elasticsearch Watchers to Automate Alerting
After scoping our future alerting requirements, we aligned on Elasticsearch Watchers as the foundation for building alert conditions. This gave us the flexibility to define logic around business-specific metrics, compare against historical trends, and schedule alerts based on real-world usage patterns. Alerts weren’t just triggered; they were tied to signals that actually mattered to the business.
To make this framework operationally insightful, we designed every alert to log its payload into a dedicated Elasticsearch index. Each log carried a unique watcher_id, which acted as a traceable fingerprint for that alert. This served as the system’s single source of truth, allowing us to track how often alerts fired, identify noisy patterns, and tune thresholds over time.
What began as a simple alerting pipeline gradually matured into an intelligent feedback loop. By analyzing the alert history, we were able to detect trends, reduce false positives, and refine conditions with each iteration. The system not only scaled but also improved continuously.
Built-in Alert Insights - We were able to query our own alerts data over (24h, 1d, 1w) and so on!

Garbage In , Treasure Out - Seems Not!
Example of An Elasticsearch Watcher Payload Body:
{ "metadata" : { "color" : "red" }, "trigger" : { "schedule" : { "interval" : "5m" } }, "input" : { "search" : { "request" : { "indices" : "log-events", "body" : { "size" : 0, "query" : { "match" : { "status" : "error" } } } } } }, "condition" : { "compare" : { "ctx.payload.hits.total" : { "gt" : 5 }} }, "transform" : { "search" : { "request" : { "indices" : "log-events", "body" : { "query" : { "match" : { "status" : "error" } } } } } }, "actions" : { "my_webhook" : { "webhook" : { "method" : "POST", "host" : "mylisteninghost", "port" : 9200, "path" : "/{{watch_id}}", "body" : "Encountered {{ctx.payload.hits.total}} errors" } }, "email_administrator" : { "email" : { "to" : "sys.admino@host.domain", "subject" : "Encountered {{ctx.payload.hits.total}} errors", "body" : "Too many error in the system, see attached data", "attachments" : { "attached_data" : { "data" : { "format" : "json" } } }, "priority" : "high"
Leveraging GitOps for Scalable Alert Management
To ensure alerts were managed systematically, we built a Gitlab repository with a few initial features - we kicked off, some of which are mentioned below, the pipeline
Deploys any new alert rule logic added under the watcher folder structure.
Enables create, read, update, and delete (CRUD) operations on Watchers (selectively and en-masse! - yes we had that itch)
Mass Enable , or Disable all alert rules at once
Maintains version control, allowing teams to track changes and rollback if needed.
Requires approval flows to ensure human oversight before deploying new alert conditions.
Pipeline observability and notifications were integrated for operators , managers and business leaders accordingly to consume at required context levels.
Supports multi-platform observability across environments, business verticals, and metric types by adopting a clean, extensible folder structure
Repository Structure -
├── <business-vertical>/ <environment>/ # Environment layer (e.g., production/, staging/, dev/) ├── <business-vertical>/ # Platform/product vertical │ ├── business-metrics/ # Business-facing alerts (e.g., revenue drops, game errors) │ │ ├── <business-metric>-alert.json │ └── infrastructure/ # Infra-focused alerts (e.g., CPU, memory, container errors) │ ├── <infra-metric>
Key Principles:
By treating alerts as code, operations teams could submit pull requests to update alert logic, and changes would be reviewed before deployment. This created a structured workflow where alerting became a collaborative process rather than a one-time configuration task.
Toggle-able & Composable:
Each directory level (e.g.,
/<environment>/<product>/kibana/watcher
) is independently manageable, enabling selective deployments across products, environments, and metric domains.Decoupled Thresholds:
Each alert rule (watcher) contains its own threshold logic, allowing isolated adjustments without affecting unrelated alerts.
GitOps Workflow:
Pull Request: Validates alert format; optionally triggers dry-run deploys.
Merge: Auto-validates, deploys to Elasticsearch/Kibana, and notifies stakeholders.
Five Key Takeaways from This Journey
Observability is Not Just a Tooling Problem
It's easy to assume that adding more dashboards, more alerts, or another monitoring tool will solve the issue. The reality is that observability is about how teams interact with monitoring systems. If human operators are still manually inspecting dashboards, even the best tools won’t fix the problem. The key is to build automation that integrates seamlessly into existing workflows and reduces operational toil.
Understanding the business context is as important as the tech that powers it
Engineers often jump straight into solutions, but the first step should always be understanding how the business functions. We only realized the value of Elasticsearch Watchers after analyzing how teams monitored incidents and what metrics truly mattered. Without this context, we might have built an overly complex system that didn't serve the right purpose.
Your Alerts Need to Be Accountable
Sending alerts isn’t enough; teams need visibility into how and why alerts fired. We created a self-improving alerting system by writing alert events into a dedicated Elasticsearch index. But this worked only because we knew what to log and why.
As business use-cases evolve, they surface new metrics and anomalies worth tracking. These in turn expose gaps in existing logs. Every meaningful alert is a reflection of complete, context-rich logging. Logging, alerting, and business logic move in lockstep. This feedback loop is what transforms alerting from a reactive signal into a proactive intelligence layer.
GitOps Brings Structure, Operators bring Discipline and Sanity
Automating alert management through GitOps improved consistency and control, but its biggest value was in keeping humans in the loop. With version control, teams could track changes, roll back faulty alerts, and introduce approval flows to prevent misconfigured conditions. Automation works best when it provides flexibility while ensuring safety.
Extracting the Right Information is a Skill, Not Just a Process
Whether dealing with logs, alerts, or operational teams, the ability to extract relevant insights is what makes engineers effective. Most of the critical information for this project did not exist in documentation, it came from asking the right questions, observing pain points, and analyzing historical data.
The best engineers are not just problem-solvers; they are problem-finders!
At One2N, we do so much more, apply to work with us at https://one2n.in/careers
Observability is at the core of any modern Site Reliability Engineering (SRE) practice. When teams depend on people manually scanning dashboards to spot issues, they risk missing critical incidents. This is the story of how we turned a messy, manual, and reactive monitoring setup into an automated, organized, and scalable alerting framework built primarily on the Elasticsearch-Logstash-Kibana (ELK) stack and guided by GitOps principles.
Trial by fire - an unusual pattern spotted for a scale too large!
The existing monitoring system at our client’s infrastructure was fragmented across multiple tools and platforms. Logs and metrics scattered between ELK, AppDynamics, and Prometheus and human-led monitoring efforts, that required operators to manually inspect dashboards every 30 minutes and report for spikes, drops and flatline conditions. This showed us a trending pattern of alarming incident rates due to blind spots in alerting, human inability to monitor multiple dimensions across multiple dashboards at once and the overwhelming cognitive load on operations teams that occurs while incidents take place.
For the client we were working with, rougly about 80% of the critical data both application and system logs flowed into Elasticsearch. This made Elasticsearch the clear priority for building an automated alerting system that could deliver the punch required.
The scale we were looking at was pretty staggering:
20+ Elasticsearch nodes managing over 100 terabytes of log data,
4 TB of daily ingestion and rotation,
400+ indices constantly churning out real-time data
1200+ Kibana dashboards
2000+ visualization panels
With dashboards being manually inspected and no structured automation in place, the risk of missing critical issues was extremely high that teams were required to operate in shifts, covering with attention and reporting with responsibility. Relying solely on human monitoring to detect issues at this scale was unsustainable. Without intervention, business-impacting incidents would continue to slip through the cracks.
This is where One2N stepped in - Remember systems at scale is our jam, let’s continue.
Understanding the Problem - Identifying Gaps in Alerting
To solve this issue effectively, we first needed to understand the problem from both a technical and operational perspective. We asked some fundamental questions before attempting technical ones (like every sane engineer should)
What should be monitored?
What signals indicate real issues?
Who needs to be alerted, and how?
How can we reduce noise and focus on actionable insights?
What criteria or decision matrices are used to evaluate anomalies?
What became clear was that alerting wasn’t just about the tools, it was about how people interacted with systems and with each other. This was a team that hadn’t been previously exposed to technical troubleshooting, and rather only relied on spotting, thus it was natural to understand that they did not approach troubleshooting from that lens. While ELK was heavily relied upon for post incident debugging and reporting issues, there was no structured mechanism to ensure that critical issues were proactively detected and responded to in a manner that felt uniform across multiple occurrences (yes, we’re heading towards a runbook).
To streamline this, we had to first have an extremely clear picture of what needed to be monitored, how did the operations team do it, and how could our automation do it. We introduced a deliberately simple, almost boring method that captured “what to alert on” in a shared Excel sheet. It may seem unsophisticated, but it served as an effective collaborative contract that we could set with the team, a baseline agreement between engineering and operations that truly made the difference.
For every alert use case the team had in mind , this is how it was processed by us
Field | Value |
---|---|
Sr.No | |
Alert / Report Description | |
Dashboard link | |
Thresholds and Severity | |
Specs / Filters to Produce Results | |
Alert Deployed? | |
Reviewed by Ops Team | |
Final Sign Off by Ops Team | |
Review Comments (if any) |
Review of the alert stage - Meant that Ops team upon receiving the set alert, would conduct a detailed validation of the data calculation, condition evaluation and arrive at the conclusion that if they indeed received a valuable alert.
Final Sign Off Stage - Meant that, upon reviewing this alert for a week or so , they trust it enough to stop looking at dashboards (like a NOC team)
Choosing the Right Alerting Mechanism - Kibana Alerts vs. Watchers
Before committing to proposing a solution to fix this, we explored two built-in options available within the ELK ecosystem: Kibana Alerts and Elasticsearch Watchers. Furthermore, we landed on some open source solutions too, the most prominent one of them being ElastAlert (repo-link) a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch, designed by Yelp initially, and improvised by Jason Ertel.
However we soon realised that implementing something native (within ELK) and supported by Enterprise Support would be a greater win than relying on a community supported library (ElastAlert2) as an enterprise user.
The 2 options we evaluated are shared with our findings below:
Kibana Alerts: (built into Kibana)
Provided a straightforward UI and API driven approach to setting up alerts. They were easy to configure and could be triggered based on simple threshold-based conditions. However, we quickly found limitations:
Kibana Alerts could only operate on predefined conditions and pre-defined rules and lacked flexibility.
They struggled with historical data comparisons or tracking time boxed data over time. (now-1h , now-1d-1h) for example.
The conditions they could track were binary, or stepping changes in nature, or they couldn't handle multi-step logic sequences like: (run a high-level query → then aggregate over time buckets → then apply conditional checks on the results → triggering an action) and so on.
They lacked support for relative percentage-based calculations (e.g., a 20% drop compared to yesterday).
They excelled at incident management though, with built-in features like snoozing, recovery alerts, and alert state tracking.
So while they were great for basic cases and day-to-day triaging, we knew they wouldn’t hold up for the complex business logic we needed to model.
Kibana Alerts UI

Setting an Alert in Kibana -

Link to Kibana Docs - https://www.elastic.co/docs/explore-analyze/alerts-cases/alerts
Watchers: (built into Elasticsearch)
Elasticsearch Watchers, on the other hand, offered greater control over chaining the logic that leads to your alerting logic. For example, they
Support a range of input types, from simple static payloads to chaining multiple sources like HTTP calls or EQL searches all of which can feed into the evaluation logic.
Allow complex condition evaluation using the Painless scripting language (ironically named, because writing some of those scripts definitely wasn’t), helping bridge the gap between raw data and deciding whether to act on it.
Handling of relative drop conditions and comparisons with prior time buckets, which Kibana Alerts didn’t support.
That said, When compared from an Incident Lifecycle Management POV - Watchers don’t plug themselves into it as much as Kibana Alerts does. No snoozing, no built-in recovery notifications for Watchers can be tricky to handle because post-processing of alerts (when you continue to receive the same alert multiple times) now falls to end user. We can engineer this, but its effort.
Watchers to be explained in simpler terms, are dumb crons - think of them like powerful background jobs that query, evaluate, and act but they don’t carry state or support workflows beyond their configured action. While they can be acknowledged, no new alerts are fired until the original condition resets, which can be tricky to configure if you have multi-dimensional items to report on (For example - stocks falling below 5% value) , can be 2 different lists Alert Body A (Stocks - 1,2,3) and Alert Body B (Stocks - 4,5,6) obtained if run at two consecutive 1 hour intervals. But for the first (alert 1) to recover, we need exactly (Stock - 1,2,3) have to raise above 5% which necessarily might not be the case.
Watcher UI -

Watcher Firing History

Link to Watcher Docs - https://www.elastic.co/docs/explore-analyze/alerts-cases/watcher
After evaluating both options, we decided that Watchers were the best fit for our needs because they gave us the flexibility to create alerts that were deeply integrated with the way our business operated.
When to use what offering?
If we needed an alert when users dropped below 1000 in the last hour - User Kibana Alerts
If we needed an alert when users from the US, Germany, and India dropped below 1000 in the last hour compared to yesterday - Use Elasticseach Watchers - because Kibana Alerts does not understand what “compared to yesterday” means, or how to calculate such a value.
How We Used Elasticsearch Watchers to Automate Alerting
After scoping our future alerting requirements, we aligned on Elasticsearch Watchers as the foundation for building alert conditions. This gave us the flexibility to define logic around business-specific metrics, compare against historical trends, and schedule alerts based on real-world usage patterns. Alerts weren’t just triggered; they were tied to signals that actually mattered to the business.
To make this framework operationally insightful, we designed every alert to log its payload into a dedicated Elasticsearch index. Each log carried a unique watcher_id, which acted as a traceable fingerprint for that alert. This served as the system’s single source of truth, allowing us to track how often alerts fired, identify noisy patterns, and tune thresholds over time.
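In practice this meant every watch carried one extra action alongside its notification action. A minimal sketch (the alert-audit index, the action name, and the field names are ours, not the client's actual schema):

"actions" : {
  "log_to_audit_index" : {
    "transform" : {
      "script" : {
        "lang" : "painless",
        "source" : "ctx.payload.watcher_id = ctx.watch_id; return ctx.payload;"
      }
    },
    "index" : {
      "index" : "alert-audit",
      "execution_time_field" : "fired_at"
    }
  }
}

The action-level transform stamps the firing payload with the watch's own id before the index action writes it, and execution_time_field records when it fired.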
What began as a simple alerting pipeline gradually matured into an intelligent feedback loop. By analyzing the alert history, we were able to detect trends, reduce false positives, and refine conditions with each iteration. The system not only scaled but also improved continuously.
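For example, pulling the noisiest watchers over the last day became a small aggregation over the audit index (index and field names follow the sketch above and are illustrative):

GET alert-audit/_search
{
  "size": 0,
  "query": { "range": { "fired_at": { "gte": "now-24h" } } },
  "aggs": {
    "noisiest_watchers": {
      "terms": { "field": "watcher_id.keyword", "size": 20 }
    }
  }
}

Swapping now-24h for now-1d or now-1w gives the longer-horizon views mentioned below.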
Built-in Alert Insights - We were able to query our own alerts data over (24h, 1d, 1w) and so on!

Garbage In, Treasure Out - Seems Not!
Example of an Elasticsearch Watcher Payload Body:
{
  "metadata" : { "color" : "red" },
  "trigger" : { "schedule" : { "interval" : "5m" } },
  "input" : {
    "search" : {
      "request" : {
        "indices" : "log-events",
        "body" : {
          "size" : 0,
          "query" : { "match" : { "status" : "error" } }
        }
      }
    }
  },
  "condition" : {
    "compare" : { "ctx.payload.hits.total" : { "gt" : 5 } }
  },
  "transform" : {
    "search" : {
      "request" : {
        "indices" : "log-events",
        "body" : {
          "query" : { "match" : { "status" : "error" } }
        }
      }
    }
  },
  "actions" : {
    "my_webhook" : {
      "webhook" : {
        "method" : "POST",
        "host" : "mylisteninghost",
        "port" : 9200,
        "path" : "/{{watch_id}}",
        "body" : "Encountered {{ctx.payload.hits.total}} errors"
      }
    },
    "email_administrator" : {
      "email" : {
        "to" : "sys.admino@host.domain",
        "subject" : "Encountered {{ctx.payload.hits.total}} errors",
        "body" : "Too many errors in the system, see attached data",
        "attachments" : {
          "attached_data" : { "data" : { "format" : "json" } }
        },
        "priority" : "high"
      }
    }
  }
}
Leveraging GitOps for Scalable Alert Management
To ensure alerts were managed systematically, we built a GitLab repository and a deployment pipeline around it. We kicked off with a few initial features, some of which are listed below. The pipeline:
Deploys any new alert rule logic added under the watcher folder structure.
Enables create, read, update, and delete (CRUD) operations on Watchers, selectively and en masse (yes, we had that itch; the underlying API calls are sketched after this list).
Enables or disables all alert rules at once (mass toggle).
Maintains version control, allowing teams to track changes and roll back if needed.
Requires approval flows to ensure human oversight before deploying new alert conditions.
Integrates pipeline observability and notifications so that operators, managers, and business leaders each consume updates at the level of context they need.
Supports multi-platform observability across environments, business verticals, and metric types through a clean, extensible folder structure.
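As mentioned in the CRUD item above, most of these pipeline features reduce to a handful of Watcher API calls wrapped in CI jobs. A sketch of the ones we lean on (watch ids are placeholders):

# Create or update a watcher from a JSON file in the repository
PUT _watcher/watch/<watch_id>

# Read or delete a single watcher
GET _watcher/watch/<watch_id>
DELETE _watcher/watch/<watch_id>

# Enable or disable a watcher without deleting it (used for the mass toggles)
PUT _watcher/watch/<watch_id>/_activate
PUT _watcher/watch/<watch_id>/_deactivate

# List stored watchers (used for en-masse operations)
GET _watcher/_query/watches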
Repository Structure -
├── <environment>/                        # Environment layer (e.g., production/, staging/, dev/)
│   ├── <business-vertical>/              # Platform/product vertical
│   │   ├── business-metrics/             # Business-facing alerts (e.g., revenue drops, game errors)
│   │   │   ├── <business-metric>-alert.json
│   │   └── infrastructure/               # Infra-focused alerts (e.g., CPU, memory, container errors)
│   │       ├── <infra-metric>
Key Principles:
By treating alerts as code, operations teams could submit pull requests to update alert logic, and changes would be reviewed before deployment. This created a structured workflow where alerting became a collaborative process rather than a one-time configuration task.
Toggle-able & Composable:
Each directory level (e.g., /<environment>/<product>/kibana/watcher) is independently manageable, enabling selective deployments across products, environments, and metric domains.
Decoupled Thresholds:
Each alert rule (watcher) contains its own threshold logic, allowing isolated adjustments without affecting unrelated alerts.
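One simple way to keep thresholds decoupled is to carry the threshold in the watch's own metadata and reference it from the condition, so tuning it is a one-line change in that watcher's JSON file. A sketch, reusing the chained payload names from the earlier example (the drop_ratio key is ours):

{
  "metadata" : { "drop_ratio" : 0.8 },
  "condition" : {
    "script" : {
      "lang" : "painless",
      "source" : "return ctx.payload.today.hits.total < ctx.metadata.drop_ratio * ctx.payload.yesterday.hits.total"
    }
  }
}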
GitOps Workflow:
Pull Request: Validates alert format; optionally triggers dry-run deploys (see the sketch below).
Merge: Auto-validates, deploys to Elasticsearch/Kibana, and notifies stakeholders.
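For the dry-run mentioned in the pull-request step, the Watcher execute API is handy: it accepts an inline watch definition and can simulate actions without storing the watch or notifying anyone. A rough sketch:

POST _watcher/watch/_execute
{
  "watch": { ...contents of the changed <business-metric>-alert.json... },
  "action_modes": { "_all": "simulate" }
}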
Five Key Takeaways from This Journey
Observability is Not Just a Tooling Problem
It's easy to assume that adding more dashboards, more alerts, or another monitoring tool will solve the issue. The reality is that observability is about how teams interact with monitoring systems. If human operators are still manually inspecting dashboards, even the best tools won’t fix the problem. The key is to build automation that integrates seamlessly into existing workflows and reduces operational toil.
Understanding the business context is as important as the tech that powers it
Engineers often jump straight into solutions, but the first step should always be understanding how the business functions. We only realized the value of Elasticsearch Watchers after analyzing how teams monitored incidents and what metrics truly mattered. Without this context, we might have built an overly complex system that didn't serve the right purpose.
Your Alerts Need to Be Accountable
Sending alerts isn’t enough; teams need visibility into how and why alerts fired. We created a self-improving alerting system by writing alert events into a dedicated Elasticsearch index. But this worked only because we knew what to log and why.
As business use-cases evolve, they surface new metrics and anomalies worth tracking. These in turn expose gaps in existing logs. Every meaningful alert is a reflection of complete, context-rich logging. Logging, alerting, and business logic move in lockstep. This feedback loop is what transforms alerting from a reactive signal into a proactive intelligence layer.
GitOps Brings Structure, Operators Bring Discipline and Sanity
Automating alert management through GitOps improved consistency and control, but its biggest value was in keeping humans in the loop. With version control, teams could track changes, roll back faulty alerts, and introduce approval flows to prevent misconfigured conditions. Automation works best when it provides flexibility while ensuring safety.
Extracting the Right Information is a Skill, Not Just a Process
Whether dealing with logs, alerts, or operational teams, the ability to extract relevant insights is what makes engineers effective. Most of the critical information for this project did not exist in documentation; it came from asking the right questions, observing pain points, and analyzing historical data.
The best engineers are not just problem-solvers; they are problem-finders!
At One2N, we do so much more, apply to work with us at https://one2n.in/careers