Observability Zero to One: A pragmatic guide to OpenTelemetry

Aug 20, 2025 | 5 min read

The growing pain of modern systems

Picture this: It’s 3 AM and PagerDuty screams. The alert reads: High API Error Rate on Payments Service. Your heart sinks. You’re awake, you’re looking at a dashboard, and you can confirm that yes, the 5xx error rate is through the roof. The monitoring system has done its job; it told you what is broken. But now the real work begins. Why is it broken? Is it a bad deploy? A database connection pool exhaustion? A downstream service failing?

This is the daily reality for a Site Reliability Engineer (SRE). In this world, knowing that something is wrong is just the first, and frankly, easiest step. The real challenge lies in quickly understanding the root cause across a complex web of microservices.

This is where we move beyond simple monitoring into observability. It’s the difference between knowing your car’s engine light is on and having the full diagnostic data to know exactly which sensor failed and why.

Observability vs. monitoring: what's the difference?

The terms "monitoring" and "observability" are often used interchangeably, but they represent a critical evolution in how we manage systems. Understanding the distinction is fundamental for any SRE.

Monitoring is the practice of collecting and analyzing data based on a predefined set of metrics and logs. It’s about watching for problems you already know can happen.

💡 Monitoring answers questions about known-knowns: failure modes we have anticipated and built dashboards for.

It tells you when something is wrong and what is wrong, like "CPU utilization is at 90%" or "API latency is over 500ms".

Observability, on the other hand, is a property of a system that allows you to understand its internal state by examining its external outputs.

💡 While monitoring tells you that something is wrong, observability gives you the power to explore and ask why it’s wrong. It’s designed to help you debug the unknown-unknowns: novel problems you couldn’t have predicted.

Observability doesn't just look at individual components; it provides a holistic view of the entire distributed system, enabling you to trace issues across service boundaries to find the true root cause.

Crucially, monitoring is a foundational part of observability. You can't have an observable system without first monitoring it to generate the necessary data.

The three pillars of observability

Observability is built on three core types of telemetry data, often called the "three pillars":

  1. Metrics: Numerical measurements aggregated over time. They are the vital signs of your application: CPU usage, request rates, error counts. Metrics are efficient and great for dashboards and alerting on thresholds.

  2. Logs: Timestamped, immutable records of discrete events. They provide detailed, granular context about a specific error or operation, telling you the specifics of what happened at a particular moment.

  3. Traces: The cornerstone of observability in distributed systems. A trace represents the complete, end-to-end journey of a single request as it propagates through multiple microservices. Traces answer the crucial question of where a problem occurred in a complex workflow.

By unifying these three signals, an SRE can go from a high-level alert (a metric) to the specific error message (a log) and see exactly which downstream service call failed and caused the error (a trace).

The problem: a world of many agents

For years, the path to collecting this telemetry data was fragmented. If you wanted metrics, you might use a Prometheus exporter. For logs, a Fluentd agent. For traces, you’d install a proprietary APM agent from a vendor like Datadog, New Relic, or AppDynamics. Each tool had its own agent, its own configuration, and its own data format.

This created a "tower of babel" for telemetry. Engineering teams were burdened with managing multiple agents, and data was siloed in different backends. Worst of all, it created deep vendor lock-in. If you instrumented your entire codebase with one vendor's agent, switching to another was a massive, cost-prohibitive undertaking that required re-instrumenting every single service.

The solution: how OpenTelemetry came to be

The open-source community recognized this problem and produced two parallel projects: OpenTracing, which focused on a standard API for tracing, and OpenCensus, which provided libraries for collecting both traces and metrics. While both were steps in the right direction, they split the community.

In 2019, these two projects merged under the Cloud Native Computing Foundation (CNCF) to form OpenTelemetry (OTEL). OTEL combined the strengths of both, creating a single, unified, open-source standard for all telemetry data.

What exactly is OpenTelemetry?

OpenTelemetry is an open-source observability framework comprising a collection of APIs, SDKs, and tools designed to standardize the generation, collection, and export of telemetry data: metrics, logs, and traces.

It is critical to understand what OpenTelemetry is not. OTEL is not an observability backend. It does not provide a database for storing data or a user interface for visualizing it. Instead, OTEL is the transport layer. It is the standardized plumbing that decouples your application's instrumentation from the analysis tool you choose. It focuses on three key jobs:

  1. Instrument: Providing the APIs and SDKs to generate telemetry from your code.

  2. Pipeline: Offering the OpenTelemetry Collector to process and route that data.

  3. Connect: Using exporters to send the data to a sink (a backend like Jaeger, Prometheus, or Datadog), as sketched in the configuration below.
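
To make the "Pipeline" and "Connect" jobs concrete, here is a minimal sketch of a Collector configuration. It assumes the contrib distribution of the Collector and uses placeholder endpoints: telemetry arrives over OTLP, is batched, and traces are forwarded to Jaeger (which ingests OTLP natively) while metrics are exposed for Prometheus to scrape.

```yaml
# Minimal Collector sketch: one receiver, one processor, two exporters.
# Endpoints are placeholders; adjust for your environment.
receivers:
  otlp:                          # receive OTLP from instrumented apps
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:                         # batch telemetry before export

exporters:
  otlp/jaeger:                   # Jaeger accepts OTLP directly
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:                    # expose a /metrics endpoint for Prometheus
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```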

Why OTEL matters: the value of no lock-in

For a pragmatic SRE, the most significant benefit of OpenTelemetry is strategic: it eliminates vendor lock-in. Because your code is instrumented against a vendor-neutral standard, you regain control over your data and your tooling choices.

Consider a common scenario: A company is using a premium, all-in-one observability platform. It's powerful but expensive. For their production environment, the cost is justified. But for staging, development, and testing environments, they are paying for features they don't need. They want to use a more cost-effective open-source stack like Prometheus and Grafana for non-prod environments to save money.

  • The Old Way (Proprietary Agents): You would have to maintain two separate instrumentation codebases in every single application, one for the commercial vendor and one for Prometheus. This is unmaintainable and would never get done.

  • The OTEL Way: With OpenTelemetry, this becomes a simple configuration change. The application code is instrumented only once with the OTEL SDK. The magic happens in the OpenTelemetry Collector, which can be configured to route telemetry based on its attributes. You can set up a pipeline that says, "If environment=production, send data to the Datadog exporter. If environment=staging, send data to the Prometheus exporter." No application code changes are needed; a configuration sketch follows below.
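
Here is a rough sketch of what that attribute-based routing could look like for metrics, using the Collector's routing connector. This assumes the contrib distribution of the Collector (which ships the routing connector and the Datadog exporter); the attribute name, API key variable, and exact field names are illustrative and may differ between Collector versions.

```yaml
# Illustrative routing sketch: production metrics go to Datadog,
# everything else goes to the open-source stack. Values are placeholders.
receivers:
  otlp:
    protocols:
      grpc:

connectors:
  routing:
    default_pipelines: [metrics/staging]      # fall-through for non-prod
    table:
      - statement: route() where attributes["deployment.environment"] == "production"
        pipelines: [metrics/production]

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}                  # placeholder secret
  prometheus:
    endpoint: 0.0.0.0:8889                    # scraped by non-prod Prometheus

service:
  pipelines:
    metrics/in:                               # ingest, then hand off to routing
      receivers: [otlp]
      exporters: [routing]
    metrics/production:
      receivers: [routing]
      exporters: [datadog]
    metrics/staging:
      receivers: [routing]
      exporters: [prometheus]
```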

This flexibility gives SREs the ability to choose the best tool for the job and optimize for both performance and cost.

A balanced view: the trade-offs of OpenTelemetry

While the flexibility of OpenTelemetry is a significant advantage, it's not a silver bullet. This freedom comes with increased responsibility. Proprietary agents, like those from Datadog, often provide more extensive "out-of-the-box" auto-instrumentation that covers a wide array of technologies with minimal setup.

With OpenTelemetry, while auto-instrumentation for common frameworks is available, you may find yourself needing to write more manual instrumentation to achieve the same depth of visibility. This is the core trade-off: you gain vendor neutrality at the potential cost of convenience. Committing to OTEL means you might lose access to certain tightly integrated, vendor-specific features that work seamlessly with their proprietary agents. It's a strategic decision that weighs the long-term benefit of control against the short-term benefit of a more managed, all-in-one solution.

How OTEL proved its worth in a real-world migration scenario

Let's look at a more concrete example of OTEL's strategic power. A fast-growing fintech company was using AppDynamics for their Application Performance Monitoring. It was powerful, but as they scaled, the costs became astronomical. They decided to migrate to Datadog, which offered a more flexible pricing model for their needs.

The Challenge: The company had hundreds of microservices, all deeply instrumented with proprietary AppDynamics agents. The engineering cost to rip out the old instrumentation and replace it with Datadog's agents across every service was estimated to be thousands of hours, a project so large and risky it was a non-starter.

An OTEL-based approach: Instead of a "big bang" re-instrumentation, they adopted OpenTelemetry. Their migration became a phased, manageable process:

  1. Standardize on OTEL: All new services were instrumented using the vendor-neutral OpenTelemetry SDKs.

  2. Deploy the Collector: They deployed the OpenTelemetry Collector and configured it to receive data in multiple formats.

  3. Dual-Exporting: For a transition period, they configured the Collector to export telemetry to both AppDynamics and Datadog simultaneously. It's important to note that this strategy was applied incrementally, starting with a small set of critical services. Sending all telemetry from all services to two external backends can significantly increase data egress costs, so a phased approach is crucial. This allowed them to build out their new Datadog dashboards and alerts and validate them against the old system in real time, reducing the risk of losing visibility without incurring excessive costs (a configuration sketch follows this list).

  4. Migrate Incrementally: They gradually replaced the legacy AppDynamics agents in older services with OTEL instrumentation at their own pace, during regular maintenance cycles.

  5. Flip the Switch: Once the migration was complete and validated, they simply removed the AppDynamics exporter from their Collector configuration. The final cutover was a low-risk, controlled event.
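
A sketch of what the dual-export step could look like in the Collector is shown below. It assumes both backends can ingest data from the Collector: Datadog via its contrib exporter and AppDynamics via a generic OTLP/HTTP endpoint; the endpoint, header, and key names here are placeholders, not the vendors' actual values.

```yaml
# Illustrative transition-period sketch: one pipeline, two exporters.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}               # placeholder
  otlphttp/appdynamics:                    # placeholder OTLP ingest endpoint
    endpoint: https://appd-otlp.example.com
    headers:
      x-api-key: ${env:APPD_API_KEY}       # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      # Step 3: fan out to both backends during the transition.
      # Step 5: the final cutover is simply deleting otlphttp/appdynamics
      # from this list.
      exporters: [datadog, otlphttp/appdynamics]
```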

The Outcome: This phased approach allowed the migration to be completed with predictable, managed risk and far less engineering disruption than a "rip and replace" project would have entailed. More importantly, they successfully broke their vendor lock-in. They now have the freedom to use Datadog for production while using a self-hosted Prometheus stack for development, all controlled via a central Collector configuration, giving them greater control over their observability strategy and costs.

What's next?

This guide is just the tip of the iceberg for OpenTelemetry. Now that you understand the "why" and the "what," you can start exploring more advanced topics.

In upcoming posts, we'll be exploring:

  • The nitty-gritty of OTEL Collector processors

  • OTEL deployment patterns

Stay tuned to continue your journey of learning OpenTelemetry from the ground up.

Subscribe for more such content

Stay updated with the latest insights and best practices in software engineering and site reliability engineering by subscribing to our content.
