Teams rarely set out to build black-box APIs, yet that is exactly how many of them behave in production. When latency spikes or errors appear, teams often know something is wrong but still cannot explain why quickly enough.
In this post, we’ll build a small Go API in a way that makes it observable from day zero, not after the first painful incident. We’ll use OpenTelemetry for traces and context propagation, Go’s slog for structured logs, and Prometheus for metrics, because the goal is not to collect more telemetry but to make production behaviour explainable.
If you want to get comfortable building production-grade REST APIs in Go first, start with our Go Bootcamp.
What observable means
Monitoring tells you whether an API is up; observability helps you explain strange behaviour you did not explicitly predict when you shipped it. For an API, that means being able to answer which requests are failing, how latency varies by endpoint or tenant, and what changed around the time things broke.
The key is that metrics, logs, and traces must reinforce each other. Metrics tell you that something moved, traces tell you where time went, and logs give you detailed request context.
The demo service
To make this a follow-along, we’ll build a tiny service with two endpoints:
GET /health for a quick health check.
GET /users/{id} that simulates a downstream dependency call.
By the end, the service will:
Return a request ID in every response.
Emit structured logs for every request.
Expose Prometheus metrics on /metrics.
Create traces that show request flow and downstream latency.
That design-first approach is the same mindset behind Why your Architecture should start with Questions, not boxes: the important operational choices are made before production traffic shows up.
Step 1: Start with the API contract
Most teams treat observability as an implementation detail, but some of the highest-value decisions belong in the API contract. A standard response shape gives clients a predictable model and gives operators a reliable way to correlate failures across systems.
Use a single response envelope like this:
    type APIResponse struct {
        Data      any    `json:"data,omitempty"`
        Error     string `json:"error,omitempty"`
        ErrorCode string `json:"error_code,omitempty"`
        RequestID string `json:"request_id"`
    }
And a small helper to write responses consistently:
    func writeJSON(w http.ResponseWriter, status int, resp APIResponse) {
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(status)
        _ = json.NewEncoder(w).Encode(resp)
    }
A good rule here is to standardise three things early:
A machine-readable error code like validation_error or upstream_timeout.
A human-readable error string.
A request ID returned to the caller.
That one decision makes dashboards easier to group, support tickets easier to trace, and logs easier to search. If you already design APIs for service boundaries, this also pairs well with One2N’s Backend Engineering practice page for framing APIs as long-lived operational surfaces.
Checkpoint
At this stage, your API does not need OTEL or Prometheus yet. It should simply return a stable JSON shape for both success and error paths.
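To check that shape, it helps to print the envelope for one success and one error. The sketch below repeats the APIResponse struct so it runs on its own; the render helper, the sample IDs, and the sample error text are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// APIResponse mirrors the envelope defined earlier in this step.
type APIResponse struct {
	Data      any    `json:"data,omitempty"`
	Error     string `json:"error,omitempty"`
	ErrorCode string `json:"error_code,omitempty"`
	RequestID string `json:"request_id"`
}

// render marshals an envelope to a JSON string (empty string on failure).
func render(resp APIResponse) string {
	b, err := json.Marshal(resp)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Success path: data and request_id only; the error fields are omitted.
	fmt.Println(render(APIResponse{
		Data:      map[string]any{"id": "123"},
		RequestID: "req-1",
	}))
	// Error path: error, error_code, and request_id; data is omitted.
	fmt.Println(render(APIResponse{
		Error:     "user id must be numeric",
		ErrorCode: "validation_error",
		RequestID: "req-2",
	}))
}
```

Because request_id has no omitempty tag, it is present on both paths, which is what lets a caller quote it in a support ticket regardless of outcome.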
Step 2: Bootstrap tracing and metrics early
If tracing and metrics are initialised late or inconsistently, gaps appear exactly where you need clarity most. Set up the OTEL SDK, propagators, and metrics exposure before any handler starts serving traffic.
A simple OTEL tracer bootstrap in Go can look like this:
    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
        "go.opentelemetry.io/otel/propagation"
        "go.opentelemetry.io/otel/sdk/resource"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
    )

    func InitTracerProvider(ctx context.Context, serviceName string) (*sdktrace.TracerProvider, error) {
        exporter, err := otlptracehttp.New(ctx)
        if err != nil {
            return nil, err
        }

        tp := sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(exporter),
            sdktrace.WithResource(resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceName(serviceName),
            )),
        )

        otel.SetTracerProvider(tp)
        otel.SetTextMapPropagator(
            propagation.NewCompositeTextMapPropagator(
                propagation.TraceContext{},
                propagation.Baggage{},
            ),
        )

        return tp, nil
    }
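The provider returned by InitTracerProvider batches spans, so it is worth shutting it down on exit to give buffered spans a chance to flush. The sketch below shows one way to bound that flush with a timeout. It depends only on a small shutdowner interface, which the SDK's TracerProvider satisfies through its Shutdown(ctx) method; fakeProvider, flushedWithin, and the timings are stand-ins invented here so the example runs without the SDK.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// shutdowner is the only surface this sketch needs from a tracer provider.
type shutdowner interface {
	Shutdown(ctx context.Context) error
}

// shutdownWithTimeout gives the provider a bounded window to flush spans.
func shutdownWithTimeout(s shutdowner, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return s.Shutdown(ctx)
}

// fakeProvider is a hypothetical stand-in that "flushes" after a delay.
type fakeProvider struct {
	delay   time.Duration
	flushed bool
}

func (f *fakeProvider) Shutdown(ctx context.Context) error {
	select {
	case <-time.After(f.delay):
		f.flushed = true
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// flushedWithin reports whether a fake provider that needs delayMs to flush
// finishes inside a timeout of timeoutMs.
func flushedWithin(delayMs, timeoutMs int) bool {
	p := &fakeProvider{delay: time.Duration(delayMs) * time.Millisecond}
	err := shutdownWithTimeout(p, time.Duration(timeoutMs)*time.Millisecond)
	return err == nil && p.flushed
}

func main() {
	fmt.Println("fast flush completed:", flushedWithin(5, 500))
	fmt.Println("slow flush completed:", flushedWithin(500, 5))
}
```

In main, this would typically sit in a defer right after the InitTracerProvider call, with a timeout of a few seconds.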
Expose Prometheus metrics on a separate endpoint:
    go func() {
        // Serve metrics from a dedicated mux so they stay off the API router.
        mux := http.NewServeMux()
        mux.Handle("/metrics", promhttp.Handler())
        if err := http.ListenAndServe(":9090", mux); err != nil {
            log.Fatal(err)
        }
    }()
Serving /metrics on its own port, as above, keeps it off the public API surface. Restrict that port to internal traffic if you do not want the metrics exposed publicly.
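One way to keep that separation honest is to build the public and internal routers as two distinct muxes and check that /metrics exists on only one of them. The following is a minimal sketch under that assumption: stubMetrics stands in for promhttp.Handler(), and the function names are illustrative rather than part of any library.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newPublicMux holds only customer-facing routes.
func newPublicMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// newInternalMux holds operational routes such as /metrics.
func newInternalMux(metrics http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	return mux
}

// stubMetrics is a placeholder for promhttp.Handler().
func stubMetrics() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "# metrics placeholder")
	})
}

// statusFor sends one in-memory GET request and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println("public /metrics:", statusFor(newPublicMux(), "/metrics"))
	fmt.Println("internal /metrics:", statusFor(newInternalMux(stubMetrics()), "/metrics"))
}
```

Each mux would then get its own listener, for example :8080 for the public one and :9090 for the internal one.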
Checkpoint
By now, your service should:
Start without handler changes.
Export traces to an OTLP-compatible backend.
Expose /metrics on port 9090.
Step 3: Assign and propagate request IDs
You cannot debug distributed systems without consistent request identity. Correlation IDs are what let one customer complaint become a searchable trail across logs, traces, and downstream calls.
A small middleware can attach a request ID to every request:
    package middleware

    import (
        "context"
        "net/http"

        "github.com/google/uuid"
    )

    type contextKey string

    const RequestIDKey contextKey = "request_id"

    // RequestID reuses the caller's X-Request-Id if present, otherwise generates one.
    func RequestID(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            requestID := r.Header.Get("X-Request-Id")
            if requestID == "" {
                requestID = uuid.NewString()
            }

            ctx := context.WithValue(r.Context(), RequestIDKey, requestID)
            w.Header().Set("X-Request-Id", requestID)

            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }

    // GetRequestID returns the request ID stored in ctx, or "" if none is set.
    func GetRequestID(ctx context.Context) string {
        if v, ok := ctx.Value(RequestIDKey).(string); ok {
            return v
        }
        return ""
    }
Then use it in your handlers:
    func getUserHandler(w http.ResponseWriter, r *http.Request) {
        requestID := middleware.GetRequestID(r.Context())

        resp := APIResponse{
            Data: map[string]any{
                "id":   "123",
                "name": "Jane Doe",
            },
            RequestID: requestID,
        }

        writeJSON(w, http.StatusOK, resp)
    }
Make one request with curl, inspect the response headers, and confirm that the same request ID appears in both the response body and the X-Request-Id header.
Try it
curl -i http://localhost:8080/users/123
You should see:
X-Request-Id in the response headers.
request_id in the JSON body.
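The middleware covers the inbound side. For the ID to follow a request into downstream calls, outgoing requests need to carry it too. Here is a minimal sketch of that step, assuming the same X-Request-Id header is used between services; the context key is redeclared so the example runs on its own, and the users.internal URL is a made-up placeholder.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type contextKey string

// requestIDKey mirrors the key used by the RequestID middleware.
const requestIDKey contextKey = "request_id"

// withRequestID stores a request ID in the context, as the middleware would.
func withRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, requestIDKey, id)
}

// newDownstreamRequest builds an outgoing GET that carries the caller's
// request ID, so the downstream service can log the same identifier.
func newDownstreamRequest(ctx context.Context, url string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if id, ok := ctx.Value(requestIDKey).(string); ok && id != "" {
		req.Header.Set("X-Request-Id", id)
	}
	return req, nil
}

// outgoingRequestID returns the header value a downstream service would see
// for a request made under the given ID. The URL is hypothetical.
func outgoingRequestID(id string) string {
	ctx := withRequestID(context.Background(), id)
	req, err := newDownstreamRequest(ctx, "http://users.internal/users/123")
	if err != nil {
		return ""
	}
	return req.Header.Get("X-Request-Id")
}

func main() {
	fmt.Println(outgoingRequestID("req-42"))
}
```

If the downstream service runs the same middleware, it picks the header up and reuses the ID instead of generating a new one.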
Step 4: Make logs structured and useful
Text logs are fine when skimming one process locally, but they break down fast across services and incidents. Observable APIs need logs that are structured, queryable, and rich enough to explain a specific request.
Go’s slog is a good default:
    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    }))
Log with request context in handlers:
    func getUserHandler(logger *slog.Logger) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            requestID := middleware.GetRequestID(r.Context())
            userID := r.PathValue("id")

            logger.Info("fetching user",
                "request_id", requestID,
                "endpoint", "/users/{id}",
                "user_id", userID,
                "method", r.Method,
            )

            resp := APIResponse{
                Data: map[string]any{
                    "id":   userID,
                    "name": "Jane Doe",
                },
                RequestID: requestID,
            }

            writeJSON(w, http.StatusOK, resp)
        }
    }
The practical rule is simple:
Log in JSON.
Use consistent field names.
Add business context only when it helps explain behaviour.
Avoid two common mistakes:
Free-form log messages with the important values buried in strings.
Logging every tiny event with no consistent keys, which makes production search noisy and expensive.
Checkpoint
Make one request and verify that your logs include:
request_id
endpoint
method
user_id, where relevant.
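Repeating the same keys on every log call is easy to get wrong. One option is to derive a request-scoped logger once with slog's With method, so every later line carries the shared fields automatically. The sketch below assumes that approach; requestLogger and loggedField are illustrative helpers, and the output goes to an in-memory buffer so the fields can be inspected.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// requestLogger returns a child logger that stamps every line with the
// request-level fields, so handlers only add what is specific to the event.
func requestLogger(base *slog.Logger, requestID, method, endpoint string) *slog.Logger {
	return base.With("request_id", requestID, "method", method, "endpoint", endpoint)
}

// loggedField writes one line through a request-scoped logger and returns
// the string value of the given key from the resulting JSON line.
func loggedField(key string) string {
	var buf bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&buf, nil))

	reqLog := requestLogger(base, "req-42", "GET", "/users/{id}")
	reqLog.Info("fetching user", "user_id", "123")

	var line map[string]any
	if err := json.Unmarshal(buf.Bytes(), &line); err != nil {
		return ""
	}
	s, _ := line[key].(string)
	return s
}

func main() {
	fmt.Println(loggedField("request_id"), loggedField("endpoint"), loggedField("user_id"))
}
```

In the service, the base logger would write to os.Stdout as shown earlier, and the child logger would be created at the top of each handler or in middleware.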
Step 5: Add tracing at the boundaries
Metrics tell you that something is wrong; traces tell you where the time went. For APIs that call databases, caches, or third-party services, tracing stops being optional very quickly.
Start a span in the handler:
    // Needs go.opentelemetry.io/otel, go.opentelemetry.io/otel/attribute,
    // and go.opentelemetry.io/otel/codes in addition to the earlier imports.
    var tracer = otel.Tracer("users-api")

    func getUserHandler(logger *slog.Logger) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            ctx, span := tracer.Start(r.Context(), "getUserHandler")
            defer span.End()

            userID := r.PathValue("id")
            requestID := middleware.GetRequestID(ctx)

            span.SetAttributes(
                attribute.String("http.method", r.Method),
                attribute.String("http.route", "/users/{id}"),
                attribute.String("app.request_id", requestID),
                attribute.String("app.user_id", userID),
            )

            user, err := fetchUserFromDownstream(ctx, userID)
            if err != nil {
                // Log the failure with the same keys as the span so the two can be joined.
                logger.Error("failed to fetch user",
                    "request_id", requestID,
                    "user_id", userID,
                    "error", err,
                )
                span.RecordError(err)
                span.SetStatus(codes.Error, "downstream failure")
                writeJSON(w, http.StatusBadGateway, APIResponse{
                    Error:     "failed to fetch user",
                    ErrorCode: "upstream_timeout",
                    RequestID: requestID,
                })
                return
            }

            writeJSON(w, http.StatusOK, APIResponse{
                Data:      user,
                RequestID: requestID,
            })
        }
    }
Wrap the downstream boundary too:
    func fetchUserFromDownstream(ctx context.Context, userID string) (map[string]any, error) {
        ctx, span := tracer.Start(ctx, "fetchUserFromDownstream")
        defer span.End()

        // Simulated downstream latency.
        time.Sleep(120 * time.Millisecond)

        return map[string]any{
            "id":   userID,
            "name": "Jane Doe",
        }, nil
    }
The key idea is to instrument boundaries, not every tiny function. Handlers, external calls, DB queries, and cache lookups are usually enough to make traces useful without turning them noisy.
Checkpoint
By now, one request to /users/123 should produce:
A parent span for the handler.
A child span for the downstream fetch.
Shared request context between the trace and logs.
Step 6: Expose the right metrics
APIs produce a lot of numbers, but only a few matter every day. Start with SLI-shaped metrics: latency, throughput, and error rate.
A simple request counter and latency histogram in Prometheus might look like this:
    var (
        httpRequestsTotal = promauto.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Total number of HTTP requests",
            },
            []string{"method", "route", "status"},
        )

        httpRequestDuration = promauto.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "http_request_duration_seconds",
                Help:    "HTTP request latency",
                Buckets: prometheus.DefBuckets,
            },
            []string{"method", "route", "status"},
        )
    )
And wrap handlers with instrumentation:
    func instrument(route string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()

            // Default to 200 because handlers may write a body without calling WriteHeader.
            rw := &statusRecorder{ResponseWriter: w, statusCode: http.StatusOK}
            next.ServeHTTP(rw, r)

            status := strconv.Itoa(rw.statusCode)
            duration := time.Since(start).Seconds()

            httpRequestsTotal.WithLabelValues(r.Method, route, status).Inc()
            httpRequestDuration.WithLabelValues(r.Method, route, status).Observe(duration)
        })
    }

    // statusRecorder captures the status code written by the wrapped handler.
    type statusRecorder struct {
        http.ResponseWriter
        statusCode int
    }

    func (r *statusRecorder) WriteHeader(statusCode int) {
        r.statusCode = statusCode
        r.ResponseWriter.WriteHeader(statusCode)
    }
A useful note here is what not to label:
Do not label metrics with raw user_id or request_id, because that creates high-cardinality metrics that become hard to store and query efficiently.
Prefer stable, low-cardinality dimensions such as route, method, status, tenant tier, or region when needed.
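A small guard in code can keep label values bounded even when the input is not. The sketch below shows two hypothetical helpers: boundedLabel maps anything outside a known set to "other", and statusClass collapses a status code into its class for dashboards that do not need the exact code. Neither is part of the Prometheus client; the tier names are invented for the example.

```go
package main

import "fmt"

// boundedLabel returns value if it is in the allowed set and "other"
// otherwise, so an unexpected input cannot create a new time series.
func boundedLabel(value string, allowed ...string) string {
	for _, a := range allowed {
		if value == a {
			return value
		}
	}
	return "other"
}

// statusClass collapses an HTTP status code into "2xx", "4xx", and so on.
// Codes outside 100-599 are reported as "unknown".
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

func main() {
	// Tier names here are illustrative assumptions.
	fmt.Println(boundedLabel("gold", "free", "gold", "enterprise"))
	fmt.Println(boundedLabel("user-8812", "free", "gold", "enterprise"))
	fmt.Println(statusClass(200), statusClass(502))
}
```

The same idea applies to the route label: pass the route template, as instrument does, rather than the raw request path.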
These metrics become most useful when tied to release and reliability workflows; One2N’s CI/CD page and SRE Bootcamp are natural next reads.
Checkpoint
Hit the API a few times and then open /metrics. You should be able to find:
http_requests_total
http_request_duration_seconds
Step 7: Break it on purpose
A follow-along post becomes more useful when readers can observe a failure, not just a healthy request path. So let’s simulate a flaky downstream dependency.
Change the downstream function:
    func fetchUserFromDownstream(ctx context.Context, userID string) (map[string]any, error) {
        ctx, span := tracer.Start(ctx, "fetchUserFromDownstream")
        defer span.End()

        // User ID "500" triggers the simulated slow failure.
        if userID == "500" {
            err := errors.New("downstream timeout")
            span.RecordError(err)
            span.SetStatus(codes.Error, err.Error())
            time.Sleep(800 * time.Millisecond)
            return nil, err
        }

        time.Sleep(120 * time.Millisecond)

        return map[string]any{
            "id":   userID,
            "name": "Jane Doe",
        }, nil
    }
Now call:
curl -i http://localhost:8080/users/500
You should now see the three signals line up:
The response returns an error payload with error_code=upstream_timeout and a request_id.
The logs show the same request ID and an error path.
The trace shows the failing downstream span and the extra latency.
The metrics reflect a slower request and a 502 response.
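The simulated failure above always sleeps for the full 800 ms. A real dependency call would usually run under a deadline, so that the handler gives up early and the upstream_timeout code reflects an actual timeout. Here is a minimal sketch of that variation, using only the standard library; slowDependency, errorCodeFor, callWithBudget, and the upstream_error code are assumptions made for this example rather than part of the service above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowDependency simulates a downstream call that takes `latency`, but
// returns early if the context is cancelled or its deadline passes.
func slowDependency(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// errorCodeFor maps a downstream error to a machine-readable error code.
// "upstream_error" is a hypothetical catch-all for non-timeout failures.
func errorCodeFor(err error) string {
	switch {
	case err == nil:
		return ""
	case errors.Is(err, context.DeadlineExceeded):
		return "upstream_timeout"
	default:
		return "upstream_error"
	}
}

// callWithBudget runs the simulated dependency under a deadline and returns
// the resulting error code ("" on success). Both arguments are milliseconds.
func callWithBudget(latencyMs, budgetMs int) string {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(budgetMs)*time.Millisecond)
	defer cancel()
	return errorCodeFor(slowDependency(ctx, time.Duration(latencyMs)*time.Millisecond))
}

func main() {
	fmt.Println("slow call:", callWithBudget(800, 50))
	fmt.Println("fast call ok:", callWithBudget(5, 500) == "")
}
```

With a deadline in place, the trace for a failing request shows a downstream span that ends at the budget rather than at the dependency's full latency.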
One2N’s How one parameter stalled Postgres Replication with Debezium for 3 Weeks is a good companion read because it shows why production debugging gets expensive when systems are operationally opaque.
Step 8: Make observability part of delivery
If observability only appears at the end of implementation, it will always feel like overhead. The better pattern is to make it part of how endpoints are shipped.
A practical team checklist looks like this:
Every new endpoint returns a request ID.
Every handler emits structured logs with stable keys.
Every external dependency call has a span around it.
Every critical path exposes latency and error metrics.
Dashboards and alerts live in version control alongside code where possible.
That “observability as code” mindset lines up well with One2N’s Gitops for Kafka in the real world, where operational control improves when systems are expressed explicitly rather than left implicit.
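Parts of that checklist can be enforced mechanically. As one example, the sketch below checks that every listed route responds with an X-Request-Id header, which is the kind of assertion that could run in CI. It is self-contained: requestID is a trimmed-down stand-in for the Step 3 middleware, newID replaces the uuid dependency with random hex, and routes registers placeholder handlers.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newID returns a random hex string; it stands in for uuid.NewString().
func newID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// requestID is a trimmed-down stand-in for the Step 3 middleware.
func requestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-Id")
		if id == "" {
			id = newID()
		}
		w.Header().Set("X-Request-Id", id)
		next.ServeHTTP(w, r)
	})
}

// routes registers the demo endpoints with placeholder handlers (Go 1.22+).
func routes() *http.ServeMux {
	mux := http.NewServeMux()
	ok := func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) }
	mux.HandleFunc("GET /health", ok)
	mux.HandleFunc("GET /users/{id}", ok)
	return mux
}

// missingRequestID returns the paths whose responses lack X-Request-Id.
func missingRequestID(h http.Handler, paths ...string) []string {
	var missing []string
	for _, p := range paths {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, p, nil))
		if rec.Header().Get("X-Request-Id") == "" {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	fmt.Println("with middleware:", missingRequestID(requestID(routes()), "/health", "/users/123"))
	fmt.Println("without middleware:", missingRequestID(routes(), "/health", "/users/123"))
}
```

In a real repository the same check would live in a _test.go file and fail the build whenever a new endpoint is added outside the middleware chain.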
What you should have now
At this point, your small Go API should do five useful things:
Return consistent response envelopes.
Attach and propagate request IDs.
Emit structured JSON logs.
Expose Prometheus metrics.
Create traces that explain downstream latency and failure paths.
That does not make the service perfect. It does make it far easier to operate, debug, and evolve under real traffic.
Before you ship
Before you ship the next API or feature, check for these basics:
Every request has a correlation ID that flows across services and appears in responses.
Logs are structured, centralised, and enriched with useful request context.
Key SLIs such as latency, throughput, and error rate exist for critical endpoints.
Traces show boundary spans for downstream calls and capture failure states.
Metrics, alerts, and dashboard assumptions are treated as part of delivery, not post-release cleanup.
That is what “observable from day zero” really means in practice.
Teams rarely set out to build black-box APIs, yet that is exactly how many of them behave in production. When latency spikes or errors appear, teams often know something is wrong but still cannot explain why quickly enough.
In this post, we’ll build a small Go API in a way that makes it observable from day zero, not after the first painful incident. We’ll use OpenTelemetry for traces and context propagation, Go’s slog for structured logs, and Prometheus for metrics, because the goal is not to collect more telemetry but to make production behaviour explainable.
If you want to get comfortable building production-grade REST APIs in Go first, start with our Go Bootcamp.
What observable means
Monitoring tells you whether an API is up; observability helps you explain strange behaviour you did not explicitly predict when you shipped it. For an API, that means being able to answer which requests are failing, how latency varies by endpoint or tenant, and what changed around the time things broke.
The key is that metrics, logs, and traces must reinforce each other. Metrics tell you that something moved, traces tell you where time went, and logs give you detailed request context.
The demo service
To make this a follow-along, we’ll build a tiny service with two endpoints:
GET /healthfor a quick health check.GET /users/{id}that simulates a downstream dependency call.
By the end, the service will:
Return a request ID in every response.
Emit structured logs for every request.
Expose Prometheus metrics on
/metrics.Create traces that show request flow and downstream latency.
That design-first approach is the same mindset behind Why your Architecture should start with Questions, not boxes: the important operational choices are made before production traffic shows up.
Step 1: Start with the API contract
Most teams treat observability as an implementation detail, but some of the highest-value decisions belong in the API contract. A standard response shape gives clients a predictable model and gives operators a reliable way to correlate failures across systems.
Use a single response envelope like this:
type APIResponse struct { Data any `json:"data,omitempty"` Error string `json:"error,omitempty"` ErrorCode string `json:"error_code,omitempty"` RequestID string `json:"request_id"` }
And a small helper to write responses consistently:
func writeJSON(w http.ResponseWriter, status int, resp APIResponse) { w.Header().Set("Content-Type", "application/json") w.WriteHeader(status) _ = json.NewEncoder(w).Encode(resp) }
A good rule here is to standardise three things early:
A machine-readable error code like
validation_errororupstream_timeout.A human-readable error string.
A request ID returned to the caller.
That one decision makes dashboards easier to group, support tickets easier to trace, and logs easier to search. If you already design APIs for service boundaries, this also pairs well with One2N’s Backend Engineering practice page for framing APIs as long-lived operational surfaces.
Checkpoint
At this stage, your API does not need OTEL or Prometheus yet. It should simply return a stable JSON shape for both success and error paths.
Step 2: Bootstrap tracing and metrics early
If tracing and metrics are initialised late or inconsistently, gaps appear exactly where you need clarity most. Set up the OTEL SDK, propagators, and metrics exposure before any handler starts serving traffic.
A simple OTEL tracer bootstrap in Go can look like this:
package telemetry import ( "context" "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/propagation" "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp" "go.opentelemetry.io/otel/sdk/resource" sdktrace "go.opentelemetry.io/otel/sdk/trace" semconv "go.opentelemetry.io/otel/semconv/v1.26.0" ) func InitTracerProvider(ctx context.Context, serviceName string) (*sdktrace.TracerProvider, error) { exporter, err := otlptracehttp.New(ctx) if err != nil { return nil, err } tp := sdktrace.NewTracerProvider( sdktrace.WithBatcher(exporter), sdktrace.WithResource(resource.NewWithAttributes( semconv.SchemaURL, semconv.ServiceName(serviceName), )), ) otel.SetTracerProvider(tp) otel.SetTextMapPropagator( propagation.NewCompositeTextMapPropagator( propagation.TraceContext{}, propagation.Baggage{}, ), ) return tp, nil }
Expose Prometheus metrics on a separate endpoint:
go func() { mux := http.NewServeMux() mux.Handle("/metrics", promhttp.Handler()) if err := http.ListenAndServe(":9090", mux); err != nil { log.Fatal(err) } }()
This is a good place to add a note in the live post that /metrics can run on a separate internal port if you do not want to expose it on the public API surface.
Checkpoint
By now, your service should:
Start without handler changes.
Export traces to an OTLP-compatible backend.
Expose
/metricson port9090.
Step 3: Assign and propagate request IDs
You cannot debug distributed systems without consistent request identity. Correlation IDs are what let one customer complaint become a searchable trail across logs, traces, and downstream calls.
A small middleware can attach a request ID to every request:
package middleware import ( "context" "net/http" "github.com/google/uuid" ) type contextKey string const RequestIDKey contextKey = "request_id" func RequestID(next http.Handler) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { requestID := r.Header.Get("X-Request-Id") if requestID == "" { requestID = uuid.NewString() } ctx := context.WithValue(r.Context(), RequestIDKey, requestID) w.Header().Set("X-Request-Id", requestID) next.ServeHTTP(w, r.WithContext(ctx)) }) } func GetRequestID(ctx context.Context) string { if v, ok := ctx.Value(RequestIDKey).(string); ok { return v } return "" }
Then use it in your handlers:
func getUserHandler(w http.ResponseWriter, r *http.Request) { requestID := middleware.GetRequestID(r.Context()) resp := APIResponse{ Data: map[string]any{ "id": "123", "name": "Jane Doe", }, RequestID: requestID, } writeJSON(w, http.StatusOK, resp) }
This is the moment where the article should feel interactive. Make one request with curl, inspect the response headers, and confirm that the same request ID appears in both the response body and the X-Request-Id header.
Try it
curl -i http://localhost:8080/users/123
You should see:
X-Request-Idin the response headers.request_idin the JSON body.
Step 4: Make logs structured and useful
Text logs are fine when skimming one process locally, but they break down fast across services and incidents. Observable APIs need logs that are structured, queryable, and rich enough to explain a specific request.
Go’s slog is a good default:
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{ Level: slog.LevelInfo, }))
Log with request context in handlers:
func getUserHandler(logger *slog.Logger) http.HandlerFunc { return func(w http.ResponseWriter, r *http.Request) { requestID := middleware.GetRequestID(r.Context()) userID := r.PathValue("id") logger.Info("fetching user", "request_id", requestID, "endpoint", "/users/{id}", "user_id", userID, "method", r.Method, ) resp := APIResponse{ Data: map[string]any{ "id": userID, "name": "Jane Doe", }, RequestID: requestID, } writeJSON(w, http.StatusOK, resp) } }
The practical rule is simple:
Log in JSON.
Use consistent field names.
Add business context only when it helps explain behaviour.
Avoid two common mistakes:
Free-form log messages with the important values buried in strings.
Logging every tiny event with no consistent keys, which makes production search noisy and expensive.
Checkpoint
Make one request and verify that your logs include:
request_idendpointmethoduser_idwhere relevant.
Step 5: Add tracing at the boundaries
Metrics tell you that something is wrong; traces tell you where the time went. For APIs that call databases, caches, or third-party services, tracing stops being optional very quickly.
Start a span in the handler:
var tracer = otel.Tracer("users-api") func getUserHandler(logger *slog.Logger) http.HandlerFunc { return func(w http.ResponseWriter, r *http.Request) { ctx, span := tracer.Start(r.Context(), "getUserHandler") defer span.End() userID := r.PathValue("id") requestID := middleware.GetRequestID(ctx) span.SetAttributes( attribute.String("http.method", r.Method), attribute.String("http.route", "/users/{id}"), attribute.String("app.request_id", requestID), attribute.String("app.user_id", userID), ) user, err := fetchUserFromDownstream(ctx, userID) if err != nil { span.RecordError(err) span.SetStatus(codes.Error, "downstream failure") writeJSON(w, http.StatusBadGateway, APIResponse{ Error: "failed to fetch user", ErrorCode: "upstream_timeout", RequestID: requestID, }) return } writeJSON(w, http.StatusOK, APIResponse{ Data: user, RequestID: requestID, }) } }
Wrap the downstream boundary too:
func fetchUserFromDownstream(ctx context.Context, userID string) (map[string]any, error) { ctx, span := tracer.Start(ctx, "fetchUserFromDownstream") defer span.End() time.Sleep(120 * time.Millisecond) return map[string]any{ "id": userID, "name": "Jane Doe", }, nil }
The key idea is to instrument boundaries, not every tiny function. Handlers, external calls, DB queries, and cache lookups are usually enough to make traces useful without turning them noisy.
Checkpoint
By now, one request to /users/123 should produce:
A parent span for the handler.
A child span for the downstream fetch.
Shared request context between the trace and logs.
Step 6: Expose the right metrics
APIs produce a lot of numbers, but only a few matter every day. Start with SLI-shaped metrics: latency, throughput, and error rate.
A simple request counter and latency histogram in Prometheus might look like this:
var ( httpRequestsTotal = promauto.NewCounterVec( prometheus.CounterOpts{ Name: "http_requests_total", Help: "Total number of HTTP requests", }, []string{"method", "route", "status"}, ) httpRequestDuration = promauto.NewHistogramVec( prometheus.HistogramOpts{ Name: "http_request_duration_seconds", Help: "HTTP request latency", Buckets: prometheus.DefBuckets, }, []string{"method", "route", "status"}, ) )
And wrap handlers with instrumentation:
func instrument(route string, next http.Handler) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { start := time.Now() rw := &statusRecorder{ResponseWriter: w, statusCode: http.StatusOK} next.ServeHTTP(rw, r) status := strconv.Itoa(rw.statusCode) duration := time.Since(start).Seconds() httpRequestsTotal.WithLabelValues(r.Method, route, status).Inc() httpRequestDuration.WithLabelValues(r.Method, route, status).Observe(duration) }) } type statusRecorder struct { http.ResponseWriter statusCode int } func (r *statusRecorder) WriteHeader(statusCode int) { r.statusCode = statusCode r.ResponseWriter.WriteHeader(statusCode) }
A useful note here is what not to label:
Do not label metrics with raw
user_idorrequest_id, because that creates high-cardinality metrics that become hard to store and query efficiently.Prefer stable, low-cardinality dimensions such as route, method, status, tenant tier, or region when needed.
This is also a natural spot to reference One2N’s CI/CD page or SRE Bootcamp, since these metrics become most useful when tied to release and reliability workflows.
Checkpoint
Hit the API a few times and then open /metrics. You should be able to find:
http_requests_totalhttp_request_duration_seconds.
Step 7: Break it on purpose
A follow-along post becomes more useful when readers can observe a failure, not just a healthy request path. So let’s simulate a flaky downstream dependency.
Change the downstream function:
func fetchUserFromDownstream(ctx context.Context, userID string) (map[string]any, error) { ctx, span := tracer.Start(ctx, "fetchUserFromDownstream") defer span.End() if userID == "500" { err := errors.New("downstream timeout") span.RecordError(err) span.SetStatus(codes.Error, err.Error()) time.Sleep(800 * time.Millisecond) return nil, err } time.Sleep(120 * time.Millisecond) return map[string]any{ "id": userID, "name": "Jane Doe", }, nil }
Now call:
curl -i http://localhost:8080/users/500
You should now see the three signals line up:
The response returns an error payload with
error_code=upstream_timeoutand arequest_id.The logs show the same request ID and an error path.
The trace shows the failing downstream span and the extra latency.
The metrics reflect a slower request and a 502 response.
This is also where a real-world internal cross-link helps. One2N’s How one parameter stalled Postgres Replication with Debezium for 3 Weeks is a good companion read because it shows why production debugging gets expensive when systems are operationally opaque.
Step 8: Make observability part of delivery
If observability only appears at the end of implementation, it will always feel like overhead. The better pattern is to make it part of how endpoints are shipped.
A practical team checklist looks like this:
Every new endpoint returns a request ID.
Every handler emits structured logs with stable keys.
Every external dependency call has a span around it.
Every critical path exposes latency and error metrics.
Dashboards and alerts live in version control alongside code where possible.
That “observability as code” mindset lines up well with One2N’s Gitops for Kafka in the real world, where operational control improves when systems are expressed explicitly rather than left implicit.
What you should have now
At this point, your small Go API should do five useful things:
Return consistent response envelopes.
Attach and propagate request IDs.
Emit structured JSON logs.
Expose Prometheus metrics.
Create traces that explain downstream latency and failure paths.
That does not make the service perfect. It does make it far easier to operate, debug, and evolve under real traffic.
Before you ship
Before you ship the next API or feature, check for these basics:
Every request has a correlation ID that flows across services and appears in responses.
Logs are structured, centralised, and enriched with useful request context.
Key SLIs such as latency, throughput, and error rate exist for critical endpoints.
Traces show boundary spans for downstream calls and capture failure states.
Metrics, alerts, and dashboard assumptions are treated as part of delivery, not post-release cleanup.
That is what “observable from day zero” really means in practice
Teams rarely set out to build black-box APIs, yet that is exactly how many of them behave in production. When latency spikes or errors appear, teams often know something is wrong but still cannot explain why quickly enough.
In this post, we’ll build a small Go API in a way that makes it observable from day zero, not after the first painful incident. We’ll use OpenTelemetry for traces and context propagation, Go’s slog for structured logs, and Prometheus for metrics, because the goal is not to collect more telemetry but to make production behaviour explainable.
If you want to get comfortable building production-grade REST APIs in Go first, start with our Go Bootcamp.
What observable means
Monitoring tells you whether an API is up; observability helps you explain strange behaviour you did not explicitly predict when you shipped it. For an API, that means being able to answer which requests are failing, how latency varies by endpoint or tenant, and what changed around the time things broke.
The key is that metrics, logs, and traces must reinforce each other. Metrics tell you that something moved, traces tell you where time went, and logs give you detailed request context.
The demo service
To make this a follow-along, we’ll build a tiny service with two endpoints:
GET /healthfor a quick health check.GET /users/{id}that simulates a downstream dependency call.
By the end, the service will:
Return a request ID in every response.
Emit structured logs for every request.
Expose Prometheus metrics on
/metrics.Create traces that show request flow and downstream latency.
That design-first approach is the same mindset behind Why your Architecture should start with Questions, not boxes: the important operational choices are made before production traffic shows up.
Step 1: Start with the API contract
Most teams treat observability as an implementation detail, but some of the highest-value decisions belong in the API contract. A standard response shape gives clients a predictable model and gives operators a reliable way to correlate failures across systems.
Use a single response envelope like this:
type APIResponse struct { Data any `json:"data,omitempty"` Error string `json:"error,omitempty"` ErrorCode string `json:"error_code,omitempty"` RequestID string `json:"request_id"` }
And a small helper to write responses consistently:
func writeJSON(w http.ResponseWriter, status int, resp APIResponse) { w.Header().Set("Content-Type", "application/json") w.WriteHeader(status) _ = json.NewEncoder(w).Encode(resp) }
A good rule here is to standardise three things early:
A machine-readable error code like
validation_errororupstream_timeout.A human-readable error string.
A request ID returned to the caller.
That one decision makes dashboards easier to group, support tickets easier to trace, and logs easier to search. If you already design APIs for service boundaries, this also pairs well with One2N’s Backend Engineering practice page for framing APIs as long-lived operational surfaces.
Checkpoint
At this stage, your API does not need OTEL or Prometheus yet. It should simply return a stable JSON shape for both success and error paths.
Step 2: Bootstrap tracing and metrics early
If tracing and metrics are initialised late or inconsistently, gaps appear exactly where you need clarity most. Set up the OTEL SDK, propagators, and metrics exposure before any handler starts serving traffic.
A simple OTEL tracer bootstrap in Go can look like this:
    package telemetry

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/propagation"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
        "go.opentelemetry.io/otel/sdk/resource"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
    )

    // InitTracerProvider wires an OTLP/HTTP exporter into a batching tracer
    // provider and registers it, with W3C propagators, as the global default.
    func InitTracerProvider(ctx context.Context, serviceName string) (*sdktrace.TracerProvider, error) {
        exporter, err := otlptracehttp.New(ctx)
        if err != nil {
            return nil, err
        }

        tp := sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(exporter),
            sdktrace.WithResource(resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceName(serviceName),
            )),
        )

        otel.SetTracerProvider(tp)
        otel.SetTextMapPropagator(
            propagation.NewCompositeTextMapPropagator(
                propagation.TraceContext{},
                propagation.Baggage{},
            ),
        )

        return tp, nil
    }
Expose Prometheus metrics on a separate endpoint:
    go func() {
        mux := http.NewServeMux()
        mux.Handle("/metrics", promhttp.Handler())
        if err := http.ListenAndServe(":9090", mux); err != nil {
            log.Fatal(err)
        }
    }()
Note that /metrics runs here on its own port, 9090, which you can keep internal if you do not want to expose metrics on the public API surface.
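One way to keep /metrics off the public surface is to give the public and internal listeners separate muxes. The sketch below uses only the standard library: stubMetrics stands in for promhttp.Handler(), and newPublicMux, newInternalMux, and statusFor are hypothetical helper names, not part of the service built in this post.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newPublicMux serves only client-facing routes.
func newPublicMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// newInternalMux serves operational endpoints only.
func newInternalMux(metrics http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	return mux
}

// stubMetrics is a placeholder for promhttp.Handler() so that this sketch
// needs no third-party modules.
func stubMetrics() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "# metrics placeholder")
	})
}

// statusFor reports the status code h returns for a GET request to path.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	public, internal := newPublicMux(), newInternalMux(stubMetrics())

	fmt.Println(statusFor(public, "/metrics"))   // 404: not on the public surface
	fmt.Println(statusFor(internal, "/metrics")) // 200
	fmt.Println(statusFor(public, "/health"))    // 200
}
```

In the running service, the public mux would be passed to the listener on :8080 and the internal one to the listener on :9090.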
Checkpoint
By now, your service should:
Start without handler changes.
Export traces to an OTLP-compatible backend.
Expose /metrics on port 9090.
Step 3: Assign and propagate request IDs
You cannot debug distributed systems without consistent request identity. Correlation IDs are what let one customer complaint become a searchable trail across logs, traces, and downstream calls.
A small middleware can attach a request ID to every request:
    package middleware

    import (
        "context"
        "net/http"

        "github.com/google/uuid"
    )

    type contextKey string

    const RequestIDKey contextKey = "request_id"

    // RequestID reuses the caller's X-Request-Id or generates one, stores it
    // in the request context, and echoes it back in the response header.
    func RequestID(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            requestID := r.Header.Get("X-Request-Id")
            if requestID == "" {
                requestID = uuid.NewString()
            }

            ctx := context.WithValue(r.Context(), RequestIDKey, requestID)
            w.Header().Set("X-Request-Id", requestID)

            next.ServeHTTP(w, r.WithContext(ctx))
        })
    }

    // GetRequestID returns the request ID stored in ctx, or "" if there is none.
    func GetRequestID(ctx context.Context) string {
        if v, ok := ctx.Value(RequestIDKey).(string); ok {
            return v
        }
        return ""
    }
Then use it in your handlers:
    func getUserHandler(w http.ResponseWriter, r *http.Request) {
        requestID := middleware.GetRequestID(r.Context())

        resp := APIResponse{
            Data: map[string]any{
                "id":   "123",
                "name": "Jane Doe",
            },
            RequestID: requestID,
        }

        writeJSON(w, http.StatusOK, resp)
    }
Make one request with curl, inspect the response headers, and confirm that the same request ID appears in both the response body and the X-Request-Id header.
Try it
curl -i http://localhost:8080/users/123
You should see:
X-Request-Id in the response headers.
request_id in the JSON body.
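The middleware covers inbound requests; propagating the ID to downstream calls needs an outbound counterpart. A minimal sketch, assuming the ID lives in the request context as above, is an http.RoundTripper that copies it into the X-Request-Id header. The names requestIDTransport, roundTripFunc, and forwardedID, and the downstream.internal URL, are hypothetical, and the demo uses a capturing transport instead of a real network call.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type contextKey string

// requestIDKey plays the role of middleware.RequestIDKey in this sketch.
const requestIDKey contextKey = "request_id"

// requestIDTransport copies the request ID from the outgoing request's
// context into the X-Request-Id header before delegating to base.
type requestIDTransport struct {
	base http.RoundTripper
}

func (t requestIDTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	if id, ok := r.Context().Value(requestIDKey).(string); ok && id != "" {
		// A RoundTripper must not modify the caller's request, so clone it first.
		r = r.Clone(r.Context())
		r.Header.Set("X-Request-Id", id)
	}
	return t.base.RoundTrip(r)
}

// roundTripFunc adapts a function to http.RoundTripper for the demo.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// forwardedID sends one request with id stored in its context and returns
// the X-Request-Id value the downstream service would have received.
func forwardedID(id string) (string, error) {
	var seen string
	capture := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get("X-Request-Id")
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: r}, nil
	})
	client := &http.Client{Transport: requestIDTransport{base: capture}}

	ctx := context.WithValue(context.Background(), requestIDKey, id)
	// The URL is a placeholder; the capturing transport never touches the network.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://downstream.internal/users/123", nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	id, err := forwardedID("req-42")
	fmt.Println(id, err)
}
```

In a real service the base transport would be http.DefaultTransport, and the handler would build outbound requests from r.Context() so the ID travels with them.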
Step 4: Make logs structured and useful
Text logs are fine when skimming one process locally, but they break down fast across services and incidents. Observable APIs need logs that are structured, queryable, and rich enough to explain a specific request.
Go’s slog is a good default:
    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    }))
Log with request context in handlers:
    func getUserHandler(logger *slog.Logger) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            requestID := middleware.GetRequestID(r.Context())
            userID := r.PathValue("id")

            logger.Info("fetching user",
                "request_id", requestID,
                "endpoint", "/users/{id}",
                "user_id", userID,
                "method", r.Method,
            )

            resp := APIResponse{
                Data: map[string]any{
                    "id":   userID,
                    "name": "Jane Doe",
                },
                RequestID: requestID,
            }

            writeJSON(w, http.StatusOK, resp)
        }
    }
The practical rule is simple:
Log in JSON.
Use consistent field names.
Add business context only when it helps explain behaviour.
Avoid two common mistakes:
Free-form log messages with the important values buried in strings.
Logging every tiny event with no consistent keys, which makes production search noisy and expensive.
Checkpoint
Make one request and verify that your logs include:
request_id
endpoint
method
user_id where relevant.
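Logging inside each handler works, but one way to guarantee a line for every request is a small access-log middleware. The sketch below is an assumption about how that could look, not part of the service above: accessLog, statusRecorder, and demoLogLine are hypothetical names, and the request ID is read from the X-Request-Id response header that the Step 3 middleware sets.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// accessLog emits one structured log line per request with stable keys.
func accessLog(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		logger.Info("request completed",
			"request_id", w.Header().Get("X-Request-Id"),
			"method", r.Method,
			"path", r.URL.Path,
			"status", rec.status,
			"duration_ms", time.Since(start).Milliseconds(),
		)
	})
}

// demoLogLine serves one request through accessLog and returns the decoded
// JSON log line.
func demoLogLine() (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))

	handler := accessLog(logger, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Stands in for the RequestID middleware from Step 3.
		w.Header().Set("X-Request-Id", "req-42")
		w.WriteHeader(http.StatusNotFound)
	}))
	handler.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/users/999", nil))

	var line map[string]any
	err := json.Unmarshal(buf.Bytes(), &line)
	return line, err
}

func main() {
	line, err := demoLogLine()
	fmt.Println(line["request_id"], line["method"], line["path"], line["status"], err)
}
```

The raw path is fine as a log field because logs tolerate high-cardinality values; the same value would be a poor metric label.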
Step 5: Add tracing at the boundaries
Metrics tell you that something is wrong; traces tell you where the time went. For APIs that call databases, caches, or third-party services, tracing stops being optional very quickly.
Start a span in the handler:
    // Needs go.opentelemetry.io/otel and its attribute and codes packages.
    var tracer = otel.Tracer("users-api")

    func getUserHandler(logger *slog.Logger) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            ctx, span := tracer.Start(r.Context(), "getUserHandler")
            defer span.End()

            userID := r.PathValue("id")
            requestID := middleware.GetRequestID(ctx)

            span.SetAttributes(
                attribute.String("http.method", r.Method),
                attribute.String("http.route", "/users/{id}"),
                attribute.String("app.request_id", requestID),
                attribute.String("app.user_id", userID),
            )

            user, err := fetchUserFromDownstream(ctx, userID)
            if err != nil {
                span.RecordError(err)
                span.SetStatus(codes.Error, "downstream failure")
                // Log the failure with the same keys as Step 4 so it can be
                // found by request ID.
                logger.Error("failed to fetch user",
                    "request_id", requestID,
                    "user_id", userID,
                    "error", err,
                )
                writeJSON(w, http.StatusBadGateway, APIResponse{
                    Error:     "failed to fetch user",
                    ErrorCode: "upstream_timeout",
                    RequestID: requestID,
                })
                return
            }

            writeJSON(w, http.StatusOK, APIResponse{
                Data:      user,
                RequestID: requestID,
            })
        }
    }
Wrap the downstream boundary too:
    func fetchUserFromDownstream(ctx context.Context, userID string) (map[string]any, error) {
        // ctx now carries the child span; a real client call would receive it.
        ctx, span := tracer.Start(ctx, "fetchUserFromDownstream")
        defer span.End()

        // Simulated downstream latency.
        time.Sleep(120 * time.Millisecond)

        return map[string]any{
            "id":   userID,
            "name": "Jane Doe",
        }, nil
    }
The key idea is to instrument boundaries, not every tiny function. Handlers, external calls, DB queries, and cache lookups are usually enough to make traces useful without turning them noisy.
Checkpoint
By now, one request to /users/123 should produce:
A parent span for the handler.
A child span for the downstream fetch.
Shared request context between the trace and logs.
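What ties spans together across services is the W3C traceparent header that the propagation.TraceContext{} propagator from Step 2 reads and writes. With the OTel SDK you would normally take the trace ID from the span context in ctx rather than parse the header yourself; the standard-library sketch below only illustrates what travels on the wire and which value is worth adding to logs. It is simplified to version 00 headers, and traceIDFromTraceparent and isLowerHex are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// isLowerHex reports whether s consists only of lowercase hex digits.
func isLowerHex(s string) bool {
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// traceIDFromTraceparent extracts the trace ID from a version 00 traceparent
// header of the form "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>".
func traceIDFromTraceparent(header string) (string, bool) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", false
	}
	traceID, parentID, flags := parts[1], parts[2], parts[3]
	if len(traceID) != 32 || len(parentID) != 16 || len(flags) != 2 {
		return "", false
	}
	if !isLowerHex(traceID) || !isLowerHex(parentID) || !isLowerHex(flags) {
		return "", false
	}
	// An all-zero trace ID is defined as invalid.
	if strings.Trim(traceID, "0") == "" {
		return "", false
	}
	return traceID, true
}

func main() {
	// Sample header value for illustration.
	id, ok := traceIDFromTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(id, ok)
}
```

Logging this value as a trace_id field next to request_id lets you jump from a log line straight to the matching trace.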
Step 6: Expose the right metrics
APIs produce a lot of numbers, but only a few matter every day. Start with SLI-shaped metrics: latency, throughput, and error rate.
A simple request counter and latency histogram in Prometheus might look like this:
    // Uses github.com/prometheus/client_golang/prometheus and its promauto package.
    var (
        httpRequestsTotal = promauto.NewCounterVec(
            prometheus.CounterOpts{
                Name: "http_requests_total",
                Help: "Total number of HTTP requests",
            },
            []string{"method", "route", "status"},
        )

        httpRequestDuration = promauto.NewHistogramVec(
            prometheus.HistogramOpts{
                Name:    "http_request_duration_seconds",
                Help:    "HTTP request latency",
                Buckets: prometheus.DefBuckets,
            },
            []string{"method", "route", "status"},
        )
    )
And wrap handlers with instrumentation:
    // instrument records a count and a latency observation for every request,
    // labelled with the route template rather than the raw path.
    func instrument(route string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
            rw := &statusRecorder{ResponseWriter: w, statusCode: http.StatusOK}

            next.ServeHTTP(rw, r)

            status := strconv.Itoa(rw.statusCode)
            duration := time.Since(start).Seconds()

            httpRequestsTotal.WithLabelValues(r.Method, route, status).Inc()
            httpRequestDuration.WithLabelValues(r.Method, route, status).Observe(duration)
        })
    }

    // statusRecorder captures the status code; it defaults to 200 because
    // handlers may write a body without calling WriteHeader.
    type statusRecorder struct {
        http.ResponseWriter
        statusCode int
    }

    func (r *statusRecorder) WriteHeader(statusCode int) {
        r.statusCode = statusCode
        r.ResponseWriter.WriteHeader(statusCode)
    }
A useful note here is what not to label:
Do not label metrics with raw user_id or request_id, because that creates high-cardinality metrics that become hard to store and query efficiently.
Prefer stable, low-cardinality dimensions such as route, method, status, tenant tier, or region when needed.
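To make the cardinality point concrete, here is a small standard-library sketch that counts how many distinct label sets, and therefore time series, a counter would end up with for 1,000 requests to the users endpoint. seriesCount and simulate are hypothetical helpers written for this illustration; they do not use the Prometheus client.

```go
package main

import "fmt"

// seriesCount returns how many distinct label sets appear in labelSets.
// Each distinct {method, route, status} combination becomes its own series.
func seriesCount(labelSets [][3]string) int {
	seen := make(map[[3]string]struct{})
	for _, ls := range labelSets {
		seen[ls] = struct{}{}
	}
	return len(seen)
}

// simulate builds the label sets for n successful GET requests, each for a
// different user. With rawPath set, the route label carries the concrete
// path instead of the route template.
func simulate(n int, rawPath bool) [][3]string {
	sets := make([][3]string, 0, n)
	for i := 0; i < n; i++ {
		route := "/users/{id}"
		if rawPath {
			route = fmt.Sprintf("/users/%d", i)
		}
		sets = append(sets, [3]string{"GET", route, "200"})
	}
	return sets
}

func main() {
	fmt.Println(seriesCount(simulate(1000, true)))  // 1000: one series per user
	fmt.Println(seriesCount(simulate(1000, false))) // 1: one series for the route
}
```

A histogram multiplies that count again by its number of buckets, which is why the instrument wrapper takes the route template as an argument instead of reading r.URL.Path.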
These metrics become most useful when tied to release and reliability workflows, which is where One2N’s CI/CD page and SRE Bootcamp pick up.
Checkpoint
Hit the API a few times and then open /metrics. You should be able to find:
http_requests_total
http_request_duration_seconds.
Step 7: Break it on purpose
A follow-along post becomes more useful when readers can observe a failure, not just a healthy request path. So let’s simulate a flaky downstream dependency.
Change the downstream function:
    func fetchUserFromDownstream(ctx context.Context, userID string) (map[string]any, error) {
        // ctx now carries the child span; a real client call would receive it.
        ctx, span := tracer.Start(ctx, "fetchUserFromDownstream")
        defer span.End()

        // User ID "500" triggers the simulated failure.
        if userID == "500" {
            err := errors.New("downstream timeout")
            span.RecordError(err)
            span.SetStatus(codes.Error, err.Error())
            time.Sleep(800 * time.Millisecond)
            return nil, err
        }

        time.Sleep(120 * time.Millisecond)

        return map[string]any{
            "id":   userID,
            "name": "Jane Doe",
        }, nil
    }
Now call:
curl -i http://localhost:8080/users/500
You should now see the three signals line up:
The response returns an error payload with error_code=upstream_timeout and a request_id.
The logs show the same request ID and an error path.
The trace shows the failing downstream span and the extra latency.
The metrics reflect a slower request and a 502 response.
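The magic user ID keeps the demo simple. A closer approximation of a real timeout, sketched here under the assumption that the downstream call honours context cancellation, is to give the call a time budget and map context.DeadlineExceeded to the upstream_timeout code. fetchUser, errorCode, callWithBudget, and the upstream_error code are hypothetical names for this illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchUser simulates a downstream call that takes delay to answer but
// gives up as soon as ctx is cancelled or its deadline passes.
func fetchUser(ctx context.Context, userID string, delay time.Duration) (map[string]any, error) {
	timer := time.NewTimer(delay)
	defer timer.Stop()

	select {
	case <-timer.C:
		return map[string]any{"id": userID, "name": "Jane Doe"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// errorCode maps an error to the machine-readable code used in the envelope.
// "upstream_error" is a made-up code for failures that are not timeouts.
func errorCode(err error) string {
	switch {
	case err == nil:
		return ""
	case errors.Is(err, context.DeadlineExceeded):
		return "upstream_timeout"
	default:
		return "upstream_error"
	}
}

// callWithBudget gives the downstream call a time budget and returns the
// error code the handler would put in the response.
func callWithBudget(budget, delay time.Duration) string {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	_, err := fetchUser(ctx, "123", delay)
	return errorCode(err)
}

func main() {
	fmt.Printf("%q\n", callWithBudget(50*time.Millisecond, 500*time.Millisecond)) // "upstream_timeout"
	fmt.Printf("%q\n", callWithBudget(500*time.Millisecond, 5*time.Millisecond))  // ""
}
```

In the handler, the budget would be derived from r.Context() so that a client disconnect also cancels the downstream call.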
For a real-world example, One2N’s How one parameter stalled Postgres Replication with Debezium for 3 Weeks is a good companion read because it shows why production debugging gets expensive when systems are operationally opaque.
Step 8: Make observability part of delivery
If observability only appears at the end of implementation, it will always feel like overhead. The better pattern is to make it part of how endpoints are shipped.
A practical team checklist looks like this:
Every new endpoint returns a request ID.
Every handler emits structured logs with stable keys.
Every external dependency call has a span around it.
Every critical path exposes latency and error metrics.
Dashboards and alerts live in version control alongside code where possible.
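Parts of this checklist can be enforced mechanically. As a rough sketch, assuming every endpoint follows the envelope from Step 1, the helper below walks a list of routes and reports any whose response lacks a request ID; in practice the same logic would live in a Go test that runs in CI. missingRequestIDs and demoMux are hypothetical names, and demoMux stands in for the real router.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// missingRequestIDs requests each path from h and returns the paths whose
// response lacks an X-Request-Id header or a matching request_id in the body.
func missingRequestIDs(h http.Handler, paths []string) []string {
	var failing []string
	for _, p := range paths {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, p, nil))

		var body struct {
			RequestID string `json:"request_id"`
		}
		header := rec.Result().Header.Get("X-Request-Id")
		err := json.Unmarshal(rec.Body.Bytes(), &body)
		if err != nil || header == "" || body.RequestID != header {
			failing = append(failing, p)
		}
	}
	return failing
}

// demoMux is a stand-in router: /good follows the contract, /bad does not.
func demoMux() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/good", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Request-Id", "req-1")
		fmt.Fprint(w, `{"request_id":"req-1"}`)
	})
	mux.HandleFunc("/bad", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"data":{}}`)
	})
	return mux
}

func main() {
	fmt.Println(missingRequestIDs(demoMux(), []string{"/good", "/bad"})) // [/bad]
}
```

A check like this fails the build when a new endpoint forgets the contract, which is cheaper than discovering the gap during an incident.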
That “observability as code” mindset lines up well with One2N’s Gitops for Kafka in the real world, where operational control improves when systems are expressed explicitly rather than left implicit.
What you should have now
At this point, your small Go API should do five useful things:
Return consistent response envelopes.
Attach and propagate request IDs.
Emit structured JSON logs.
Expose Prometheus metrics.
Create traces that explain downstream latency and failure paths.
That does not make the service perfect. It does make it far easier to operate, debug, and evolve under real traffic.
Before you ship
Before you ship the next API or feature, check for these basics:
Every request has a correlation ID that flows across services and appears in responses.
Logs are structured, centralised, and enriched with useful request context.
Key SLIs such as latency, throughput, and error rate exist for critical endpoints.
Traces show boundary spans for downstream calls and capture failure states.
Metrics, alerts, and dashboard assumptions are treated as part of delivery, not post-release cleanup.
That is what “observable from day zero” really means in practice.
Teams rarely set out to build black-box APIs, yet that is exactly how many of them behave in production. When latency spikes or errors appear, teams often know something is wrong but still cannot explain why quickly enough.
In this post, we’ll build a small Go API in a way that makes it observable from day zero, not after the first painful incident. We’ll use OpenTelemetry for traces and context propagation, Go’s slog for structured logs, and Prometheus for metrics, because the goal is not to collect more telemetry but to make production behaviour explainable.
If you want to get comfortable building production-grade REST APIs in Go first, start with our Go Bootcamp.
What observable means
Monitoring tells you whether an API is up; observability helps you explain strange behaviour you did not explicitly predict when you shipped it. For an API, that means being able to answer which requests are failing, how latency varies by endpoint or tenant, and what changed around the time things broke.
The key is that metrics, logs, and traces must reinforce each other. Metrics tell you that something moved, traces tell you where time went, and logs give you detailed request context.
The demo service
To make this a follow-along, we’ll build a tiny service with two endpoints:
GET /healthfor a quick health check.GET /users/{id}that simulates a downstream dependency call.
By the end, the service will:
Return a request ID in every response.
Emit structured logs for every request.
Expose Prometheus metrics on
/metrics.Create traces that show request flow and downstream latency.
That design-first approach is the same mindset behind Why your Architecture should start with Questions, not boxes: the important operational choices are made before production traffic shows up.
Step 1: Start with the API contract
Most teams treat observability as an implementation detail, but some of the highest-value decisions belong in the API contract. A standard response shape gives clients a predictable model and gives operators a reliable way to correlate failures across systems.
Use a single response envelope like this:
type APIResponse struct { Data any `json:"data,omitempty"` Error string `json:"error,omitempty"` ErrorCode string `json:"error_code,omitempty"` RequestID string `json:"request_id"` }
And a small helper to write responses consistently:
func writeJSON(w http.ResponseWriter, status int, resp APIResponse) { w.Header().Set("Content-Type", "application/json") w.WriteHeader(status) _ = json.NewEncoder(w).Encode(resp) }
A good rule here is to standardise three things early:
A machine-readable error code like
validation_errororupstream_timeout.A human-readable error string.
A request ID returned to the caller.
That one decision makes dashboards easier to group, support tickets easier to trace, and logs easier to search. If you already design APIs for service boundaries, this also pairs well with One2N’s Backend Engineering practice page for framing APIs as long-lived operational surfaces.
Checkpoint
At this stage, your API does not need OTEL or Prometheus yet. It should simply return a stable JSON shape for both success and error paths.
Step 2: Bootstrap tracing and metrics early
If tracing and metrics are initialised late or inconsistently, gaps appear exactly where you need clarity most. Set up the OTEL SDK, propagators, and metrics exposure before any handler starts serving traffic.
A simple OTEL tracer bootstrap in Go can look like this:
package telemetry import ( "context" "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/propagation" "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp" "go.opentelemetry.io/otel/sdk/resource" sdktrace "go.opentelemetry.io/otel/sdk/trace" semconv "go.opentelemetry.io/otel/semconv/v1.26.0" ) func InitTracerProvider(ctx context.Context, serviceName string) (*sdktrace.TracerProvider, error) { exporter, err := otlptracehttp.New(ctx) if err != nil { return nil, err } tp := sdktrace.NewTracerProvider( sdktrace.WithBatcher(exporter), sdktrace.WithResource(resource.NewWithAttributes( semconv.SchemaURL, semconv.ServiceName(serviceName), )), ) otel.SetTracerProvider(tp) otel.SetTextMapPropagator( propagation.NewCompositeTextMapPropagator( propagation.TraceContext{}, propagation.Baggage{}, ), ) return tp, nil }
Expose Prometheus metrics on a separate endpoint:
go func() { mux := http.NewServeMux() mux.Handle("/metrics", promhttp.Handler()) if err := http.ListenAndServe(":9090", mux); err != nil { log.Fatal(err) } }()
This is a good place to add a note in the live post that /metrics can run on a separate internal port if you do not want to expose it on the public API surface.
Checkpoint
By now, your service should:
Start without handler changes.
Export traces to an OTLP-compatible backend.
Expose
/metricson port9090.
Step 3: Assign and propagate request IDs
You cannot debug distributed systems without consistent request identity. Correlation IDs are what let one customer complaint become a searchable trail across logs, traces, and downstream calls.
A small middleware can attach a request ID to every request:
package middleware import ( "context" "net/http" "github.com/google/uuid" ) type contextKey string const RequestIDKey contextKey = "request_id" func RequestID(next http.Handler) http.Handler { return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { requestID := r.Header.Get("X-Request-Id") if requestID == "" { requestID = uuid.NewString() } ctx := context.WithValue(r.Context(), RequestIDKey, requestID) w.Header().Set("X-Request-Id", requestID) next.ServeHTTP(w, r.WithContext(ctx)) }) } func GetRequestID(ctx context.Context) string { if v, ok := ctx.Value(RequestIDKey).(string); ok { return v } return "" }
Then use it in your handlers:
func getUserHandler(w http.ResponseWriter, r *http.Request) { requestID := middleware.GetRequestID(r.Context()) resp := APIResponse{ Data: map[string]any{ "id": "123", "name": "Jane Doe", }, RequestID: requestID, } writeJSON(w, http.StatusOK, resp) }
This is the moment where the article should feel interactive. Make one request with curl, inspect the response headers, and confirm that the same request ID appears in both the response body and the X-Request-Id header.
Try it
curl -i http://localhost:8080/users/123
You should see:
X-Request-Idin the response headers.request_idin the JSON body.
Step 4: Make logs structured and useful
Text logs are fine when skimming one process locally, but they break down fast across services and incidents. Observable APIs need logs that are structured, queryable, and rich enough to explain a specific request.
Go’s slog is a good default:
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
	Level: slog.LevelInfo,
}))
Log with request context in handlers:
func getUserHandler(logger *slog.Logger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		requestID := middleware.GetRequestID(r.Context())
		userID := r.PathValue("id")

		logger.Info("fetching user",
			"request_id", requestID,
			"endpoint", "/users/{id}",
			"user_id", userID,
			"method", r.Method,
		)

		resp := APIResponse{
			Data: map[string]any{
				"id":   userID,
				"name": "Jane Doe",
			},
			RequestID: requestID,
		}
		writeJSON(w, http.StatusOK, resp)
	}
}
The practical rule is simple:
Log in JSON.
Use consistent field names.
Add business context only when it helps explain behaviour.
Avoid two common mistakes:
Free-form log messages with the important values buried in strings.
Logging every tiny event with no consistent keys, which makes production search noisy and expensive.
Checkpoint
Make one request and verify that your logs include:
request_id
endpoint
method
user_id where relevant.
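Passing request_id by hand at every log call works, but it is easy to forget or misspell. One way to keep the key consistent is to wrap the slog handler so it reads the ID from the context. This is a sketch rather than part of the service above: contextHandler and loggedRequestID are hypothetical names, and the context key is redefined locally so the example is self-contained.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// contextKey and requestIDKey mirror the middleware's key (assumption).
type contextKey string

const requestIDKey contextKey = "request_id"

// contextHandler wraps another slog.Handler and adds request_id from the
// context to every record, so handlers cannot forget or rename the key.
type contextHandler struct {
	slog.Handler
}

func (h contextHandler) Handle(ctx context.Context, rec slog.Record) error {
	if id, ok := ctx.Value(requestIDKey).(string); ok && id != "" {
		rec.AddAttrs(slog.String("request_id", id))
	}
	return h.Handler.Handle(ctx, rec)
}

// WithAttrs and WithGroup re-wrap the result so derived loggers keep the
// behaviour. Note that inside a group the attribute is nested under it.
func (h contextHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return contextHandler{h.Handler.WithAttrs(attrs)}
}

func (h contextHandler) WithGroup(name string) slog.Handler {
	return contextHandler{h.Handler.WithGroup(name)}
}

// loggedRequestID logs one line with the given ID in the context and
// returns the request_id field found in the JSON output.
func loggedRequestID(id string) (string, error) {
	var buf bytes.Buffer
	logger := slog.New(contextHandler{slog.NewJSONHandler(&buf, nil)})

	ctx := context.WithValue(context.Background(), requestIDKey, id)
	logger.InfoContext(ctx, "fetching user", "user_id", "123")

	var entry map[string]any
	if err := json.Unmarshal(buf.Bytes(), &entry); err != nil {
		return "", err
	}
	got, _ := entry["request_id"].(string)
	return got, nil
}

func main() {
	got, err := loggedRequestID("req-123")
	fmt.Println(got, err)
}
```

The trade-off is that callers must use the context-aware methods (InfoContext, ErrorContext, and so on); a plain logger.Info call has no request context to read from.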
Step 5: Add tracing at the boundaries
Metrics tell you that something is wrong; traces tell you where the time went. For APIs that call databases, caches, or third-party services, tracing stops being optional very quickly.
Start a span in the handler:
var tracer = otel.Tracer("users-api")

func getUserHandler(logger *slog.Logger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, span := tracer.Start(r.Context(), "getUserHandler")
		defer span.End()

		userID := r.PathValue("id")
		requestID := middleware.GetRequestID(ctx)

		span.SetAttributes(
			attribute.String("http.method", r.Method),
			attribute.String("http.route", "/users/{id}"),
			attribute.String("app.request_id", requestID),
			attribute.String("app.user_id", userID),
		)

		user, err := fetchUserFromDownstream(ctx, userID)
		if err != nil {
			span.RecordError(err)
			span.SetStatus(codes.Error, "downstream failure")
			writeJSON(w, http.StatusBadGateway, APIResponse{
				Error:     "failed to fetch user",
				ErrorCode: "upstream_timeout",
				RequestID: requestID,
			})
			return
		}

		writeJSON(w, http.StatusOK, APIResponse{
			Data:      user,
			RequestID: requestID,
		})
	}
}
Wrap the downstream boundary too:
func fetchUserFromDownstream(ctx context.Context, userID string) (map[string]any, error) {
	ctx, span := tracer.Start(ctx, "fetchUserFromDownstream")
	defer span.End()

	time.Sleep(120 * time.Millisecond)

	return map[string]any{
		"id":   userID,
		"name": "Jane Doe",
	}, nil
}
The key idea is to instrument boundaries, not every tiny function. Handlers, external calls, DB queries, and cache lookups are usually enough to make traces useful without turning them noisy.
Checkpoint
By now, one request to /users/123 should produce:
A parent span for the handler.
A child span for the downstream fetch.
Shared request context between the trace and logs.
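One detail in the simulated downstream is worth noting: time.Sleep ignores the context, so a cancelled or timed-out request still waits out the full delay. A real client call would normally honour ctx. The sketch below shows one way to simulate that with the standard library only; sleepCtx, fetchWithBudget, and the two demo helpers are hypothetical names, not part of the service above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// sleepCtx waits for d, or returns early with ctx.Err() if ctx is done first.
func sleepCtx(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()

	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// fetchWithBudget simulates a downstream call that takes latency to answer
// while the caller is only willing to wait for budget.
func fetchWithBudget(parent context.Context, latency, budget time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	return sleepCtx(ctx, latency)
}

// slowCallTimesOut reports whether a 200ms call under a 20ms budget ends
// with context.DeadlineExceeded.
func slowCallTimesOut() bool {
	err := fetchWithBudget(context.Background(), 200*time.Millisecond, 20*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

// fastCallSucceeds reports whether a 5ms call under a 500ms budget completes.
func fastCallSucceeds() bool {
	return fetchWithBudget(context.Background(), 5*time.Millisecond, 500*time.Millisecond) == nil
}

func main() {
	fmt.Println("slow call timed out:", slowCallTimesOut())
	fmt.Println("fast call succeeded:", fastCallSucceeds())
}
```

With this shape, the downstream span ends when the caller gives up, so the span duration reflects the budget rather than the dependency's full latency.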
Step 6: Expose the right metrics
APIs produce a lot of numbers, but only a few matter every day. Start with SLI-shaped metrics: latency, throughput, and error rate.
A simple request counter and latency histogram in Prometheus might look like this:
var (
	httpRequestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "route", "status"},
	)

	httpRequestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "route", "status"},
	)
)
And wrap handlers with instrumentation:
func instrument(route string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rw := &statusRecorder{ResponseWriter: w, statusCode: http.StatusOK}

		next.ServeHTTP(rw, r)

		status := strconv.Itoa(rw.statusCode)
		duration := time.Since(start).Seconds()

		httpRequestsTotal.WithLabelValues(r.Method, route, status).Inc()
		httpRequestDuration.WithLabelValues(r.Method, route, status).Observe(duration)
	})
}

type statusRecorder struct {
	http.ResponseWriter
	statusCode int
}

func (r *statusRecorder) WriteHeader(statusCode int) {
	r.statusCode = statusCode
	r.ResponseWriter.WriteHeader(statusCode)
}
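One side effect of wrapping the ResponseWriter is that optional interfaces on the underlying writer, such as http.Flusher, are no longer visible through the wrapper. If handlers ever need them, a small optional addition is an Unwrap method, which http.ResponseController (Go 1.20+) uses to reach the wrapped writer. The sketch below repeats statusRecorder so it compiles on its own; flushThroughWrapper is a hypothetical demo helper.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusRecorder is repeated from the instrumentation snippet.
type statusRecorder struct {
	http.ResponseWriter
	statusCode int
}

func (r *statusRecorder) WriteHeader(statusCode int) {
	r.statusCode = statusCode
	r.ResponseWriter.WriteHeader(statusCode)
}

// Unwrap exposes the wrapped writer so http.ResponseController can find
// optional capabilities (Flush, deadlines, Hijack) on it.
func (r *statusRecorder) Unwrap() http.ResponseWriter {
	return r.ResponseWriter
}

// flushThroughWrapper writes a status through the wrapper, then flushes via
// http.ResponseController. It returns whether the underlying recorder was
// flushed, the status the wrapper captured, and any error from Flush.
func flushThroughWrapper() (flushed bool, status int, err error) {
	rec := httptest.NewRecorder()
	rw := &statusRecorder{ResponseWriter: rec, statusCode: http.StatusOK}

	rw.WriteHeader(http.StatusAccepted)
	err = http.NewResponseController(rw).Flush()

	return rec.Flushed, rw.statusCode, err
}

func main() {
	flushed, status, err := flushThroughWrapper()
	fmt.Println(flushed, status, err)
}
```

Handlers that stream responses would then call http.NewResponseController(w).Flush() instead of type-asserting w to http.Flusher.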
Be just as deliberate about what not to label:
Do not label metrics with raw user_id or request_id, because that creates high-cardinality metrics that become hard to store and query efficiently.
Prefer stable, low-cardinality dimensions such as route, method, status, tenant tier, or region when needed.
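The same reasoning applies to label values that look safe but are client-controlled. The HTTP method, for example, is an arbitrary token sent by the caller. A minimal sketch of two helpers that bound label values before they reach a metric; methodLabel and statusClass are hypothetical names, and collapsing status codes into classes is an optional trade-off, not something the instrumentation above does.

```go
package main

import (
	"fmt"
	"net/http"
)

// methodLabel maps the client-supplied method onto a fixed set of values,
// so unusual or malformed methods cannot create new time series.
func methodLabel(method string) string {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPost, http.MethodPut,
		http.MethodPatch, http.MethodDelete, http.MethodOptions:
		return method
	default:
		return "OTHER"
	}
}

// statusClass collapses a status code into its class, for example 502 into
// "5xx". Codes outside 100-599 are reported as "unknown".
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

func main() {
	fmt.Println(methodLabel("GET"), methodLabel("BREW"))
	fmt.Println(statusClass(200), statusClass(502), statusClass(42))
}
```

Exact status codes are still a bounded set, so keeping them as a label is reasonable; the class form mainly helps when route and tenant dimensions already multiply the series count.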
These metrics become most useful when tied to release and reliability workflows; One2N’s CI/CD page and SRE Bootcamp cover that side in more depth.
Checkpoint
Hit the API a few times and then open /metrics. You should be able to find:
http_requests_total
http_request_duration_seconds.
Step 7: Break it on purpose
A walkthrough is more useful when you can observe a failure, not just a healthy request path. So let’s simulate a flaky downstream dependency.
Change the downstream function:
func fetchUserFromDownstream(ctx context.Context, userID string) (map[string]any, error) {
	ctx, span := tracer.Start(ctx, "fetchUserFromDownstream")
	defer span.End()

	if userID == "500" {
		err := errors.New("downstream timeout")
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		time.Sleep(800 * time.Millisecond)
		return nil, err
	}

	time.Sleep(120 * time.Millisecond)

	return map[string]any{
		"id":   userID,
		"name": "Jane Doe",
	}, nil
}
Now call:
curl -i http://localhost:8080/users/500
You should now see the three signals line up:
The response returns an error payload with error_code=upstream_timeout and a request_id.
The logs show the same request ID and an error path.
The trace shows the failing downstream span and the extra latency.
The metrics reflect a slower request and a 502 response.
For a real-world example, One2N’s How one parameter stalled Postgres Replication with Debezium for 3 Weeks is a good companion read: it shows why production debugging gets expensive when systems are operationally opaque.
Step 8: Make observability part of delivery
If observability only appears at the end of implementation, it will always feel like overhead. The better pattern is to make it part of how endpoints are shipped.
A practical team checklist looks like this:
Every new endpoint returns a request ID.
Every handler emits structured logs with stable keys.
Every external dependency call has a span around it.
Every critical path exposes latency and error metrics.
Dashboards and alerts live in version control alongside code where possible.
That “observability as code” mindset lines up well with One2N’s Gitops for Kafka in the real world, where operational control improves when systems are expressed explicitly rather than left implicit.
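One way to make the checklist the default path is to register every endpoint through a single helper that applies the shared middleware. A minimal sketch, assuming a conventional func(http.Handler) http.Handler middleware shape: middleware, chain, tag, and callOrder are hypothetical names, and tag stands in for the real RequestID and instrument wrappers so the example runs without extra dependencies.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// middleware is the usual handler-wrapping function shape.
type middleware func(http.Handler) http.Handler

// chain wraps h so that the first middleware listed runs first (outermost).
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag is a stand-in middleware that records its name when it runs.
func tag(name string, order *[]string) middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*order = append(*order, name)
			next.ServeHTTP(w, r)
		})
	}
}

// callOrder sends one in-memory request through the chain and returns the
// order in which each layer ran.
func callOrder() []string {
	var order []string
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		order = append(order, "handler")
	})

	h := chain(handler,
		tag("request_id", &order), // outermost: everything below sees the ID
		tag("metrics", &order),
	)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/users/123", nil))
	return order
}

func main() {
	fmt.Println(callOrder())
}
```

In the service above, the same helper would take middleware.RequestID plus a small closure around instrument(route, next), so a new endpoint cannot be registered without a request ID and metrics.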
What you should have now
At this point, your small Go API should do five useful things:
Return consistent response envelopes.
Attach and propagate request IDs.
Emit structured JSON logs.
Expose Prometheus metrics.
Create traces that explain downstream latency and failure paths.
That does not make the service perfect. It does make it far easier to operate, debug, and evolve under real traffic.
Before you ship
Before you ship the next API or feature, check for these basics:
Every request has a correlation ID that flows across services and appears in responses.
Logs are structured, centralised, and enriched with useful request context.
Key SLIs such as latency, throughput, and error rate exist for critical endpoints.
Traces show boundary spans for downstream calls and capture failure states.
Metrics, alerts, and dashboard assumptions are treated as part of delivery, not post-release cleanup.
That is what “observable from day zero” really means in practice.
Keywords
Observability, Go, OpenTelemetry, Prometheus, API











