How we cut analytics costs by 67% by migrating from AWS Glue to self-hosted ClickHouse for a fintech processing 30K transactions daily.

Context.

The customer is a fintech company in the digital payments space. They process around 30K transactions daily with multiple terabytes of data across their analytics infrastructure.

Their analytics platform was built entirely on AWS managed services: AWS Glue for ETL, Amazon Athena for querying, and AWS DMS for CDC replication from PostgreSQL. The platform worked fine initially. It was reliable and easy to manage. But as data volumes grew, monthly costs crossed $13,000 and were heading toward $35,000 with projected 3x growth. The analytics solution no longer justified its cost to the business. One2N helped them migrate to a self-hosted open-source stack and cut their costs by 80% while maintaining production-grade reliability.

Problem Statement.

Escalating infrastructure costs: Analytics spend exceeding $13,000/month with no optimization levers available in fully managed services.

Vendor lock-in: Entire pipeline built on AWS-specific services, limiting negotiating power and strategic flexibility.

Scalability concerns: Projected 3x volume growth would push monthly costs beyond $35,000, unsustainable for unit economics.

No operational control: No way to tune compression, storage, or resource allocation with fully managed services.

Outcome/Impact.

67% Cost reduction

$40K Annual Savings

99.9%+ Uptime

4.2s P95 Query Latency

Cost savings: Monthly spend dropped from $13,000 to $2,000. That's $11,000/month, or roughly $132K a year.

Production-ready reliability: 1-shard, 2-replica ClickHouse cluster with 99.9%+ availability across availability zones.

Query performance maintained: P95 latency of 4.2 seconds, on par with what they had on Athena.

Full operational control: The team now owns compression algorithms, indexing strategies, and resource allocation.

No more lock-in: The entire stack is open-source (ClickHouse, PeerDB, Prefect, dbt), giving the customer the flexibility to move anywhere.

Solution.

The original setup had DMS pulling CDC from PostgreSQL, dumping to S3, Glue processing the data, Athena for queries. Reliable, but expensive at scale.

We broke the migration into four phases, tackling the biggest cost drivers first.

Phase 1: Replace AWS DMS with PeerDB (~$1,500 → ~$30/month)

DMS was eating $1,500/month just for CDC replication. We swapped it for PeerDB, an open-source CDC tool built specifically for PostgreSQL → ClickHouse pipelines. Same 30-second sync intervals, 98% cost reduction. Sometimes the specialized tool just wins.
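To make the swap concrete, here's a minimal sketch of what a PeerDB mirror setup can look like. PeerDB exposes a Postgres-compatible SQL endpoint, so peers and mirrors are created with plain SQL statements. The hosts, credentials, port, and option names below (initial snapshot, sync interval) are illustrative assumptions, not the customer's configuration; check the PeerDB docs for the current syntax.

```python
# Sketch: create a PostgreSQL -> ClickHouse CDC mirror through PeerDB's
# Postgres-compatible SQL endpoint. Hosts, credentials, port, and option
# names are placeholders/assumptions; consult the PeerDB docs for exact syntax.
import psycopg2

PEERDB_DSN = "host=peerdb.internal port=9900 user=peerdb password=***"  # placeholder endpoint

statements = [
    # Source peer: the production PostgreSQL database.
    """CREATE PEER pg_source FROM POSTGRES WITH (
           host = 'pg.internal', port = 5432,
           user = 'replicator', password = '***', database = 'payments'
       )""",
    # Target peer: the self-hosted ClickHouse cluster.
    """CREATE PEER ch_target FROM CLICKHOUSE WITH (
           host = 'clickhouse.internal', port = 9000,
           user = 'default', password = '***', database = 'analytics'
       )""",
    # CDC mirror with an initial snapshot and ~30-second sync cadence.
    """CREATE MIRROR txn_cdc FROM pg_source TO ch_target
       WITH TABLE MAPPING (public.transactions:transactions)
       WITH (do_initial_snapshot = true, sync_interval = 30)""",
]

with psycopg2.connect(PEERDB_DSN) as conn:
    conn.autocommit = True
    with conn.cursor() as cur:
        for sql in statements:
            cur.execute(sql)
```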

Phase 2: Build the ClickHouse cluster (~$4,000 → ~$1,600/month)

For the analytical database, we went with ClickHouse. The key decision here was keeping it simple: 1 shard, 2 replicas. No distributed query complexity.

The setup:

  • 2× EC2 r6g.4xlarge (16 vCPUs, 128GB RAM) on Graviton for better price-performance

  • 3× EC2 t4g.medium for ClickHouse Keeper coordination across AZs

  • EBS gp3 starting at 250GB, scaling to 1.5TB per node

We ran compression tests on their actual data—ZSTD gave us 5.4x average compression. Their projected 27TB of raw data fits comfortably in ~5TB compressed. Single-node territory for years. No need to over-engineer with sharding yet.
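To illustrate both decisions, here's a rough sketch using the clickhouse-connect Python client: a replicated table with a ZSTD codec, and the system.parts query we rely on to measure compression ratios. The cluster name, schema, macros, and ZSTD level are placeholders, not the customer's actual setup.

```python
# Sketch: one replicated table (1 shard, 2 replicas) with ZSTD compression,
# plus a query against system.parts to check the achieved compression ratio.
# Cluster name, table/column names, Keeper path macros, and the ZSTD level
# are placeholders for illustration.
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.internal", port=8123)

client.command("""
    CREATE TABLE IF NOT EXISTS analytics.transactions ON CLUSTER main
    (
        txn_id      UInt64,
        created_at  DateTime,
        amount      Decimal(18, 2),
        status      LowCardinality(String),
        payload     String CODEC(ZSTD(3))   -- heavier codec on the wide column
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/transactions', '{replica}')
    PARTITION BY toYYYYMM(created_at)
    ORDER BY (created_at, txn_id)
""")

# Compression ratio per table, computed from active parts metadata.
rows = client.query("""
    SELECT
        table,
        formatReadableSize(sum(data_uncompressed_bytes)) AS raw,
        formatReadableSize(sum(data_compressed_bytes))   AS compressed,
        round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
    FROM system.parts
    WHERE active AND database = 'analytics'
    GROUP BY table
""").result_rows

for table, raw, compressed, ratio in rows:
    print(f"{table}: {raw} -> {compressed} ({ratio}x)")
```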

Phase 3: Replace Glue with Prefect + dbt (~$2,800 → ~$30/month)

Glue was the other big spend. We replaced it with Prefect for orchestration and dbt for SQL transformations. The team now has version-controlled, testable transformation logic instead of Spark jobs running in a black box.
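As a sketch of what this looks like in practice (the dbt project path, flow name, and retry settings are illustrative, not the customer's actual configuration), a Prefect flow can shell out to dbt so scheduling, retries, and logs live in one place:

```python
# Sketch: a Prefect flow that shells out to dbt for SQL transformations.
# The dbt project path, flow name, and retry settings are hypothetical placeholders.
import subprocess

from prefect import flow, task, get_run_logger

DBT_PROJECT_DIR = "/opt/analytics/dbt"  # hypothetical project location


@task(retries=2, retry_delay_seconds=60)
def dbt_run() -> None:
    """Build the dbt models; Prefect handles retries and surfaces failures."""
    logger = get_run_logger()
    result = subprocess.run(
        ["dbt", "run", "--project-dir", DBT_PROJECT_DIR],
        capture_output=True,
        text=True,
    )
    logger.info(result.stdout)
    result.check_returncode()  # non-zero exit marks the task (and flow) as failed


@task
def dbt_test() -> None:
    """Run dbt tests so bad data fails the flow instead of reaching dashboards."""
    subprocess.run(["dbt", "test", "--project-dir", DBT_PROJECT_DIR], check=True)


@flow(name="clickhouse-transformations")
def transform() -> None:
    dbt_run()
    dbt_test()


if __name__ == "__main__":
    transform()
```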

Phase 4: Observability with OpenTelemetry + Last9

Moving off managed services means you own the monitoring too. We set up comprehensive metrics tracking with OpenTelemetry and Last9 for visualization and alerting. Same visibility they had with CloudWatch, at a fraction of the cost.
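For pipeline-level metrics (rows synced, sync lag, dbt run durations), a small OTLP exporter in the Python code is enough, since Last9 ingests standard OTLP. The endpoint, credentials, and metric names below are placeholders; this is a minimal sketch rather than the actual setup.

```python
# Sketch: push custom pipeline metrics over OTLP to a Last9-compatible endpoint.
# The endpoint URL, auth header, and metric names are hypothetical placeholders.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

exporter = OTLPMetricExporter(
    endpoint="otlp.last9.example:443",           # placeholder OTLP endpoint
    headers={"authorization": "Basic <token>"},  # placeholder credentials
)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=30_000)
provider = MeterProvider(
    resource=Resource.create({"service.name": "analytics-pipeline"}),
    metric_readers=[reader],
)
metrics.set_meter_provider(provider)

meter = metrics.get_meter("analytics.pipeline")
rows_synced = meter.create_counter("cdc.rows_synced", unit="rows")
sync_lag = meter.create_histogram("cdc.sync_lag", unit="s")

# Emit from the pipeline code after each sync batch.
rows_synced.add(1_250, {"table": "transactions"})
sync_lag.record(18.4, {"table": "transactions"})
```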

Why these choices?

Why no sharding? Their data fits in a single node with room to grow 3x. Sharding adds distributed query complexity, rebalancing headaches, and operational overhead. When they need it, the path is clear. But not yet.

Why Graviton? ~20% better price-performance than x86 for ClickHouse workloads. Easy win.

Why all open-source? No licensing costs, full operational transparency, and they're never locked in again. If they want to move clouds or go on-prem in certain regions, nothing stops them.

Tech stack used.

Analytical DB: ClickHouse
Compute: AWS EC2 (Graviton r6g, t4g)
CDC Replication: PeerDB
Storage: AWS EBS gp3
Orchestration: Prefect
Transformations: dbt
Observability: OpenTelemetry + Last9
