
Migrating from AWS Glue to Self-Hosted ClickHouse: Achieving 63% cost reduction with production-ready analytics.

Context.

We recently worked with a fintech client operating in the digital payments space. As their transaction volumes grew, so did their data infrastructure costs. Their analytics platform was built entirely on AWS managed services: AWS Glue for ETL, Amazon Athena for querying, and AWS DMS for data replication. While this serverless approach provided reliability and ease of management, the monthly costs were becoming unsustainable.

The client's operations and analytics teams relied heavily on these data pipelines to monitor transaction patterns, detect anomalies, track business metrics, and generate regulatory reports. As data volumes crossed multiple terabytes, their AWS bills for analytics infrastructure alone exceeded $13,000 per month. The finance team mandated a comprehensive review of the analytics stack to identify cost optimization opportunities without compromising on reliability or performance.

Problem Statement.

The existing AWS-managed analytics infrastructure presented a critical cost challenge:

Escalating Infrastructure Costs: Monthly AWS costs for the analytics pipeline exceeded $13,000, with the largest components being:

  • AWS Glue: ~$6,400/month for crawlers and ETL job execution

  • AWS DMS: ~$3,500/month for CDC replication instances

  • Amazon S3: ~$3,200/month for raw and processed data storage

  • Amazon Athena: Relatively modest at ~$20/month for query costs

Limited Cost Control: As a fully managed service stack, cost optimization opportunities were limited. The client couldn't tune compression algorithms, optimize storage formats beyond standard options, or control resource allocation at a granular level.

Scalability Cost Concerns: With transaction volumes projected to grow 3x over the next 18 months, extrapolating current costs would push monthly analytics infrastructure spend beyond $35,000—a figure that would consume a significant portion of the data team's budget.

Vendor Lock-in: The tight coupling with AWS managed services created concerns about portability and negotiating leverage for enterprise agreements.

Outcome/Impact.

After migrating to a self-hosted ClickHouse cluster on EC2 with open-source tooling, we achieved significant cost reduction while maintaining operational standards:

  • ~63-68% reduction in production infrastructure costs: From an estimated ~$5,000/month for production analytics to approximately $1,600-1,900/month (depending on storage phase)

  • Maintained query performance: Similar query latency to the previous Athena-based setup (2-5 seconds for typical analytical queries)

  • Maintained high availability: 99.9%+ availability through multi-replica architecture, comparable to AWS managed services

  • Preserved data freshness: CDC-based replication continued to provide near real-time data sync

  • Gained operational control: Full control over compression, indexing, and resource allocation

  • Reduced vendor lock-in: Open-source stack provides portability and flexibility

Solution.

The Existing AWS-Based Setup

The client's original architecture was built entirely on AWS managed services:

PostgreSQL (Production DB)
        ↓
AWS DMS (CDC)
        ↓
S3 (Raw Data - CSV)
        ↓
AWS Glue Crawlers
        ↓
AWS Glue ETL Jobs
        ↓
S3 (Data Marts - Parquet)
        ↓
Amazon Athena (Queries)
        ↓
Analytics Dashboard

Data Flow:

  1. Raw Data Ingestion: AWS Database Migration Service (DMS) continuously synced transactional data from production PostgreSQL databases to S3 as CSV files. This provided change data capture (CDC) capabilities with data appearing in S3 within minutes.

  2. Data Processing: AWS Glue crawlers automatically discovered schema changes and updated the Glue Data Catalog. Glue ETL jobs ran on schedules (hourly and daily) to clean the raw data, perform transformations, join datasets, and create curated data marts stored as Parquet files in S3.

  3. Analytics Layer: Analytics and operations teams used Amazon Athena to query the data marts. Athena provided serverless, on-demand query capabilities with reasonable performance for their analytical workloads.
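For context, a typical analyst query against the Parquet data marts in Athena looked roughly like the following. The table and column names here are illustrative, not the client's actual schema:

-- Athena (Presto/Trino SQL) over Parquet data marts; schema is illustrative
SELECT
    date_trunc('day', transaction_timestamp) AS day,
    merchant_id,
    count(*) AS transactions,
    sum(amount) AS volume
FROM data_marts.transactions
WHERE transaction_timestamp >= cast(date_add('day', -30, current_date) AS timestamp)
GROUP BY 1, 2
ORDER BY 1 DESC;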

Cost Analysis of AWS Setup:

The actual costs shown in AWS Cost Explorer included both development and production environments, as well as S3 storage for various use cases beyond analytics:

  • AWS Glue: $6,422.79/month (crawlers + ETL job compute across all environments)

  • AWS DMS: $3,484.57/month (replication instances for multiple databases)

  • Amazon S3: $3,171.07/month (raw data, processed data, backups, and other artifacts)

  • Amazon Athena: $20.04/month (query execution costs)

Estimated Production Analytics Costs:

Isolating just the production analytics pipeline (approximately 40-50% of total usage):

  • AWS Glue (production ETL): ~$2,800/month

  • AWS DMS (production CDC): ~$1,500/month

  • Amazon S3 (analytics data): ~$600/month

  • Amazon Athena: ~$20/month

  • Estimated Production Total: $4,900-5,000/month

While this setup provided reliability and required minimal operational overhead, the costs were dominated by Glue's compute charges and DMS replication instance costs. For the production analytics pipeline alone, costs were estimated at $4,900-5,000/month. The serverless pricing model meant costs scaled linearly with data volumes and job execution frequency.

Rationale for Migrating to ClickHouse

After evaluating several alternatives including staying with AWS managed services, migrating to Amazon Redshift, and various open-source OLAP databases, we chose self-hosted ClickHouse for several compelling reasons:

Dramatic Cost Reduction Potential: ClickHouse's columnar storage with aggressive compression (often 4-10x, depending on the data) combined with running on right-sized EC2 instances offered predictable, fixed costs. Preliminary calculations suggested we could cut ETL and replication costs by over 90% and overall production analytics spend by roughly two-thirds while maintaining similar capabilities.

Operational Control and Optimization: Self-hosting gave complete control over compression algorithms, memory allocation, storage configuration, and query optimization strategies. This level of control was impossible with fully managed services.

Production-Ready High Availability: ClickHouse's native replication capabilities allowed us to build a highly available system with multi-replica architecture, providing fault tolerance comparable to AWS managed services.

Open-Source Ecosystem: ClickHouse integrates well with modern open-source data tools like dbt, Prefect, and various CDC solutions. This reduced licensing costs and eliminated vendor lock-in.

Performance Characteristics: ClickHouse is purpose-built for OLAP workloads with features like vectorized query execution, sparse indexing, and aggressive data skipping that could match or exceed Athena's performance.
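As a quick illustration of that data skipping (using the transactions table defined later in this post), ClickHouse's EXPLAIN output shows how partition pruning and the sparse primary index narrow the scan before any rows are read; exact output varies by version:

-- Inspect partition/granule pruning for a time-bounded query (illustrative)
EXPLAIN indexes = 1
SELECT count()
FROM raw_data.transactions
WHERE transaction_timestamp >= now() - INTERVAL 7 DAY;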

Proven at Scale: ClickHouse powers analytics at companies like Cloudflare, Uber, and eBay, demonstrating its production readiness at extreme scale.

The New ClickHouse Architecture

We designed a production-ready setup that balanced high availability with cost efficiency:

PostgreSQL (Production DB)
        ↓
PeerDB (CDC Sync)
        ↓
ClickHouse Cluster (Raw Data)
  - 1 Shard, 2 Replicas
  - EC2 r6g.4xlarge instances
  - Dedicated EBS gp3 volumes
        ↓
Prefect (Orchestration)
        ↓
dbt (Transformations)
        ↓
ClickHouse (Data Marts)
        ↓
Analytics Dashboard
        ↓
OpenTelemetry + Last9 (Monitoring & Observability)

1. ClickHouse Cluster Configuration: 1 Shard, 2 Replicas (1S2R)

We implemented a 1-shard, 2-replica cluster architecture that provided production-grade high availability:

Why 1 Shard?

  • Current and projected data volumes (roughly 1 TB compressed today, growing toward 1.5 TB over the next 18 months) fit comfortably within a single shard

  • Eliminated distributed query complexity and cross-node network overhead

  • Simplified operations, monitoring, and troubleshooting

  • Reduced potential failure points in the system

  • Made capacity planning straightforward

Why 2 Replicas?

  • Ensured zero downtime during maintenance, deployments, or node failures

  • Provided read scalability by load-balancing queries across replicas

  • Protected against data loss with synchronous replication via ClickHouse Keeper

  • Allowed rolling updates without service interruption

  • Maintained service availability even if one replica failed

Infrastructure Setup:

  • 2x EC2 r6g.4xlarge instances (16 vCPUs, 128 GB RAM each)

    • Graviton-based processors for better price-performance ratio

    • Deployed across two availability zones (ap-southeast-1b and 1c)

  • Dedicated EBS gp3 volumes: Started with 250GB, expandable to 1.5TB per instance

  • ClickHouse Keeper cluster: 3x t4g.medium instances for distributed coordination

    • Spread across availability zones for fault tolerance

    • Handles replication coordination and cluster configuration management

  • Network: Deployed in private subnets within existing VPC with proper security group configuration

  • Deployment: Docker-based deployment with persistent volumes for ease of management

ClickHouse Keeper Configuration:

ClickHouse Keeper replaced ZooKeeper as the coordination service. It's a lightweight, ClickHouse-native solution for managing distributed cluster state:

<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>keeper-1.internal</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>keeper-2.internal</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>3</id>
                <hostname>keeper-3.internal</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>

Cluster Configuration:

<clickhouse>
    <remote_servers>
        <prod_cluster>
            <shard>
                <replica>
                    <host>clickhouse-1.internal</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-2.internal</host>
                    <port>9000</port>
                </replica>
            </shard>
        </prod_cluster>
    </remote_servers>
    <zookeeper>
        <node>
            <host>keeper-1.internal</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper-2.internal</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper-3.internal</host>
            <port>9181</port>
        </node>
    </zookeeper>
</clickhouse>
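Once both nodes pick up this configuration, cluster membership can be sanity-checked from any node; a minimal query:

-- Confirm that both replicas are registered in prod_cluster
SELECT cluster, shard_num, replica_num, host_name, port, is_local
FROM system.clusters
WHERE cluster = 'prod_cluster';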

All tables use replicated engines from the MergeTree family for automatic replication. Different tables use different engine variants based on their use case:

-- Standard replicated table for append-only data
CREATE TABLE raw_data.transactions ON CLUSTER prod_cluster
(
    transaction_id UUID,
    user_id UUID,
    amount Decimal(18,2),
    currency LowCardinality(String),
    transaction_timestamp DateTime64(3),
    status LowCardinality(String),
    merchant_id UUID
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/transactions', '{replica}')
PARTITION BY toYYYYMM(transaction_timestamp)
ORDER BY (transaction_timestamp, transaction_id)
SETTINGS index_granularity = 8192;

-- Versioned collapsing tree for data that receives updates
CREATE TABLE data_marts.user_balance ON CLUSTER prod_cluster
(
    user_id UUID,
    balance Decimal(18,2),
    last_updated DateTime64(3),
    version UInt64,
    sign Int8
)
ENGINE = ReplicatedVersionedCollapsingMergeTree('/clickhouse/tables/{shard}/user_balance', '{replica}', sign, version)
PARTITION BY toYYYYMM(last_updated)
ORDER BY (user_id, version);
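Replication health for these tables can be checked from either node via ClickHouse's system tables; a small sketch with the column list trimmed to the essentials:

-- Per-table replication status and lag
SELECT
    database,
    table,
    is_readonly,
    absolute_delay,   -- seconds this replica lags behind
    queue_size,       -- pending entries in the replication queue
    active_replicas,
    total_replicas
FROM system.replicas
WHERE database IN ('raw_data', 'data_marts');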

2. PeerDB for Real-Time Data Synchronization

We replaced AWS DMS with PeerDB, an open-source CDC tool optimized for PostgreSQL to ClickHouse replication:

Why PeerDB?

  • Native ClickHouse integration with efficient batch inserts

  • Better handling of schema evolution and data type mappings

  • Self-hosted on a single small EC2 instance, dramatically reducing replication costs

  • Maintains same data freshness as DMS (near real-time CDC)

  • Open-source with active community support

Cost Comparison:

  • AWS DMS (production): ~$1,500/month for replication instances

  • PeerDB on EC2 t3.medium: ~$30/month

  • Savings: ~$1,470/month (98% reduction in replication costs)

PeerDB Configuration:

# peer.yaml
source:
  type: postgres
  host: production-db.example.com
  port: 5432
  database: production
  user: replication_user
  ssl_mode: require

destination:
  type: clickhouse
  host: clickhouse-1.internal
  port: 9000
  database: raw_data
  cluster: prod_cluster

sync_mode: cdc
batch_size: 10000
sync_interval: 30s
parallelism: 4
initial_snapshot: true

PeerDB creates raw data tables in ClickHouse using the distributed table pattern:

-- Local table on each replica
CREATE TABLE raw_data.payments_local ON CLUSTER prod_cluster
(
    payment_id UUID,
    transaction_id UUID,
    payment_method LowCardinality(String),
    amount Decimal(18,2),
    created_at DateTime64(3),
    updated_at DateTime64(3)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/payments', '{replica}')
ORDER BY (created_at, payment_id);

-- Distributed table for queries
CREATE TABLE raw_data.payments ON CLUSTER prod_cluster
AS raw_data.payments_local
ENGINE = Distributed(prod_cluster, raw_data, payments_local, rand());
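A simple way to confirm that PeerDB is keeping the raw tables fresh is to compare the newest change timestamp against the wall clock; a sketch against the payments table above (the acceptable lag threshold is workload-specific):

-- Spot-check CDC freshness on a PeerDB-replicated table
SELECT
    max(updated_at) AS latest_change,
    dateDiff('second', max(updated_at), now()) AS seconds_behind
FROM raw_data.payments;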

3. Prefect + dbt for Data Transformation Pipeline

We replaced AWS Glue ETL jobs with a modern open-source stack:

Prefect for workflow orchestration:

  • Scheduled job execution (replacing Glue job triggers)

  • Dependency management between transformation steps

  • Retry logic and comprehensive error handling

  • Real-time monitoring and alerting

  • Self-hosted on EC2 t3.medium instance

dbt for SQL-based transformations:

  • Version-controlled transformation logic in Git

  • Incremental model updates for efficiency

  • Built-in data quality testing

  • Auto-generated documentation

  • Much lighter weight than Glue's Spark-based ETL

Cost Comparison:

  • AWS Glue ETL (production): ~$2,800/month for job execution

  • Prefect on EC2: ~$30/month

  • Savings: ~$2,770/month (99% reduction in ETL compute costs)

Sample dbt Model:

-- models/marts/daily_transaction_summary.sql
{{ config(
    materialized='incremental',
    engine="ReplicatedMergeTree('/clickhouse/tables/{shard}/daily_transaction_summary', '{replica}')",
    order_by='(summary_date, merchant_id)',
    partition_by='toYYYYMM(summary_date)',
    unique_key="summary_date || '-' || toString(merchant_id)"
) }}

SELECT
    toDate(transaction_timestamp) AS summary_date,
    merchant_id,
    currency,
    count() AS transaction_count,
    sum(amount) AS total_volume,
    avg(amount) AS avg_transaction_amount,
    countIf(status = 'completed') AS successful_transactions,
    countIf(status = 'failed') AS failed_transactions
FROM {{ ref('cleaned_transactions') }}
{% if is_incremental() %}
WHERE transaction_timestamp > (SELECT max(summary_date) FROM {{ this }})
{% endif %}
GROUP BY summary_date, merchant_id, currency
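Assuming this mart materializes into a data_marts schema like the other marts in this post, a dashboard-style query against it might look like:

-- Example downstream query against the incremental mart (schema name assumed)
SELECT
    summary_date,
    sum(transaction_count) AS transactions,
    sum(total_volume) AS volume
FROM data_marts.daily_transaction_summary
WHERE summary_date >= today() - 30
GROUP BY summary_date
ORDER BY summary_date DESC;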

Prefect Workflow:

from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner
import subprocess


@task(retries=2, retry_delay_seconds=60)
def run_dbt_model(model_name: str):
    """Execute a dbt model with retry logic"""
    result = subprocess.run(
        ["dbt", "run", "--models", model_name, "--profiles-dir", "./profiles"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise Exception(f"dbt model {model_name} failed: {result.stderr}")
    return result.stdout


@task(retries=2)
def run_dbt_tests(model_name: str):
    """Run data quality tests"""
    result = subprocess.run(
        ["dbt", "test", "--models", model_name, "--profiles-dir", "./profiles"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise Exception(f"Tests failed for {model_name}: {result.stderr}")
    return result.stdout


@flow(task_runner=ConcurrentTaskRunner())
def daily_analytics_pipeline():
    """Daily ETL pipeline replacing Glue jobs"""
    # Stage 1: Clean and standardize raw data
    cleaned_transactions = run_dbt_model("staging.cleaned_transactions")
    cleaned_users = run_dbt_model("staging.cleaned_users")
    cleaned_merchants = run_dbt_model("staging.cleaned_merchants")

    # Stage 2: Data quality checks
    run_dbt_tests("staging")

    # Stage 3: Build aggregated data marts
    daily_summary = run_dbt_model("marts.daily_transaction_summary")
    merchant_metrics = run_dbt_model("marts.merchant_performance_metrics")
    fraud_indicators = run_dbt_model("marts.fraud_risk_indicators")

    # Stage 4: Final validation
    run_dbt_tests("marts")


if __name__ == "__main__":
    daily_analytics_pipeline()

4. Observability with OpenTelemetry and Last9

Unlike AWS managed services that provide built-in monitoring, we needed to implement comprehensive observability:

Monitoring Stack:

  • OpenTelemetry Collector: Deployed alongside ClickHouse servers

  • Prometheus receiver: Scrapes ClickHouse metrics endpoints

  • Last9: Cloud-based metrics backend for long-term storage and alerting

Key Metrics Tracked:

  • Query performance (latency, throughput, queue depth)

  • Memory utilization and cache hit rates

  • Disk I/O and storage capacity

  • Replication lag between replicas

  • Keeper cluster health

  • Table sizes and compression ratios

OpenTelemetry Configuration:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'clickhouse'
          scrape_interval: 30s
          static_configs:
            - targets:
                - 'clickhouse-1.internal:8001'
                - 'clickhouse-2.internal:8001'
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: 'ClickHouse.*'
              action: keep

exporters:
  otlp:
    endpoint: "otlp.last9.io:443"
    headers:
      Authorization: "Bearer ${LAST9_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]

This provided visibility comparable to CloudWatch for AWS services, with additional flexibility for custom metrics and dashboards.
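For ad-hoc verification outside the dashboards, most of these KPIs can also be pulled straight from ClickHouse's system tables (assuming query logging is enabled, which it is by default); two hedged examples:

-- p95 query latency over the last 24 hours
SELECT quantile(0.95)(query_duration_ms) / 1000 AS p95_seconds
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY;

-- Mark cache hits vs misses since server start (a proxy for cache effectiveness)
SELECT event, value
FROM system.events
WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');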

Capacity Planning: Right-Sizing the Cluster

Proper capacity planning was essential to achieve cost efficiency while ensuring the cluster could handle current workloads and future growth.

1. Memory-to-Data Ratio Guidelines

ClickHouse performs optimally when frequently accessed data can be cached in memory. Industry guidelines suggest:

  • Optimal ratio: 1:10 to 1:15 (memory to compressed data)

  • Acceptable ratio: 1:20 to 1:30 for mostly read-heavy workloads

  • Minimum ratio: 1:50 for cold storage / archival scenarios

For this use case with complex analytical queries and moderate query concurrency, we targeted a 1:12 ratio (roughly 1.5 TB of compressed data against 128 GB of RAM per node) as a balance between performance and cost.

2. Compression Testing with Partial Migration

Rather than estimating, we conducted empirical compression testing by analyzing existing data already loaded into a test ClickHouse environment:

Test Methodology:

  1. Compared storage sizes between S3 (raw/processed data) and ClickHouse compressed storage

  2. Used ZSTD compression (rather than ClickHouse's default LZ4) for higher compression ratios

  3. Analyzed compression ratios across different data types

Current Data in S3:

  • Data mart tables: 2.5 TB

  • Clean tables in data lake: 2.5 TB

  • Raw tables in data lake: 425 GB

  • Total: 5.4 TB

Compression Results:

Data Type               | S3 Storage | ClickHouse (ZSTD) | Compression Ratio
------------------------|------------|-------------------|------------------
Data mart tables        | 2.5 TB     | 325 GB            | 7.69x
Clean tables            | 2.5 TB     | 638 GB            | 3.92x
Raw tables (via PeerDB) | 425 GB     | 104 GB            | 4.09x
Total                   | 5.4 TB     | ~1.0 TB           | ~5.4x average

Key Insight: The compression ratios varied significantly based on data characteristics. Data mart tables (highly aggregated, optimized schemas) achieved the best compression at 7.69x, while raw tables with more varied data types achieved 4.09x compression. This validated that ClickHouse's columnar storage with ZSTD compression could handle the entire dataset in approximately 1TB of storage.
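These ratios come straight from ClickHouse's part-level accounting; per-table compression can be measured with a query along these lines (database names as used elsewhere in this post):

-- Compare compressed vs uncompressed bytes per table
SELECT
    database,
    table,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND database IN ('raw_data', 'data_marts')
GROUP BY database, table
ORDER BY ratio DESC;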

3. Total Data Storage Calculations

Based on empirical compression testing with actual data, we calculated storage requirements:

Current Data Volume:

  • Total data in S3: 5.4 TB

  • Measured compressed size in ClickHouse: ~1.0 TB

Growth Planning:

  • Current compressed storage: 1.0 TB

  • Expected growth over 18-24 months: Additional 300-500 GB

  • Target total capacity: 1.5 TB per node

Operational Overhead Buffer:

  • The 1.5TB allocation provides 50% buffer for:

    • Temporary tables during transformations

    • MergeTree parts during background merges

    • Replication queue overhead

    • Growth runway

Storage Allocation Decision:

  • Started with 250GB EBS gp3 volumes per node for initial migration

  • Planned expansion to 1.5TB volumes as data migration completes

  • Total cluster storage at full capacity: 3TB (2 replicas × 1.5TB)

  • Phased approach reduces initial costs during migration phase
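Disk headroom on each node can be watched directly from ClickHouse, which makes it easy to time the EBS expansion; a minimal check:

-- Free vs total space on the volumes ClickHouse is using
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;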

4. Instance Selection and Memory Sizing

We followed ClickHouse's recommended memory-to-storage ratio guidelines while accounting for the workload characteristics:

Memory-to-Storage Ratio:

  • ClickHouse recommends 1:100 ratio for internal analytics tools with moderate query loads

  • For workloads with heavier concurrent queries, 1:50 ratio provides better performance

  • Our initial assessment targeted 1:100, with flexibility to scale up if needed

Memory Requirements:

  • Current uncompressed data in ClickHouse: ~170 GB

  • Initial compressed storage: ~1.0 TB

  • Using 1:100 ratio: 1 TB ÷ 100 = 10 GB minimum

  • Using 1:50 ratio: 1 TB ÷ 50 = 20 GB minimum

Initial Instance Type:

Started with r6g.xlarge instances (32 GB RAM each):

  • Provided comfortable buffer above minimum requirements

  • ClickHouse production recommendation: 32GB minimum

  • Cost-effective for initial deployment

  • Graviton-based for better price-performance

Scaling to Production Workload:

After observing actual query patterns and concurrency, we upgraded to r6g.4xlarge instances (128 GB RAM each):

  • 128 GB RAM per instance handles growing workload

  • 16 vCPUs support higher query parallelization

  • Better cache hit rates with more available memory

  • Provides headroom for 1:12 memory-to-data ratio (optimal for query-heavy workloads)

  • Graviton processors deliver 20-30% better price-performance vs Intel/AMD

5. Final Cluster Cost Breakdown

Component                                        | Specification       | Monthly Cost
-------------------------------------------------|---------------------|---------------------
2× EC2 r6g.4xlarge                               | 16 vCPU, 128 GB RAM | ~$710 × 2 = $1,420
2× EBS gp3 (250 GB initially, scaling to 1.5 TB) | 3000 IOPS           | $24 to $144 each
3× EC2 t4g.medium (Keeper)                       | 2 vCPU, 4 GB RAM    | $30.95 × 3 = $93
1× EC2 t3.medium (PeerDB)                        | 2 vCPU, 4 GB RAM    | ~$30
1× EC2 t3.medium (Prefect)                       | 2 vCPU, 4 GB RAM    | ~$30
Initial infrastructure (250 GB storage)          |                     | ~$1,621/month
Full capacity (1.5 TB storage)                   |                     | ~$1,861/month

Cost Comparison:

  • Previous AWS stack (production): ~$5,000/month

  • New ClickHouse stack (initial): ~$1,621/month

  • New ClickHouse stack (full capacity): ~$1,861/month

  • Monthly savings: $3,139 - $3,379

  • Annual savings: ~$38,000 - $40,000

  • Cost reduction: ~63-68%

6. Monitoring and Validation

Post-migration, we validated capacity planning through comprehensive monitoring:

Key Performance Indicators (3-month average):

  • Memory utilization: 58% (well within capacity)

  • CPU utilization: 35% (headroom for query spikes)

  • Storage utilization: 42% at initial phase (growing as migration completes)

  • Query latency p95: 4.2 seconds (comparable to Athena)

  • Cache hit rate: 87% (excellent memory sizing)

  • Replication lag: avg 1.8 seconds, max 7 seconds

These metrics validated our capacity planning assumptions and confirmed sufficient headroom for projected growth.

Infrastructure as Code with Terraform

The entire cluster was provisioned using Terraform, ensuring reproducibility and easy disaster recovery:

Terraform Modules:

  • vpc: Network infrastructure (using existing VPC)

  • clickhouse_server: EC2 instances, EBS volumes, security groups

  • clickhouse_keeper: Keeper cluster instances across AZs

Key Configuration Variables:

environment                     = "production"
clickhouse_shard_count          = 1
clickhouse_replica_count        = 2
clickhouse_server_instance_type = "r6g.4xlarge"
clickhouse_server_ebs_size      = 250  # Initial, scaling to 1500
clickhouse_keeper_count         = 3
clickhouse_keeper_instance_type = "t4g.medium"

Automated Configuration Generation:

A custom script generates ClickHouse configuration files from Terraform outputs:

bash generate_configs.sh \
  --cluster-name prod \
  --clickhouse-user admin \
  --clickhouse-password ${CLICKHOUSE_PASSWORD}

This generates:

  • XML configuration files for each ClickHouse node

  • Docker Compose configurations

  • Environment files with credentials

  • Deployment scripts

Key Takeaways

Migrating from AWS managed services to self-hosted ClickHouse delivered a 63-68% cost reduction (~$38-40K annual savings) while maintaining operational standards. Several factors contributed to this success:

Cost Structure Matters: AWS managed services provide convenience but charge premium prices. For stable workloads with predictable patterns, self-hosting on EC2 with open-source tools can reduce costs by 60-70% without sacrificing capabilities.

Empirical Testing Over Estimation: Actual compression testing with ZSTD yielded varied results (3.92x to 7.69x depending on data type, ~5.4x average), directly impacting capacity planning and cost projections. Always test with real data before committing to infrastructure sizing.

High Availability Doesn't Require Complexity: The 1-shard, 2-replica architecture provided production-grade availability without the operational overhead of sharded deployments. For many use cases, vertical scaling within a single shard is simpler and more cost-effective than distributed sharding.

Open-Source Ecosystem Maturity: Modern open-source tools (PeerDB, Prefect, dbt, OpenTelemetry) rival managed services in functionality. The key is investing time in proper setup, monitoring, and operational procedures.

Phased Capacity Planning: Starting with 250GB storage and r6g.xlarge instances, then scaling to 1.5TB and r6g.4xlarge based on actual usage patterns prevented both over-provisioning at launch and performance issues during growth.

Memory-to-Storage Ratios Guide Scaling: Starting with 1:100 ratio (suitable for internal analytics) and scaling to more aggressive ratios (1:12) as query patterns evolved provided a clear framework for instance sizing decisions.

Observability Is Non-Negotiable: Unlike AWS managed services with built-in monitoring, self-hosted infrastructure requires explicit investment in observability. OpenTelemetry + Last9 provided comparable visibility to CloudWatch at a fraction of the cost.

Infrastructure as Code Pays Dividends: Terraform and automated configuration generation made the cluster reproducible, testable, and recoverable. This reduces operational risk and speeds up disaster recovery scenarios.

For organizations facing similar AWS cost challenges, this migration demonstrates that moving to self-hosted open-source infrastructure is not just feasible; it can deliver substantial cost savings (a 60-70% reduction) while maintaining, or even exceeding, the reliability, performance, and operational standards of managed services. The key is thorough planning, empirical testing with real data, phased scaling based on actual usage patterns, and a commitment to building proper observability and automation.
