Context.
We recently worked with a fintech client operating in the digital payments space. As their transaction volumes grew, so did their data infrastructure costs. Their analytics platform was built entirely on AWS managed services: AWS Glue for ETL, Amazon Athena for querying, and AWS DMS for data replication. While this serverless approach provided reliability and ease of management, the monthly costs were becoming unsustainable.
The client's operations and analytics teams relied heavily on these data pipelines to monitor transaction patterns, detect anomalies, track business metrics, and generate regulatory reports. As data volumes crossed multiple terabytes, their AWS bills for analytics infrastructure alone exceeded $13,000 per month. The finance team mandated a comprehensive review of the analytics stack to identify cost optimization opportunities without compromising on reliability or performance.
Problem Statement.
The existing AWS-managed analytics infrastructure presented a critical cost challenge:
Escalating Infrastructure Costs: Monthly AWS costs for the analytics pipeline exceeded $13,000, with the largest components being:
AWS Glue: ~$6,400/month for crawlers and ETL job execution
AWS DMS: ~$3,500/month for CDC replication instances
Amazon S3: ~$3,200/month for raw and processed data storage
Amazon Athena: Relatively modest at ~$20/month for query costs
Limited Cost Control: As a fully managed service stack, cost optimization opportunities were limited. The client couldn't tune compression algorithms, optimize storage formats beyond standard options, or control resource allocation at a granular level.
Scalability Cost Concerns: With transaction volumes projected to grow 3x over the next 18 months, extrapolating current costs would push monthly analytics infrastructure spend beyond $35,000—a figure that would consume a significant portion of the data team's budget.
Vendor Lock-in: The tight coupling with AWS managed services created concerns about portability and negotiating leverage for enterprise agreements.
Outcome/Impact.
After migrating to a self-hosted ClickHouse cluster on EC2 with open-source tooling, we achieved significant cost reduction while maintaining operational standards:
60-70% reduction in production infrastructure costs: from an estimated $5,000-6,000/month for production analytics to approximately $1,600-1,900/month (depending on storage phase)
Maintained query performance: Similar query latency to the previous Athena-based setup (2-5 seconds for typical analytical queries)
Maintained high availability: 99.9%+ availability through multi-replica architecture, comparable to AWS managed services
Preserved data freshness: CDC-based replication continued to provide near real-time data sync
Gained operational control: Full control over compression, indexing, and resource allocation
Reduced vendor lock-in: Open-source stack provides portability and flexibility
Solution.
The Existing AWS-Based Setup
The client's original architecture was built entirely on AWS managed services:
PostgreSQL (Production DB)
  ↓
AWS DMS (CDC)
  ↓
S3 (Raw Data - CSV)
  ↓
AWS Glue Crawlers
  ↓
AWS Glue ETL Jobs
  ↓
S3 (Data Marts - Parquet)
  ↓
Amazon Athena (Queries)
  ↓
Analytics Dashboard
Data Flow:
Raw Data Ingestion: AWS Database Migration Service (DMS) continuously synced transactional data from production PostgreSQL databases to S3 as CSV files. This provided change data capture (CDC) capabilities with data appearing in S3 within minutes.
Data Processing: AWS Glue crawlers automatically discovered schema changes and updated the Glue Data Catalog. Glue ETL jobs ran on schedules (hourly and daily) to clean the raw data, perform transformations, join datasets, and create curated data marts stored as Parquet files in S3.
Analytics Layer: Analytics and operations teams used Amazon Athena to query the data marts. Athena provided serverless, on-demand query capabilities with reasonable performance for their analytical workloads.
Cost Analysis of AWS Setup:
The actual costs shown in AWS Cost Explorer included both development and production environments, as well as S3 storage for various use cases beyond analytics:
AWS Glue: $6,422.79/month (crawlers + ETL job compute across all environments)
AWS DMS: $3,484.57/month (replication instances for multiple databases)
Amazon S3: $3,171.07/month (raw data, processed data, backups, and other artifacts)
Amazon Athena: $20.04/month (query execution costs)
Estimated Production Analytics Costs:
Isolating just the production analytics pipeline (approximately 40-50% of total usage):
AWS Glue (production ETL): ~$2,800/month
AWS DMS (production CDC): ~$1,500/month
Amazon S3 (analytics data): ~$600/month
Amazon Athena: ~$20/month
Estimated Production Total: $4,900-5,000/month
While this setup provided reliability and required minimal operational overhead, costs were dominated by Glue's compute charges and DMS replication instances, and the serverless pricing model meant spend scaled roughly linearly with data volumes and job execution frequency.
Rationale for Migrating to ClickHouse
After evaluating several alternatives including staying with AWS managed services, migrating to Amazon Redshift, and various open-source OLAP databases, we chose self-hosted ClickHouse for several compelling reasons:
Dramatic Cost Reduction Potential: ClickHouse's columnar storage with aggressive compression (typically 8-10x compression ratios) combined with running on right-sized EC2 instances offered predictable, fixed costs. Preliminary calculations suggested we could cut the largest line items (ETL compute and CDC replication) by over 90% and overall analytics infrastructure costs by roughly two-thirds while maintaining similar capabilities.
Operational Control and Optimization: Self-hosting gave complete control over compression algorithms, memory allocation, storage configuration, and query optimization strategies. This level of control was impossible with fully managed services.
Production-Ready High Availability: ClickHouse's native replication capabilities allowed us to build a highly available system with multi-replica architecture, providing fault tolerance comparable to AWS managed services.
Open-Source Ecosystem: ClickHouse integrates well with modern open-source data tools like dbt, Prefect, and various CDC solutions. This reduced licensing costs and eliminated vendor lock-in.
Performance Characteristics: ClickHouse is purpose-built for OLAP workloads with features like vectorized query execution, sparse indexing, and aggressive data skipping that could match or exceed Athena's performance.
Proven at Scale: ClickHouse powers analytics at companies like Cloudflare, Uber, and eBay, demonstrating its production readiness at extreme scale.
The New ClickHouse Architecture
We designed a production-ready setup that balanced high availability with cost efficiency:
PostgreSQL (Production DB)
  ↓
PeerDB (CDC Sync)
  ↓
ClickHouse Cluster (Raw Data)
  - 1 Shard, 2 Replicas
  - EC2 r6g.4xlarge instances
  - Dedicated EBS gp3 volumes
  ↓
Prefect (Orchestration)
  ↓
dbt (Transformations)
  ↓
ClickHouse (Data Marts)
  ↓
Analytics Dashboard
  ↓
OpenTelemetry + Last9 (Monitoring & Observability)
1. ClickHouse Cluster Configuration: 1 Shard, 2 Replicas (1S2R)
We implemented a 1-shard, 2-replica cluster architecture that provided production-grade high availability:
Why 1 Shard?
Current and projected data volume (up to 5TB compressed over 18 months) fit comfortably within a single shard
Eliminated distributed query complexity and cross-node network overhead
Simplified operations, monitoring, and troubleshooting
Reduced potential failure points in the system
Made capacity planning straightforward
Why 2 Replicas?
Ensured zero downtime during maintenance, deployments, or node failures
Provided read scalability by load-balancing queries across replicas
Protected against data loss through replication coordinated by ClickHouse Keeper
Allowed rolling updates without service interruption
Maintained service availability even if one replica failed
Infrastructure Setup:
2x EC2 r6g.4xlarge instances (16 vCPUs, 128 GB RAM each)
Graviton-based processors for better price-performance ratio
Deployed across two availability zones (ap-southeast-1b and 1c)
Dedicated EBS gp3 volumes: Started with 250GB, expandable to 1.5TB per instance
ClickHouse Keeper cluster: 3x t4g.medium instances for distributed coordination
Spread across availability zones for fault tolerance
Handles replication coordination and cluster configuration management
Network: Deployed in private subnets within existing VPC with proper security group configuration
Deployment: Docker-based deployment with persistent volumes for ease of management
ClickHouse Keeper Configuration:
ClickHouse Keeper replaced ZooKeeper as the coordination service. It's a lightweight, ClickHouse-native solution for managing distributed cluster state:
<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>keeper-1.internal</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>keeper-2.internal</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>3</id>
                <hostname>keeper-3.internal</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
Cluster Configuration:
<clickhouse>
    <remote_servers>
        <prod_cluster>
            <shard>
                <replica>
                    <host>clickhouse-1.internal</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-2.internal</host>
                    <port>9000</port>
                </replica>
            </shard>
        </prod_cluster>
    </remote_servers>
    <zookeeper>
        <node>
            <host>keeper-1.internal</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper-2.internal</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper-3.internal</host>
            <port>9181</port>
        </node>
    </zookeeper>
</clickhouse>
All tables use replicated engines from the MergeTree family for automatic replication. Different tables use different engine variants based on their use case:
-- Standard replicated table for append-only data
CREATE TABLE raw_data.transactions ON CLUSTER prod_cluster
(
    transaction_id UUID,
    user_id UUID,
    amount Decimal(18, 2),
    currency LowCardinality(String),
    transaction_timestamp DateTime64(3),
    status LowCardinality(String),
    merchant_id UUID
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/transactions', '{replica}')
PARTITION BY toYYYYMM(transaction_timestamp)
ORDER BY (transaction_timestamp, transaction_id)
SETTINGS index_granularity = 8192;

-- Versioned collapsing tree for data with updates
CREATE TABLE data_marts.user_balance ON CLUSTER prod_cluster
(
    user_id UUID,
    balance Decimal(18, 2),
    last_updated DateTime64(3),
    version UInt64,
    sign Int8
)
ENGINE = ReplicatedVersionedCollapsingMergeTree('/clickhouse/tables/{shard}/user_balance', '{replica}', sign, version)
PARTITION BY toYYYYMM(last_updated)
ORDER BY (user_id, version);
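Reading from the versioned collapsing table requires collapsing superseded versions at query time; one way to do this is the FINAL modifier. A minimal sketch (the UUID literal is just a placeholder):

-- Collapse superseded versions at read time to get the latest balance
SELECT user_id, balance, last_updated
FROM data_marts.user_balance FINAL
WHERE user_id = toUUID('00000000-0000-0000-0000-000000000000');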
2. PeerDB for Real-Time Data Synchronization
We replaced AWS DMS with PeerDB, an open-source CDC tool optimized for PostgreSQL to ClickHouse replication:
Why PeerDB?
Native ClickHouse integration with efficient batch inserts
Better handling of schema evolution and data type mappings
Self-hosted on a single small EC2 instance, dramatically reducing replication costs
Maintains same data freshness as DMS (near real-time CDC)
Open-source with active community support
Cost Comparison:
AWS DMS (production): ~$1,500/month for replication instances
PeerDB on EC2 t3.medium: ~$30/month
Savings: ~$1,470/month (98% reduction in replication costs)
PeerDB Configuration:
# peer.yaml
source:
  type: postgres
  host: production-db.example.com
  port: 5432
  database: production
  user: replication_user
  ssl_mode: require

destination:
  type: clickhouse
  host: clickhouse-1.internal
  port: 9000
  database: raw_data
  cluster: prod_cluster

sync_mode: cdc
batch_size: 10000
sync_interval: 30s
parallelism: 4
initial_snapshot: true
PeerDB creates raw data tables in ClickHouse using the distributed table pattern:
-- Local table on each replica
CREATE TABLE raw_data.payments_local ON CLUSTER prod_cluster
(
    payment_id UUID,
    transaction_id UUID,
    payment_method LowCardinality(String),
    amount Decimal(18, 2),
    created_at DateTime64(3),
    updated_at DateTime64(3)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/payments', '{replica}')
ORDER BY (created_at, payment_id);

-- Distributed table for queries
CREATE TABLE raw_data.payments ON CLUSTER prod_cluster AS raw_data.payments_local
ENGINE = Distributed(prod_cluster, raw_data, payments_local, rand());
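As a quick freshness spot-check after cutover (not part of the PeerDB setup itself), a query along these lines against the tables above shows how far the newest replicated row lags behind the current time:

-- Approximate end-to-end CDC lag for the payments table
SELECT dateDiff('second', max(updated_at), now()) AS replication_lag_seconds
FROM raw_data.payments;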
3. Prefect + dbt for Data Transformation Pipeline
We replaced AWS Glue ETL jobs with a modern open-source stack:
Prefect for workflow orchestration:
Scheduled job execution (replacing Glue job triggers)
Dependency management between transformation steps
Retry logic and comprehensive error handling
Real-time monitoring and alerting
Self-hosted on EC2 t3.medium instance
dbt for SQL-based transformations:
Version-controlled transformation logic in Git
Incremental model updates for efficiency
Built-in data quality testing
Auto-generated documentation
Much lighter weight than Glue's Spark-based ETL
Cost Comparison:
AWS Glue ETL (production): ~$2,800/month for job execution
Prefect on EC2: ~$30/month
Savings: ~$2,770/month (99% reduction in ETL compute costs)
Sample dbt Model:
-- models/marts/daily_transaction_summary.sql
{{
    config(
        materialized='incremental',
        engine="ReplicatedMergeTree('/clickhouse/tables/{shard}/daily_transaction_summary', '{replica}')",
        order_by='(summary_date, merchant_id)',
        partition_by='toYYYYMM(summary_date)',
        unique_key="summary_date || '-' || toString(merchant_id)"
    )
}}

SELECT
    toDate(transaction_timestamp) AS summary_date,
    merchant_id,
    currency,
    count() AS transaction_count,
    sum(amount) AS total_volume,
    avg(amount) AS avg_transaction_amount,
    countIf(status = 'completed') AS successful_transactions,
    countIf(status = 'failed') AS failed_transactions
FROM {{ ref('cleaned_transactions') }}
{% if is_incremental() %}
WHERE transaction_timestamp > (SELECT max(summary_date) FROM {{ this }})
{% endif %}
GROUP BY summary_date, merchant_id, currency
Prefect Workflow:
from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner
import subprocess


@task(retries=2, retry_delay_seconds=60)
def run_dbt_model(model_name: str):
    """Execute a dbt model with retry logic"""
    result = subprocess.run(
        ["dbt", "run", "--models", model_name, "--profiles-dir", "./profiles"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise Exception(f"dbt model {model_name} failed: {result.stderr}")
    return result.stdout


@task(retries=2)
def run_dbt_tests(model_name: str):
    """Run data quality tests"""
    result = subprocess.run(
        ["dbt", "test", "--models", model_name, "--profiles-dir", "./profiles"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise Exception(f"Tests failed for {model_name}: {result.stderr}")
    return result.stdout


@flow(task_runner=ConcurrentTaskRunner())
def daily_analytics_pipeline():
    """Daily ETL pipeline replacing Glue jobs"""
    # Stage 1: Clean and standardize raw data
    cleaned_transactions = run_dbt_model("staging.cleaned_transactions")
    cleaned_users = run_dbt_model("staging.cleaned_users")
    cleaned_merchants = run_dbt_model("staging.cleaned_merchants")

    # Stage 2: Data quality checks
    run_dbt_tests("staging")

    # Stage 3: Build aggregated data marts
    daily_summary = run_dbt_model("marts.daily_transaction_summary")
    merchant_metrics = run_dbt_model("marts.merchant_performance_metrics")
    fraud_indicators = run_dbt_model("marts.fraud_risk_indicators")

    # Stage 4: Final validation
    run_dbt_tests("marts")


if __name__ == "__main__":
    daily_analytics_pipeline()
4. Observability with OpenTelemetry and Last9
Unlike AWS managed services that provide built-in monitoring, we needed to implement comprehensive observability:
Monitoring Stack:
OpenTelemetry Collector: Deployed alongside ClickHouse servers
Prometheus receiver: Scrapes ClickHouse metrics endpoints
Last9: Cloud-based metrics backend for long-term storage and alerting
Key Metrics Tracked:
Query performance (latency, throughput, queue depth)
Memory utilization and cache hit rates
Disk I/O and storage capacity
Replication lag between replicas
Keeper cluster health
Table sizes and compression ratios
OpenTelemetry Configuration:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'clickhouse'
          scrape_interval: 30s
          static_configs:
            - targets:
                - 'clickhouse-1.internal:8001'
                - 'clickhouse-2.internal:8001'
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: 'ClickHouse.*'
              action: keep

exporters:
  otlp:
    endpoint: "otlp.last9.io:443"
    headers:
      Authorization: "Bearer ${LAST9_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
This provided visibility comparable to CloudWatch for AWS services, with additional flexibility for custom metrics and dashboards.
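Beyond the exported metrics, most of these signals can also be checked ad hoc from ClickHouse's system tables; a sketch for replica lag and queue depth (the thresholds are illustrative):

-- Replicas that are falling behind or accumulating replication queue entries
SELECT database, table, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 5 OR queue_size > 100;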
Capacity Planning: Right-Sizing the Cluster
Proper capacity planning was essential to achieve cost efficiency while ensuring the cluster could handle current workloads and future growth.
1. Memory-to-Data Ratio Guidelines
ClickHouse performs optimally when frequently accessed data can be cached in memory. Industry guidelines suggest:
Optimal ratio: 1:10 to 1:15 (memory to compressed data)
Acceptable ratio: 1:20 to 1:30 for mostly read-heavy workloads
Minimum ratio: 1:50 for cold storage / archival scenarios
For this use case with complex analytical queries and moderate query concurrency, we targeted a 1:12 ratio as a balance between performance and cost.
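At the final instance size this works out neatly: 128 GB of RAM per node at a 1:12 ratio covers roughly 1.5 TB of compressed data (128 GB × 12 ≈ 1,536 GB), which matches the per-node storage capacity planned below.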
2. Compression Testing with Partial Migration
Rather than estimating, we conducted empirical compression testing by analyzing existing data already loaded into a test ClickHouse environment:
Test Methodology:
Compared storage sizes between S3 (raw/processed data) and ClickHouse compressed storage
Used the ZSTD compression codec (configured in place of ClickHouse's default LZ4 for higher compression ratios)
Analyzed compression ratios across different data types
Current Data in S3:
Data mart tables: 2.5 TB
Clean tables in data lake: 2.5 TB
Raw tables in data lake: 425 GB
Total: 5.4 TB
Compression Results:
Data mart tables: 2.5 TB in S3 → 325 GB in ClickHouse (ZSTD), 7.69x compression
Clean tables: 2.5 TB in S3 → 638 GB in ClickHouse, 3.92x compression
Raw tables (via PeerDB): 425 GB in S3 → 104 GB in ClickHouse, 4.09x compression
Total: 5.4 TB → ~1.0 TB, ~5.4x average compression
Key Insight: The compression ratios varied significantly based on data characteristics. Data mart tables (highly aggregated, optimized schemas) achieved the best compression at 7.69x, while raw tables with more varied data types achieved 4.09x compression. This validated that ClickHouse's columnar storage with ZSTD compression could handle the entire dataset in approximately 1TB of storage.
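The ClickHouse-side numbers in this comparison can be read straight from system tables; a sketch of the kind of query used (database names follow the examples in this post, and note it reports ClickHouse's own uncompressed size, while the S3 figures above came from the bucket inventory):

-- Compressed vs. uncompressed bytes per table, plus the effective ratio
SELECT
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND database IN ('raw_data', 'data_marts')
GROUP BY database, table
ORDER BY sum(data_compressed_bytes) DESC;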
3. Total Data Storage Calculations
Based on empirical compression testing with actual data, we calculated storage requirements:
Current Data Volume:
Total data in S3: 5.4 TB
Measured compressed size in ClickHouse: ~1.0 TB
Growth Planning:
Current compressed storage: 1.0 TB
Expected growth over 18-24 months: Additional 300-500 GB
Target total capacity: 1.5 TB per node
Operational Overhead Buffer:
The 1.5TB allocation provides 50% buffer for:
Temporary tables during transformations
MergeTree parts during background merges
Replication queue overhead
Growth runway
Storage Allocation Decision:
Started with 250GB EBS gp3 volumes per node for initial migration
Planned expansion to 1.5TB volumes as data migration completes
Total cluster storage at full capacity: 3TB (2 replicas × 1.5TB)
Phased approach reduces initial costs during migration phase
4. Instance Selection and Memory Sizing
We followed ClickHouse's recommended memory-to-storage ratio guidelines while accounting for the workload characteristics:
Memory-to-Storage Ratio:
ClickHouse recommends 1:100 ratio for internal analytics tools with moderate query loads
For workloads with heavier concurrent queries, 1:50 ratio provides better performance
Our initial assessment targeted 1:100, with flexibility to scale up if needed
Memory Requirements:
Current uncompressed data in ClickHouse: ~170 GB
Initial compressed storage: ~1.0 TB
Using 1:100 ratio: 1 TB ÷ 100 = 10 GB minimum
Using 1:50 ratio: 1 TB ÷ 50 = 20 GB minimum
Initial Instance Type:
Started with r6g.xlarge instances (32 GB RAM each):
Provided comfortable buffer above minimum requirements
ClickHouse production recommendation: 32GB minimum
Cost-effective for initial deployment
Graviton-based for better price-performance
Scaling to Production Workload:
After observing actual query patterns and concurrency, we upgraded to r6g.4xlarge instances (128 GB RAM each):
128 GB RAM per instance handles growing workload
16 vCPUs support higher query parallelization
Better cache hit rates with more available memory
Provides headroom for 1:12 memory-to-data ratio (optimal for query-heavy workloads)
Graviton processors deliver 20-30% better price-performance vs Intel/AMD
5. Final Cluster Cost Breakdown
2× EC2 r6g.4xlarge (16 vCPU, 128 GB RAM each): ~$710 × 2 = $1,420/month
2× EBS gp3, 3,000 IOPS (250 GB initially, scaling to 1.5 TB): $24-$144 × 2 ($48/month initially, $288/month at full capacity)
3× EC2 t4g.medium for ClickHouse Keeper (2 vCPU, 4 GB RAM each): $30.95 × 3 = $93/month
1× EC2 t3.medium for PeerDB (2 vCPU, 4 GB RAM): ~$30/month
1× EC2 t3.medium for Prefect (2 vCPU, 4 GB RAM): ~$30/month
Initial infrastructure total (250 GB storage): ~$1,621/month
Full-capacity total (1.5 TB storage): ~$1,861/month
Cost Comparison:
Previous AWS stack (production): ~$5,000/month
New ClickHouse stack (initial): ~$1,621/month
New ClickHouse stack (full capacity): ~$1,861/month
Monthly savings: $3,139 - $3,379
Annual savings: ~$38,000 - $40,000
Cost reduction: ~63-68%
6. Monitoring and Validation
Post-migration, we validated capacity planning through comprehensive monitoring:
Key Performance Indicators (3-month average):
Memory utilization: 58% (well within capacity)
CPU utilization: 35% (headroom for query spikes)
Storage utilization: 42% at initial phase (growing as migration completes)
Query latency p95: 4.2 seconds (comparable to Athena)
Cache hit rate: 87% (excellent memory sizing)
Replication lag: avg 1.8 seconds, max 7 seconds
These metrics validated our capacity planning assumptions and confirmed sufficient headroom for projected growth.
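Several of these KPIs, such as the p95 query latency, can be reproduced directly from ClickHouse's query log; a sketch (the 7-day window is illustrative):

-- 95th percentile latency of completed queries over the last 7 days
SELECT quantile(0.95)(query_duration_ms) / 1000 AS p95_seconds
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date >= today() - 7;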
Infrastructure as Code with Terraform
The entire cluster was provisioned using Terraform, ensuring reproducibility and easy disaster recovery:
Terraform Modules:
vpc: Network infrastructure (using existing VPC)
clickhouse_server: EC2 instances, EBS volumes, security groups
clickhouse_keeper: Keeper cluster instances across AZs
Key Configuration Variables:
environment                     = "production"
clickhouse_shard_count          = 1
clickhouse_replica_count        = 2
clickhouse_server_instance_type = "r6g.4xlarge"
clickhouse_server_ebs_size      = 250  # Initial, scaling to 1500
clickhouse_keeper_count         = 3
clickhouse_keeper_instance_type = "t4g.medium"
Automated Configuration Generation:
A custom script generates ClickHouse configuration files from Terraform outputs:
bash generate_configs.sh \
  --cluster-name prod \
  --clickhouse-user admin \
  --clickhouse-password ${CLICKHOUSE_PASSWORD}
This generates:
XML configuration files for each ClickHouse node
Docker Compose configurations
Environment files with credentials
Deployment scripts
Key Takeaways
Migrating from AWS managed services to self-hosted ClickHouse delivered a 63-68% cost reduction (~$38-40K annual savings) while maintaining operational standards. Several factors contributed to this success:
Cost Structure Matters: AWS managed services provide convenience but charge premium prices. For stable workloads with predictable patterns, self-hosting on EC2 with open-source tools can reduce costs by 60-70% without sacrificing capabilities.
Empirical Testing Over Estimation: Actual compression testing with ZSTD yielded varied results (3.92x to 7.69x depending on data type, ~5.4x average), directly impacting capacity planning and cost projections. Always test with real data before committing to infrastructure sizing.
High Availability Doesn't Require Complexity: The 1-shard, 2-replica architecture provided production-grade availability without the operational overhead of sharded deployments. For many use cases, vertical scaling within a single shard is simpler and more cost-effective than distributed sharding.
Open-Source Ecosystem Maturity: Modern open-source tools (PeerDB, Prefect, dbt, OpenTelemetry) rival managed services in functionality. The key is investing time in proper setup, monitoring, and operational procedures.
Phased Capacity Planning: Starting with 250GB storage and r6g.xlarge instances, then scaling to 1.5TB and r6g.4xlarge based on actual usage patterns prevented both over-provisioning at launch and performance issues during growth.
Memory-to-Storage Ratios Guide Scaling: Starting with 1:100 ratio (suitable for internal analytics) and scaling to more aggressive ratios (1:12) as query patterns evolved provided a clear framework for instance sizing decisions.
Observability Is Non-Negotiable: Unlike AWS managed services with built-in monitoring, self-hosted infrastructure requires explicit investment in observability. OpenTelemetry + Last9 provided comparable visibility to CloudWatch at a fraction of the cost.
Infrastructure as Code Pays Dividends: Terraform and automated configuration generation made the cluster reproducible, testable, and recoverable. This reduces operational risk and speeds up disaster recovery scenarios.
For organizations facing similar AWS cost challenges, this migration demonstrates that moving to self-hosted open-source infrastructure is not just feasible; it can deliver substantial cost savings (a 60-70% reduction) while maintaining (or exceeding) the reliability, performance, and operational standards of managed services. The key is thorough planning, empirical testing with real data, phased scaling based on actual usage patterns, and a commitment to building proper observability and automation.