Context.
The client is a no-code platform for building mobile apps for Shopify stores. The platform is hosted on AWS and serves customers across the globe. The company has been in operation for 7+ years and has a 30+ member dev team with limited DevOps expertise.
Problem Statement.
Outcome/Impact.
Solution.
Cost reduction over time
The cost reduction took place over a 6-month period. It did not slow product iteration: the dev team kept shipping features in parallel.
We started the cost reduction exercise by analyzing the AWS billing export for the last few months, breaking down the spend by region and by AWS service.
We found that the Compute and RDS costs were abnormally high. We worked closely with the dev team to dig deeper into each major cost category.
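The breakdown described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the engagement: the column names (`service`, `region`, `cost_usd`) and the sample figures are hypothetical, and a real AWS billing export has many more columns with different header names.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, heavily simplified billing export used only for illustration.
SAMPLE_EXPORT = """service,region,cost_usd
AmazonEC2,us-east-1,1200.50
AmazonRDS,us-east-1,900.00
AmazonEC2,eu-west-1,300.25
AmazonS3,us-east-1,45.10
AmazonRDS,eu-west-1,150.00
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def cost_by(rows, key):
    """Sum cost_usd per distinct value of the given column, highest first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row["cost_usd"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    rows = load_rows(SAMPLE_EXPORT)
    for service, total in cost_by(rows, "service"):
        print(f"{service:12s} {total:10.2f}")
```

Sorting by total spend puts the largest categories first, which is how categories such as Compute and RDS would stand out for a closer look.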
We optimized Compute costs in the following ways.
Removed numerous unattached EBS volumes left behind by terminated instances, saving $2k/month. We also identified the root cause and refined the process for creating new instances.
Optimized other unused Compute resources. For EC2, we reviewed CPU usage and optimized instances running at under 50% CPU utilization.
Further examined memory usage for the same EC2 instances, optimizing those with less than 50% memory utilization.
Identified overprovisioned VMs and implemented a zero-downtime approach for downsizing, ensuring no impact on user traffic.
Compute cost reduction
Overall cost reduction for Compute resources was 42%.
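The two Compute checks above (unattached volumes, and instances under 50% CPU and memory utilization) can be expressed as simple filters. The sketch below is an assumption-laden illustration: `InstanceUsage` and both function names are hypothetical, the choice of peak utilization as the metric is ours rather than the source's, and the volume dicts only mirror the `VolumeId`, `State`, `Size` and `Attachments` fields that EC2 reports for a volume.

```python
from dataclasses import dataclass


@dataclass
class InstanceUsage:
    """Hypothetical per-instance summary collected over a review window."""
    instance_id: str
    peak_cpu_pct: float     # assumed metric: peak CPU utilization
    peak_memory_pct: float  # assumed metric: peak memory utilization


def downsizing_candidates(usages, cpu_limit=50.0, memory_limit=50.0):
    """Return IDs of instances that stay below both thresholds."""
    return [
        u.instance_id
        for u in usages
        if u.peak_cpu_pct < cpu_limit and u.peak_memory_pct < memory_limit
    ]


def unattached_volumes(volumes):
    """Return (volume_id, size_gib) for volumes with no attachment.

    An unattached EBS volume is in the "available" state and has an
    empty attachment list.
    """
    return [
        (v["VolumeId"], v["Size"])
        for v in volumes
        if v["State"] == "available" and not v.get("Attachments")
    ]
```

Requiring both thresholds keeps an instance that is light on CPU but heavy on memory out of the candidate list, matching the two-step review described above.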
Next, we looked at how to reduce RDS costs. For this, we did the following.
RDS cost reduction
The existing RDS had occasional CPU spikes and couldn’t be downsized directly.
We enabled and analyzed slow query logs and worked with the dev team to implement application-side fixes. This made the load on the DB more predictable, after which we could optimize the database infrastructure.
We downsized the RDS instance types using a zero-downtime strategy to avoid any business impact on users.
We also renegotiated the Reserved Instance plans to reflect the change in instance types.
Overall cost reduction for RDS was 56%.
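One common way to work through slow query logs is to group entries by query shape, so that a query run thousands of times with different values shows up as a single line item. The sketch below illustrates that idea only; the source does not say which database engine or tooling was used, so it assumes the log has already been parsed into hypothetical `(sql, duration_seconds)` pairs, and the `fingerprint` normalization is deliberately crude.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Collapse literals so queries differing only in values group together."""
    sql = re.sub(r"'[^']*'", "?", sql)          # quoted string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)  # standalone numeric literals
    sql = re.sub(r"\s+", " ", sql)              # normalize whitespace
    return sql.strip().lower()


def summarize(entries):
    """Aggregate (sql, duration_seconds) pairs by fingerprint.

    Returns (fingerprint, call_count, total_seconds) tuples sorted by
    total time, so the most expensive query shapes come first.
    """
    stats = defaultdict(lambda: [0, 0.0])
    for sql, seconds in entries:
        bucket = stats[fingerprint(sql)]
        bucket[0] += 1
        bucket[1] += seconds
    rows = [(fp, count, total) for fp, (count, total) in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Ranking by total time rather than by the single slowest execution tends to surface the queries that contribute most to sustained load, which are the ones worth taking to the application team first.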
The next highest cost was AWS data transfer.
We used VPC flow logs to find the resources that exchanged the most traffic with each other and moved them into the same Availability Zones, which reduced intra-region data transfer costs.
We also found that background workers were processing some jobs more than once. We worked with the dev team to identify the duplicate jobs and stop them.
We also implemented VPC endpoints for OpenSearch, the application log store, which saved further data transfer costs.
Data transfer cost reduction
Overall data transfer cost reduction was 90%.
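The flow log analysis above amounts to summing bytes per pair of endpoints and keeping only the pairs that sit in different Availability Zones. The sketch below assumes records in the default VPC flow log format (version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status); the `az_of_ip` mapping from private IP to zone is a hypothetical input that would have to be built separately from the resource inventory.

```python
from collections import Counter


def cross_az_traffic(lines, az_of_ip):
    """Rank endpoint pairs in different zones by bytes exchanged.

    lines: flow log records in the default space-separated format.
    az_of_ip: hypothetical dict mapping private IP -> Availability Zone.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        # Skip malformed records and NODATA/SKIPDATA entries.
        if len(fields) < 14 or fields[13] != "OK":
            continue
        src, dst, byte_count = fields[3], fields[4], int(fields[9])
        src_az, dst_az = az_of_ip.get(src), az_of_ip.get(dst)
        # Ignore unknown endpoints and traffic that stays inside one zone.
        if src_az is None or dst_az is None or src_az == dst_az:
            continue
        # Treat A->B and B->A as the same conversation.
        totals[tuple(sorted((src, dst)))] += byte_count
    return totals.most_common()
```

If flow logs are enabled on both ends of a conversation, the same traffic can be recorded twice, so the totals are best read as a relative ranking rather than an exact byte count.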
For cloud governance, we did the following.
We streamlined IAM access via groups and policies and ensured that no permissions are assigned directly to any AWS user.
We set billing alerts and cost anomaly alerts for important AWS services.
We tagged all the important resources using labels. This enabled continuous cost monitoring per environment.
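Per-environment cost monitoring depends on every resource carrying a usable tag. As a rough illustration of that idea, the sketch below rolls up cost by a tag and reports resources that lack it. The record shape (`id`, `tags`, `monthly_cost_usd`) and the `environment` tag key are hypothetical and not taken from the client's setup.

```python
from collections import defaultdict


def cost_per_environment(resources, tag_key="environment"):
    """Roll up monthly cost by tag value and list resources missing the tag.

    resources: iterable of dicts with hypothetical keys
    "id", "tags" (dict) and "monthly_cost_usd".
    Returns (totals_by_environment, untagged_resource_ids).
    """
    totals = defaultdict(float)
    untagged = []
    for resource in resources:
        env = resource.get("tags", {}).get(tag_key)
        if env is None:
            # Untagged spend cannot be attributed, so surface it explicitly.
            untagged.append(resource["id"])
        else:
            totals[env] += resource["monthly_cost_usd"]
    return dict(totals), untagged
```

Returning the untagged resources alongside the totals makes gaps in tagging visible instead of silently dropping that spend from the report.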
Watch this talk for a detailed understanding of the work, presented at the Cloud Cost Conference in July 2023.