🐢
A founder of a recently funded company calls you. He wants you to come in and fix webapp performance problems. You have been in similar situations before and want another thrill, so you say:
"Yes, I'll Take A Look"
But this isn't the typical problem you have faced earlier 👇
You start by asking questions about the business context, current tech stack, team structure, and their burning problems.
It's a relatively simple webapp with
React frontend
Python backend and
PostgreSQL, hosted on Heroku
The code has been built mostly by junior engineers over the last 2 years.
The data size isn't huge (a few GBs). There are 3 apps, each deployed on Heroku with its own DB.
"The Webapp Is Slow - Sometimes It Takes More Than 10 Seconds To Load A Simple Page And That Too, During Low Traffic Time"
the founder tells you.
Investigation 🧐
You get access to the Heroku dashboard and start exploring - Dynos, scaling configuration, database, cross-application communication, application routing, etc...
You're looking for observability data to find out the application RED metrics (rate, error, and duration).
You find out that the apps on Heroku communicate with each other via public URLs. Also, since Heroku treats each app as an independent component, each app gets its own DB with scaling policies, etc. The DB is also running on public IP.
Ouch!
Anyway, you check the app metrics to find some requests taking > 8 seconds. Since there's no way to trace the request across apps, you don't have a sense of the actual latency for the user. You also take a look at the monthly costs.
You can't believe what you're seeing!
For the last two months, Heroku costs have reached closer to $10k per month!
You highlight this to the founder, and he agrees that we need to reduce this cost drastically.
"Okay, This Is A Rabbit Hole And I Ain't Getting To The End Of It"
you say to yourself and start creating a document.
Solution 📝
Doc summary:
Migrate off of Heroku to AWS instead of trying to fix problems on Heroku.
Reasons:
Lack of control over infrastructure
Limited observability for databases
Inability to debug and correlate events across apps
You know the company has some credits from AWS
The company has already tried everything it can to improve performance on Heroku but wasn't successful. You spend two days coming up with a detailed plan with tasks, their dependencies, and timelines. You share this with the team so everyone is on the same page.
High-level plan
Migrate stateless services to AWS while still pointing to DBs on Heroku
Migrate stateful services to AWS
Ensure this works on Staging first, before migrating prod.
You'll also need to set up proper CI/CD for this new setup.
You set up the VPC with private and public subnets such that DBs should run in a private subnet.
You decide to go ahead with ECS considering:
the learning curve for the team,
ease of operation and
running and maintenance cost
ECS is a decent abstraction, given the context.
Okay, fast forward:
You set up CI/CD using GitHub Actions
Migrate the staging env within a couple of weeks
Set up prod and migrate stateless services on prod to AWS
For the DB, you set up live replication to RDS so that the switchover can be done without much downtime
Finally, you migrate the DB to RDS, and apps now point to DB on AWS RDS. You design the migration to be backward compatible such that, at any point, you can revert to the previous state with just a deployment (transparent to the users). This way, you're prepared for any unknowns.
Things mostly go according to plan, thanks to your robust checklist and initial planning. The migration is carried out without any downtime.
Okay, But Was It All Worth It?
Did We Actually Achieve Performance Improvements?
Results 🏁
You observe almost 3x perf improvements with more than 50% cost reduction!
To your surprise, you see more than 2x traffic increase on the app.
Most likely, users were frustrated and didn't use the site much as it was slow. Not anymore! Now, you have more active users than before!
This isn't the end of it.
You also ask the dev team to investigate and fix slow queries, check indexes, write some load tests for slow APIs, etc. In all, these efforts pay off big time. You make business metrics available to the entire team. The team is motivated, and the founder is happy!
Lessons ✅
In the 0-1 phase, Heroku works. It serves you well. Be ready to grow up when the time comes. You'll (mostly) know when.
Aim for backward-compatible migrations! Zero downtime, too, if you can get it easily.
Don't run DBs on public IPs with open ports!
In the 1-N phase, developer habits have to change. Focus on automated testing, faster feedback loops, etc.
Kubernetes is NOT the solution for every startup. They may need it eventually, but not in this case.
Evaluate build vs. buy decisions carefully.
I write such stories on software engineering. There's no specific frequency as I don't make up these. If you liked this one, you might love - 🧹A case of Terabyte scale backup and recovery solution