In this post, we’ll explore the fundamentals of spot instances, the risks associated with their use in production, and key strategies to ensure seamless operations. We’ll also dive into a real-world incident where a spot instance termination caused service downtime and share the lessons learned to prevent similar issues in the future.
Introduction:
When it comes to managing costs in cloud environments, spot instances have become a popular choice, especially in Kubernetes clusters like Amazon EKS (Elastic Kubernetes Service).
Spot instances are a fascinating option that many organizations consider to power up their applications. But before jumping on board, it's essential to truly understand what they offer and how you can harness their potential effectively.
What are Spot Instances?
Spot instances are virtual machines offered at discounted rates by AWS. They leverage unused EC2 capacity, meaning you can acquire the same servers at a much lower price than on-demand instances. However, there's a catch: these instances can be reclaimed by AWS at any time, typically with only a two-minute warning.
Risks of Using Spot Instances in Production
While AWS Spot Instances provide substantial cost savings, they come with inherent risks that can significantly impact production workloads if not managed properly. Below are some critical challenges associated with using spot instances in a production environment.
1. Unpredictable Termination
Unlike On-Demand or Reserved Instances, spot instances can be reclaimed at any time with only a two-minute warning. If the application running on a spot instance lacks proper redundancy or failover mechanisms, this can lead to unexpected service outages and disruptions.
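To make the two-minute warning concrete, here is a minimal sketch of how a process on the instance could check for it. It assumes the EC2 instance metadata service (IMDSv2) is reachable from where the code runs and reads the spot/instance-action document, which returns 404 until an interruption is scheduled. The function names are our own, chosen for illustration.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body: str) -> tuple[str, datetime]:
    """Parse the spot/instance-action JSON body into (action, UTC time)."""
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], when.replace(tzinfo=timezone.utc)


def seconds_left(when: datetime, now: datetime) -> float:
    """Seconds remaining before the instance is reclaimed (never negative)."""
    return max(0.0, (when - now).total_seconds())


def fetch_instance_action(timeout: float = 1.0) -> str | None:
    """Return the raw notice body, or None if no interruption is scheduled."""
    # IMDSv2 requires a short-lived session token before any metadata read.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice has been issued yet
            return None
        raise
```

A caller would poll fetch_instance_action() every few seconds and, once it returns a body, use parse_instance_action() and seconds_left() to decide how much time remains for draining connections or flushing state. In an EKS cluster this job is usually delegated to Karpenter's interruption handling or the AWS Node Termination Handler rather than hand-rolled code.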
2. Service Downtime
If workloads are running on a single replica or without an adequate scaling strategy, a spot instance termination can render them completely unavailable until a new instance is provisioned and the service is redeployed.
3. PodDisruptionBudget (PDB) Enforcement
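In Kubernetes, PodDisruptionBudgets (PDBs) limit how many pods of an application can be disrupted at one time. However, PDBs only govern voluntary disruptions that go through the Eviction API, such as a node drain. A spot reclamation is an involuntary disruption: AWS does not consult PDBs, and the instance is terminated when the two-minute notice expires whether or not its pods have been rescheduled. Pods can therefore be lost without regard to the defined availability constraints, which can lead to critical application failures.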
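The gap between what a PDB promises and what a reclaimed node does can be shown with a little arithmetic. The sketch below is a simplified model, not the Kubernetes controller itself: Budget, allowed_disruptions, and healthy_after_node_loss are illustrative names, and only the minAvailable form of a PDB is modelled.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Budget:
    min_available: int  # mirrors spec.minAvailable of a PodDisruptionBudget


def allowed_disruptions(healthy_pods: int, budget: Budget) -> int:
    """How many voluntary evictions the budget currently permits."""
    return max(0, healthy_pods - budget.min_available)


def healthy_after_node_loss(pods_per_node: dict[str, int], lost_node: str) -> int:
    """Healthy pods left once a node disappears; the budget is never consulted."""
    return sum(count for node, count in pods_per_node.items() if node != lost_node)


# Single replica on one spot node: the PDB blocks every voluntary eviction...
single = {"spot-a": 1}
print(allowed_disruptions(1, Budget(min_available=1)))   # 0
# ...but losing the node still takes the service to zero healthy pods.
print(healthy_after_node_loss(single, "spot-a"))         # 0

# Three replicas spread over three nodes survive the same event.
spread = {"spot-a": 1, "spot-b": 1, "on-demand-a": 1}
print(healthy_after_node_loss(spread, "spot-a"))         # 2
```

The takeaway is that a PDB is a guard against planned drains, while protection against reclamation comes from replica count and placement.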
4. Delayed Recovery
Even though the Auto Scaling group or Karpenter automatically requests replacement spot capacity, there is no guarantee that it will be immediately available. If spot capacity is unavailable in the selected instance types or Availability Zones, launching a new instance may take significantly longer, extending service downtime.
5. Data Loss
Since spot instances can terminate at any time, any data stored locally on the instance (ephemeral storage) will be lost unless it is continuously written to persistent storage (such as Amazon EBS, S3, or an external database).
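One way to keep local state from disappearing with the instance is to checkpoint it regularly to a path backed by persistent storage. The sketch below assumes the checkpoint path sits on a persistent volume (for example an EBS-backed mount). The helper names and the JSON format are illustrative choices, not a prescribed approach.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state as JSON atomically, so a sudden termination
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes down to the volume
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on failure
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or the default on a fresh start."""
    if not os.path.exists(path):
        return default
    with open(path) as fh:
        return json.load(fh)
```

A worker would call load_checkpoint() on startup and save_checkpoint() after each unit of work, so that a replacement pod resumes from the last saved offset instead of starting over.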
Real-World Incident: Spot Instance Termination in Production
On October 23, 2024, at 7:36 AM IST, users reported that a critical feature relying on the XYZ-service, the main service responsible for communication with multiple dependent services, was unavailable. The service had restarted because its underlying spot instance was terminated by AWS.
Normally, a replacement spot instance is allocated within 3 minutes, but in this case allocation took longer and led to extended downtime. The event made us question how predictable recovery from such outages really is.
We realized that relying solely on spot instances introduces uncontrollable outage times, as there’s no guarantee on how quickly a new spot node will be allocated.
PodDisruptionBudgets (PDBs) were also ineffective, because AWS terminates spot instances without respecting PDB policies.
Impact on Service Availability:
Due to these factors, the service remained unavailable until the new spot node was launched and the pod was scheduled.
To mitigate these risks in the future, we adopted the following practices:
Identify critical services and run them on on-demand capacity, covered by Reserved Instances where that keeps costs down.
For medium-priority services that can handle minor degradation, configure Karpenter to use a mix of on-demand and spot instances.
Monitor the ratio of reserved to spot instances across all services.
Educate the customer on the cost versus risk trade-offs of spot instances so that these resiliency decisions can be baked in earlier in the deployment cycle.
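As a rough illustration of the Karpenter and monitoring practices above, the sketch below builds a NodePool manifest that allows both capacity types and computes the share of spot nodes from node labels. It assumes the Karpenter v1 API (karpenter.sh/v1) and an existing EC2NodeClass named "default". The pool name, zones, and instance types are placeholders, not the values from our cluster. Note that when a NodePool allows both capacity types, Karpenter prefers spot and falls back to on-demand, so this gives a fallback rather than a fixed ratio.

```python
from __future__ import annotations

import json
from collections import Counter

CAPACITY_LABEL = "karpenter.sh/capacity-type"


def mixed_nodepool(name: str, zones: list[str], instance_types: list[str]) -> dict:
    """Build a NodePool manifest that allows both spot and on-demand capacity."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {"template": {"spec": {
            # Assumed: an EC2NodeClass called "default" already exists.
            "nodeClassRef": {"group": "karpenter.k8s.aws",
                             "kind": "EC2NodeClass", "name": "default"},
            "requirements": [
                {"key": CAPACITY_LABEL, "operator": "In",
                 "values": ["spot", "on-demand"]},
                # Several zones and instance types widen the capacity pools
                # that a replacement node can be launched from.
                {"key": "topology.kubernetes.io/zone", "operator": "In",
                 "values": zones},
                {"key": "node.kubernetes.io/instance-type", "operator": "In",
                 "values": instance_types},
            ],
        }}},
    }


def spot_share(node_labels: list[dict]) -> float:
    """Fraction of nodes labelled as spot; unlabelled nodes count as non-spot."""
    counts = Counter(labels.get(CAPACITY_LABEL, "unknown") for labels in node_labels)
    total = sum(counts.values())
    return counts["spot"] / total if total else 0.0


pool = mixed_nodepool("medium-priority",
                      ["ap-south-1a", "ap-south-1b"],
                      ["m5.large", "m5a.large", "m6i.large"])
print(json.dumps(pool, indent=2))  # JSON is valid input for kubectl apply -f
```

Feeding spot_share() with the labels of all nodes in the cluster gives a single number that can be tracked on a dashboard or alerted on when it drifts above the level the team has agreed to.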
Conclusion:
While AWS Spot Instances offer cost advantages, they require careful configuration to ensure stability in production. By incorporating on-demand fallback, increasing redundancy, implementing strong observability tools, and optimizing failover strategies, you can balance cost savings with high availability. Our recent incident highlighted the importance of these best practices and led to infrastructure improvements that make us more resilient to spot instance interruptions.