#AWS

#Infrastructure-As-Code

#Best-Practices

#SRE

Feb 24, 2025 | 4 min read

Gotchas to avoid when using spot instances in production

Aashi Rathore

SRE @One2N

Introduction:

When it comes to managing costs in cloud environments, spot instances have become a popular choice, especially for Kubernetes clusters running on Amazon EKS (Elastic Kubernetes Service).

Many organizations consider spot instances for running their applications. But before adopting them, it's essential to understand what they offer and how to use them effectively.

What are spot instances?

Spot instances are virtual machines offered at discounted rates by AWS. They use spare EC2 capacity, meaning you can acquire the same servers at a much lower price than on-demand instances. However, there's a catch: these instances can be reclaimed by AWS at any time with minimal notice, typically just two minutes.

Risks of using spot instances in production

While AWS Spot Instances provide substantial cost savings, they come with inherent risks that can significantly impact production workloads if not managed properly. Below are some critical challenges associated with using spot instances in a production environment.

1. Unpredictable termination

Unlike On-Demand or Reserved Instances, spot instances can be reclaimed at any time with only a two-minute warning. If the application running on a spot instance lacks proper redundancy or failover mechanisms, this can lead to unexpected service outages and disruptions.
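
As a rough illustration of how a workload can notice the two-minute warning, here is a minimal Python sketch that reads the interruption notice from the EC2 instance metadata service (IMDSv2). The metadata path `spot/instance-action` is the one AWS documents; the function names, the timeout, and the sample timestamp are assumptions made for this example, and in an EKS cluster this job is usually handled by a node termination handler or Karpenter rather than application code.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

# IMDS path for spot interruption notices; it returns 404 until a notice is issued.
INSTANCE_ACTION_PATH = "/latest/meta-data/spot/instance-action"


def parse_instance_action(status_code: int, body: str) -> Optional[dict]:
    """Return the interruption notice as a dict, or None if no notice is pending."""
    if status_code == 404:
        return None
    if status_code != 200:
        raise RuntimeError(f"unexpected metadata response: {status_code}")
    notice = json.loads(body)
    # Expected shape: {"action": "terminate", "time": "2025-01-01T00:02:00Z"}
    notice["time"] = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return notice


def seconds_until(notice: dict, now: datetime) -> float:
    """Seconds left before the instance is reclaimed (never negative)."""
    return max(0.0, (notice["time"] - now).total_seconds())


def fetch_instance_action(
    base_url: str = "http://169.254.169.254", timeout: float = 1.0
) -> Optional[dict]:
    """Query IMDSv2 from inside the instance. Only works on EC2."""
    token_request = urllib.request.Request(
        base_url + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()

    request = urllib.request.Request(
        base_url + INSTANCE_ACTION_PATH,
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_instance_action(response.status, response.read().decode())
    except urllib.error.HTTPError as error:
        # A 404 here simply means no interruption is scheduled yet.
        return parse_instance_action(error.code, "")
```

A process could call `fetch_instance_action()` every few seconds and, once it returns a notice, stop accepting new work and flush state within the time reported by `seconds_until`.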

2. Service downtime

If workloads are running on a single replica or without an adequate scaling strategy, a spot instance termination can render them completely unavailable until a new instance is provisioned and the service is redeployed.
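
To make this risk concrete, the following sketch flags services whose replicas all sit on a single spot node, so that one reclaim would take the whole service down. The input format (tuples of service, node, and capacity type) and the service and node names are hypothetical; in practice this data would come from the Kubernetes API.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (service name, node name, capacity type: "spot" or "on-demand")
Placement = Tuple[str, str, str]


def services_at_risk(pods: Iterable[Placement]) -> List[str]:
    """Return services whose every replica runs on one and the same spot node."""
    nodes_by_service = defaultdict(set)
    for service, node, capacity_type in pods:
        nodes_by_service[service].add((node, capacity_type))

    return sorted(
        service
        for service, nodes in nodes_by_service.items()
        # Exactly one node hosts the service, and that node is a spot node.
        if len(nodes) == 1 and next(iter(nodes))[1] == "spot"
    )


# Hypothetical placements for illustration.
pods = [
    ("xyz-service", "node-a", "spot"),       # single replica on spot
    ("api", "node-a", "spot"),
    ("api", "node-b", "on-demand"),          # second replica elsewhere
    ("worker", "node-c", "on-demand"),
]
print(services_at_risk(pods))
```

Running a check like this periodically gives an early list of services that need more replicas or a different node placement.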

3. PodDisruptionBudget (PDB) enforcement

In Kubernetes, PodDisruptionBudgets (PDBs) limit how many pods can be disrupted at a time, but they only govern voluntary disruptions such as node drains. A spot reclaim is an involuntary disruption: AWS does not consult PDB policies when reclaiming spot instances, meaning pods may be lost without adhering to defined availability constraints. This can lead to critical application failures.
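
The difference can be shown with a small simulation. This is only a sketch of the arithmetic, not Kubernetes code: `eviction_allowed` mirrors the check applied to a voluntary eviction under a `minAvailable` budget, while `healthy_after_node_loss` models a node disappearing with all of its pods. The node names and pod counts are made up.

```python
from typing import Dict


def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """Voluntary disruption: allowed only if the budget still holds afterwards."""
    return healthy_pods - 1 >= min_available


def healthy_after_node_loss(pods_per_node: Dict[str, int], lost_node: str) -> int:
    """Involuntary disruption: every pod on the lost node goes away at once."""
    return sum(count for node, count in pods_per_node.items() if node != lost_node)


# Hypothetical service: 3 replicas, budget of minAvailable = 2.
pods_per_node = {"spot-a": 2, "spot-b": 1}
min_available = 2

# A drain may evict one pod (3 -> 2), but is blocked from evicting a second.
print(eviction_allowed(3, min_available))
print(eviction_allowed(2, min_available))

# A reclaim of spot-a removes two pods together, leaving the service below budget.
remaining = healthy_after_node_loss(pods_per_node, "spot-a")
print(remaining, remaining >= min_available)
```

Spreading replicas across more nodes reduces how far below the budget a single reclaim can push the service, which a PDB alone cannot do.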

4. Delayed recovery

Even though replacement spot instances are requested automatically (for example by an Auto Scaling group or Karpenter), there is no guarantee that new capacity will be immediately available. If spot capacity is unavailable in the selected instance types or Availability Zones, launching a new instance may take significantly longer, extending service downtime.
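
One way to think about shortening recovery is to widen the set of capacity pools and fall back to on-demand when no spot pool has capacity. The sketch below only shows that ordering logic; `try_launch` is a hypothetical callable standing in for whatever actually requests an instance, and the instance types and zones are illustrative values.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# (capacity type, instance type, availability zone)
Pool = Tuple[str, str, str]


def build_pools(instance_types: List[str], zones: List[str]) -> List[Pool]:
    """List every spot pool first, then the same pools as on-demand fallback."""
    spot = [("spot", itype, zone) for itype in instance_types for zone in zones]
    on_demand = [("on-demand", itype, zone) for itype in instance_types for zone in zones]
    return spot + on_demand


def launch_with_fallback(
    pools: Iterable[Pool], try_launch: Callable[[Pool], bool]
) -> Optional[Pool]:
    """Return the first pool where a launch succeeds, or None if all fail."""
    for pool in pools:
        if try_launch(pool):
            return pool
    return None


pools = build_pools(["m5.large", "m5a.large"], ["ap-south-1a", "ap-south-1b"])

# Simulate a period with no spot capacity anywhere: only on-demand succeeds.
chosen = launch_with_fallback(pools, lambda pool: pool[0] == "on-demand")
print(chosen)
```

With more instance types and zones in the list, the chance that at least one spot pool has capacity goes up, and the on-demand entries bound the worst case.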

5. Data loss

Since spot instances can terminate at any time, any data stored locally on the instance (ephemeral storage) will be lost unless it is continuously written to persistent storage (such as Amazon EBS, S3, or an external database).
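
A minimal sketch of that idea in Python: progress is written to a checkpoint file that is assumed to live on a persistent volume (for example an EBS-backed mount), using a write-then-rename so that an interruption mid-write does not leave a corrupt file. The function names and the checkpoint format are assumptions for this example.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state to path atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Read the last checkpoint, or return default on a fresh start."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

A replacement instance that mounts the same volume can call `load_checkpoint` on startup and resume from the last saved state instead of starting over.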

Real-world incident: spot instance termination in production

On October 23, 2024, at 7:36 AM IST, users reported that a critical feature relying on XYZ-service, the main service responsible for communication with multiple dependent services, was unavailable. The service had restarted because its underlying spot instance was terminated by AWS.

Normally, a replacement spot instance is allocated within 3 minutes. In this case, allocation took longer, which led to extended downtime. This event made us question how predictable recovery from such outages really is.

We realized that relying solely on spot instances introduces uncontrollable outage times, as there’s no guarantee on how quickly a new spot node will be allocated.

PodDisruptionBudgets (PDB) were ineffective, as AWS terminates spot instances without respecting PDB policies.

Impact on service availability:

Due to these factors, the service remained unavailable until the new spot node was launched and the pod was scheduled.

To mitigate these risks in the future, we ensured that we followed these practices:

Identify critical services and use reserved instances for them.
For medium-priority services that can handle minor degradation, configure Karpenter to use a mix of on-demand and spot instances.
Monitor the ratio of reserved to spot instances across all services.
Educate the customer on the cost vs. risk trade-offs of spot instances so that these resiliency decisions can be baked in earlier in the deployment cycle.
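
For the monitoring practice, here is a small sketch that computes the share of spot nodes from node labels. It relies on the capacity-type labels set by Karpenter (`karpenter.sh/capacity-type`) and by EKS managed node groups (`eks.amazonaws.com/capacityType`). Note the assumption: node labels only distinguish spot from on-demand, so this sketch counts on-demand nodes as the non-spot side of the ratio. The sample label data is made up.

```python
from collections import Counter
from typing import Dict, Iterable

# Label keys that carry the capacity type on Karpenter and EKS managed nodes.
CAPACITY_LABELS = ("karpenter.sh/capacity-type", "eks.amazonaws.com/capacityType")


def capacity_type(labels: Dict[str, str]) -> str:
    """Normalise the label value to "spot", "on-demand", or "unknown"."""
    for key in CAPACITY_LABELS:
        if key in labels:
            # EKS uses SPOT / ON_DEMAND, Karpenter uses spot / on-demand.
            return labels[key].lower().replace("_", "-")
    return "unknown"


def spot_share(nodes: Iterable[Dict[str, str]]) -> float:
    """Fraction of nodes that are spot, between 0.0 and 1.0."""
    counts = Counter(capacity_type(labels) for labels in nodes)
    total = sum(counts.values())
    return counts["spot"] / total if total else 0.0


# Hypothetical label sets, one dict per node.
nodes = [
    {"karpenter.sh/capacity-type": "spot"},
    {"karpenter.sh/capacity-type": "on-demand"},
    {"eks.amazonaws.com/capacityType": "SPOT"},
    {"eks.amazonaws.com/capacityType": "ON_DEMAND"},
]
print(f"spot share: {spot_share(nodes):.0%}")
```

Exporting this number as a metric and alerting when it crosses an agreed threshold keeps the cost versus risk decision visible over time.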

Conclusion:

While AWS Spot Instances offer cost advantages, they require careful configuration to ensure stability in production. By incorporating on-demand fallback, increasing redundancy, implementing strong observability tools, and optimizing failover strategies, you can balance cost savings with high availability. Our recent incident highlighted the importance of these best practices, leading to improvements in our infrastructure that ensure better resilience against spot instance interruptions.
