Sometimes, you come across a problem that seems straightforward but quickly turns into a complex networking puzzle. This is one of those stories. It all starts with a common business need: connecting two different networks securely. For one of our clients in the health-tech industry, setting up a Site-to-Site VPN with a key partner wasn’t just a “nice-to-have”; it was essential for their operations.

The connection was configured, credentials were exchanged, and the moment of truth arrived. They flipped the switch. Nothing. The connection was dead on arrival.

Diagnosing the problem

The usual diagnostics came first: dig output for DNS resolution, requests to the service's /ping endpoint, and curl -v traces.

The culprit was a classic networking nightmare for anyone managing cloud infrastructure: both our client and their partner were using the exact same private CIDR range (10.0.0.0/22).


[diagram] A simple diagram showing two clouds, "Client VPC" and "Partner VPC." Both have the label "CIDR: 10.0.0.0/22". A VPN tunnel connects them with a large red "X" over it, labeled "IP CONFLICT." [diagram]

Imagine telling a mail carrier to deliver a package to '123 Main Street' in a town where two different houses have that exact same address. Where does the package go? It goes nowhere.

In networking, when a router sees a packet for an IP address that exists on both sides of a connection, it has no idea where to send it. This is a common issue in VPC peering and VPN setups, and it can halt all communication.
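To make the conflict concrete, here is a tiny Python check using the standard ipaddress module. The CIDRs are the ones from this story; the script itself is only an illustration:

```python
import ipaddress

# Both VPCs in this story used the same private range.
client_vpc = ipaddress.ip_network("10.0.0.0/22")
partner_vpc = ipaddress.ip_network("10.0.0.0/22")

# overlaps() is True when any address belongs to both networks,
# meaning a router cannot tell which side a destination lives on.
print(client_vpc.overlaps(partner_vpc))          # True -> routing is ambiguous
print(client_vpc.num_addresses, "addresses collide")

# The secondary range introduced later does not collide with the partner's.
secondary = ipaddress.ip_network("10.120.0.0/22")
print(secondary.overlaps(partner_vpc))           # False -> safe to route
```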

First Attempt: A NAT gateway for Outbound Traffic

The initial thought was simple: give our client a new, unique address space. We added a secondary CIDR range, 10.120.0.0/22, to their VPC. The plan was to make all traffic heading to the partner look like it came from this new, unique address space.

This solved half the problem. The outbound half.

We set up a NAT (Network Address Translation) Gateway in the new subnet. A NAT Gateway acts like a receptionist for your network. When an application in the old 10.0.x.x range sent data to the partner, the NAT Gateway would change the "from" address to its own IP in the new 10.120.x.x range before sending it out.
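If you want a feel for the moving parts, here is a rough boto3 sketch of that setup. The VPC ID, region, subnet layout, and availability zone are placeholders, and the route-table and VPN wiring are deliberately omitted; treat it as an outline of the approach, not the client's actual configuration.

```python
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")  # region is a placeholder

VPC_ID = "vpc-0123456789abcdef0"  # hypothetical VPC ID

# 1. Attach the non-conflicting secondary CIDR to the existing VPC.
ec2.associate_vpc_cidr_block(VpcId=VPC_ID, CidrBlock="10.120.0.0/22")

# 2. Carve a subnet out of the new range to host the NAT gateway.
subnet = ec2.create_subnet(
    VpcId=VPC_ID,
    CidrBlock="10.120.0.0/24",
    AvailabilityZone="ap-south-1a",  # placeholder AZ
)["Subnet"]

# 3. A *private* NAT gateway translates source IPs without needing an
#    Elastic IP, so traffic leaving for the partner appears to come
#    from the 10.120.x.x range.
nat_gw = ec2.create_nat_gateway(
    SubnetId=subnet["SubnetId"],
    ConnectivityType="private",
)["NatGateway"]

print("NAT gateway:", nat_gw["NatGatewayId"])
# The application subnets' route tables would then send partner-bound
# traffic through this NAT gateway and on to the Site-to-Site VPN.
```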


[diagram] The flow looked like this: Application (10.0.x.x) → NAT Gateway (10.120.x.x) → VPN → Partner. The path should be highlighted with a green arrow labeled "Outbound: SUCCESS." [diagram]

It worked! The partner received the request from an IP address they could recognize. We had outbound communication.

The Real Problem: Asymmetric routing breaks inbound connections

When the partner’s system tried to send data back, the connection would break. This is where many VPN troubleshooting efforts get stuck.

Here’s why: The partner sent their request to the client's domain, which pointed to an Application Load Balancer (ALB) in the original 10.0.x.x subnet. The application got the request and sent a response. But the response took a different path. Our new routing rules forced it through the NAT Gateway, which again changed the source IP to the 10.120.x.x range.

From the partner's perspective, this was bizarre. They sent a message to an address in the 10.0.x.x range but got a reply from a completely different address in the 10.120.x.x range. This is known as asymmetric routing, and it’s a deal-breaker for stateful connections like TCP.

Imagine calling a friend from your phone, but they call you back from a blocked number to give you the answer. You’d probably ignore it, right? That’s exactly what the partner’s firewall did. It saw a response from an IP address it never sent a request to, thought it was suspicious, and dropped the connection.
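A toy model of what the partner's stateful firewall is doing makes the drop easy to see. This is purely illustrative Python with made-up endpoint names, not any real firewall's implementation:

```python
# Toy connection-tracking table, as a stateful firewall might keep one.
# Entries record connections that the partner's side initiated.
conntrack = set()

def record_outbound(local, remote):
    """The partner opens a connection from `local` to `remote`."""
    conntrack.add((local, remote))

def check_inbound(remote, local):
    """Accept a reply only if it comes back from the address we contacted."""
    return "ACCEPT" if (local, remote) in conntrack else "DROP"

# The partner sends a request to the client's ALB in the 10.0.x.x range...
record_outbound("partner-app", "client-alb-10.0.x.x")

# ...but the reply arrives from the NAT gateway's 10.120.x.x address.
print(check_inbound("client-nat-10.120.x.x", "partner-app"))  # DROP
print(check_inbound("client-alb-10.0.x.x", "partner-app"))    # ACCEPT
```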


[diagram] A diagram showing two paths. Path 1 (Request): Partner → VPN → ALB (10.0.x.x) → App. Path 2 (Response): App → NAT Gateway (10.120.x.x) → VPN → Partner. The response path should have a large red "X" on it, labeled "CONNECTION FAILED: Source IP Mismatch." [diagram]

The search for a solution: other attempts we tried

Before we found the right fix, we explored a few other common approaches.

Attempt #1: The "Just Change Everything" Idea. The most "obvious" solution is to change the client's primary VPC CIDR to something unique. In a brand-new environment, this is the right call. But for a live, production system, it is a non-starter: it would mean re-configuring every single resource (servers, databases, load balancers, security groups), would require massive downtime, and would carry enormous risk. It was pragmatically impossible.

Attempt #2: The Network Load Balancer (NLB). Our next thought was to use a Network Load Balancer. On paper, this is a great fit: unlike an ALB, an NLB can preserve the source IP address of incoming connections, which would solve the asymmetric routing problem. We set one up and manually configured the rules. It worked!

But then, the next time the developers deployed their application, it broke again. The client's infrastructure was managed by a Kubernetes ALB Ingress Controller. This piece of automation is great for managing web traffic, but it is also very particular. It saw our manually created load balancer rules, didn't recognize them as part of its own configuration, and simply deleted them.

The final solution: Working with automation, not against it

Fighting your automation is a losing battle, and a key DevOps principle is to work with it, not against it. Instead of forcing a manual fix, the best move is to find a solution the automation can understand.

So, we pivoted. Instead of a complex manual fix, we spun up a new, dedicated Application Load Balancer (ALB) and placed it inside the new 10.120.0.0/22 subnet.

  1. Dedicated ALB: This new ALB was created just to handle traffic from this partner.

  2. Updated Ingress Rules: We updated the Kubernetes ingress configuration for the partner-facing service to use this new ALB. This was a change the automation understood and accepted.

  3. Clean Traffic Flow: The domain the partner connects to now resolves directly to the new ALB in the 10.120.x.x range.

Now, the entire conversation, both request and response, happens through an endpoint in the non-conflicting 10.120.x.x range. The connection is stable. Best of all, the solution is defined in code and managed by the existing automation, making it a robust and maintainable fix.
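For context, the ingress change in step 2 usually comes down to annotating a dedicated Ingress so the AWS Load Balancer Controller provisions the ALB in the new subnets. Below is a minimal sketch using the official Kubernetes Python client; the names, host, namespace, and subnet ID are placeholders, and the exact annotations depend on the controller version in use.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# Ingress dedicated to the partner-facing service. The annotations ask the
# AWS Load Balancer Controller for an internal ALB in the new subnet.
# All names and IDs below are placeholders.
ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(
        name="partner-ingress",
        namespace="default",
        annotations={
            "alb.ingress.kubernetes.io/scheme": "internal",
            "alb.ingress.kubernetes.io/target-type": "ip",
            "alb.ingress.kubernetes.io/subnets": "subnet-0abc123partner",
        },
    ),
    spec=client.V1IngressSpec(
        ingress_class_name="alb",
        rules=[
            client.V1IngressRule(
                host="partner.example.internal",
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/",
                            path_type="Prefix",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="partner-facing-service",
                                    port=client.V1ServiceBackendPort(number=80),
                                )
                            ),
                        )
                    ]
                ),
            )
        ],
    ),
)

client.NetworkingV1Api().create_namespaced_ingress(namespace="default", body=ingress)
```

Because the Ingress itself carries the subnet and scheme, the controller owns the new ALB end to end, which is exactly why this setup survives subsequent deployments.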

[diagram] A clean diagram showing a symmetric flow: Partner ↔ VPN ↔ New ALB (in 10.120.x.x subnet) ↔ Application (in 10.0.x.x subnet). The entire path should have a green, double-headed arrow labeled "SUCCESSFUL CONNECTION." [diagram]


Key takeaways for troubleshooting cloud networking issues

This was a powerful reminder of a few core engineering truths for anyone dealing with cloud networking and scalability:

  • Look Beyond the Obvious Fix: The simplest solution often only solves half the problem. You have to trace the entire path of a request and its response to see the full picture and diagnose issues like asymmetric routing.

  • Don't Fight Your Automation: When a manual fix conflicts with your automation, it’s a sign that the fix is wrong, not the automation. The better, more resilient solution is always one that works with your automated systems, following DevOps principles.

  • Pragmatism Wins: The goal isn't theoretical perfection. It's about solving the business problem in a way that is reliable, secure, and maintainable.

Networking puzzles like these can be frustrating, but they are also incredibly satisfying to solve. By digging past the initial symptoms and understanding the complete system, we were able to untangle the knot and build a solution that just works.

If you're facing your own networking knots or challenges with scaling your infrastructure on AWS, we'd love to help. Reach out to us at One2N. We enjoy a good puzzle.
