Sometimes, you encounter a problem that seems straightforward at a glance but quickly spirals into a complex challenge. This is one of those stories. It began with a common business need: connecting two different networks securely.
For one of our clients, their business operations depended on a seamless data exchange with a key partner. The workflow was critical:
The CLIENT AWS infrastructure needed to receive raw data from their PARTNER On-Prem systems through APIs.
This data was then processed by internal services to generate comprehensive reports.
Finally, the generated reports had to be sent back automatically to the partner's systems.
To achieve this level of integration while maintaining security and compliance, a Site-to-Site VPN connection was the clear solution.

Think of it as a secure, private tunnel between two office buildings. Employees in Building A can access resources in Building B as if they were in the same location, with all communication traveling through an encrypted channel over the internet.
The client's infrastructure was hosted on AWS, and we had successfully implemented similar connections before using AWS Site-to-Site VPN. The architecture was standard:
Virtual Private Gateway: The AWS side of the connection terminates at a Virtual Private Gateway attached to the client's VPC.
Customer Gateway: The partner's on-premises network connects through their gateway.
Encryption: All traffic flows through an encrypted IPsec tunnel managed by AWS.
The setup process was routine: configure the VPN gateways on both ends, exchange security credentials, establish the tunnel, and update the routing tables. But this time was different.
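The console flow is point-and-click, but the same routine can be sketched in a few boto3 calls. This is a minimal sketch only: the region, all resource IDs, and the partner's public endpoint are placeholders, and real setups also have to decide between BGP and static routing and tune tunnel options.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# 1. Virtual Private Gateway: the AWS side of the tunnel, attached to the client's VPC.
vgw = ec2.create_vpn_gateway(Type="ipsec.1")["VpnGateway"]
ec2.attach_vpn_gateway(VpcId="vpc-0123456789abcdef0", VpnGatewayId=vgw["VpnGatewayId"])

# 2. Customer Gateway: describes the partner's on-prem VPN device.
cgw = ec2.create_customer_gateway(
    Type="ipsec.1",
    PublicIp="203.0.113.10",  # placeholder for the partner's public VPN endpoint
    BgpAsn=65000,
)["CustomerGateway"]

# 3. The Site-to-Site VPN connection itself (static routing for simplicity).
vpn = ec2.create_vpn_connection(
    Type="ipsec.1",
    VpnGatewayId=vgw["VpnGatewayId"],
    CustomerGatewayId=cgw["CustomerGatewayId"],
    Options={"StaticRoutesOnly": True},
)["VpnConnection"]

# 4. Tell AWS which on-prem range sits behind the tunnel, and let the VGW
#    propagate routes into the VPC route table.
ec2.create_vpn_connection_route(
    VpnConnectionId=vpn["VpnConnectionId"],
    DestinationCidrBlock="10.0.0.0/22",  # the partner's range, which is where the trouble starts
)
ec2.enable_vgw_route_propagation(
    GatewayId=vgw["VpnGatewayId"],
    RouteTableId="rtb-0123456789abcdef0",  # placeholder route table
)
```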
The moment of truth, when everything failed
The connection was configured, credentials were exchanged, and the moment of truth arrived. They flipped the switch.
What we expected to see: traffic flowing cleanly in both directions between the two networks.
What actually happened: the VPN tunnel itself was established, and the monitoring dashboards showed it as "up." But when applications tried to communicate through it, every request failed with a timeout.
A classic networking nightmare
After diving into network logs and routing tables, we discovered the root cause. Both the CLIENT AWS VPC and the PARTNER On-Prem network were using the exact same private IP address range: 10.0.0.0/22.

Imagine telling a mail carrier to deliver a package to "123 Main Street" in a town where two different houses share that exact address. The carrier has no idea which house you mean, so the package goes nowhere.

In networking, when a router sees a packet destined for an IP address that exists on both sides of a connection, it defaults to the local route. This is a common issue in VPC peering and VPN setups that can completely halt communication.
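You can see the conflict in one line with Python's standard ipaddress module; a quick check like this before wiring up any cross-network link saves a lot of pain:

```python
import ipaddress

client_vpc = ipaddress.ip_network("10.0.0.0/22")   # CLIENT AWS VPC
partner_lan = ipaddress.ip_network("10.0.0.0/22")  # PARTNER On-Prem network

# True -> any packet destined for the partner also matches the VPC's own range,
# and the VPC's local route wins, so the traffic never enters the tunnel.
print(client_vpc.overlaps(partner_lan))
```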
A simple fix that wasn't
The PARTNER On-Prem team requested that the CLIENT AWS side change its CIDR range. The initial thought was straightforward: introduce a new, non-overlapping secondary CIDR block, 10.120.0.0/23, to the client's existing Virtual Private Cloud (VPC). The plan was to route all traffic destined for the partner through this new, unique address space.
We spun up a new subnet in this CIDR and set up a NAT (Network Address Translation) Gateway. A NAT Gateway acts like a receptionist for your network; it takes outgoing requests from the original IP range and makes them appear to come from the new IP range.
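Roughly, the AWS side of that workaround can be sketched with boto3 as below. All IDs are placeholders, a private NAT gateway keeps the traffic off the public internet, and the destination in the final route is deliberately a stand-in, because which partner addresses you can actually route to is precisely the crux of the overlap problem.

```python
import boto3

ec2 = boto3.client("ec2")
VPC_ID = "vpc-0123456789abcdef0"  # placeholder

# 1. Attach the non-overlapping secondary CIDR to the existing VPC.
ec2.associate_vpc_cidr_block(VpcId=VPC_ID, CidrBlock="10.120.0.0/23")

# 2. Carve a partner-facing subnet out of the new range.
subnet = ec2.create_subnet(VpcId=VPC_ID, CidrBlock="10.120.0.0/24")["Subnet"]

# 3. A *private* NAT gateway in that subnet translates 10.0.x.x sources
#    into 10.120.x.x addresses the partner can route back to.
natgw = ec2.create_nat_gateway(
    SubnetId=subnet["SubnetId"],
    ConnectivityType="private",
)["NatGateway"]

# 4. Steer partner-bound traffic from the app subnets through the NAT gateway.
#    The destination below is a placeholder: with both sides on 10.0.0.0/22,
#    deciding what "the partner's addresses" even are is the hard part.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",    # placeholder app route table
    DestinationCidrBlock="198.51.100.0/24",  # placeholder partner-side range
    NatGatewayId=natgw["NatGatewayId"],
)
```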
This worked for outbound traffic:
The CLIENT AWS app sends a request from its original 10.0.x.x address.
The NAT Gateway receives it and translates the source IP to a 10.120.x.x address.
The PARTNER On-Prem receives the request from 10.120.x.x (which it recognizes) and responds.
We had outbound communication working.

The real puzzle with inbound traffic
Success was short-lived. When the partner's system tried to access our client's services, we discovered a new, more subtle problem: asymmetric routing.
The partner sent their request to the client's domain, which pointed to an Application Load Balancer (ALB) in the original 10.0.x.x subnet. The application received the request and sent a response. However, the response going back to the partner took a different path: our new routing rules forced it through the NAT Gateway, which changed the source IP to the 10.120.x.x range.
Here’s what the partner’s stateful firewall saw:
Step 1: Request sent: PARTNER On-Prem (10.0.1.200) → CLIENT AWS Load Balancer (10.0.2.50)
Step 2: Response received: CLIENT AWS (10.120.1.1 via NAT) → PARTNER On-Prem (10.0.1.200)
From the partner's perspective, this was a security threat:
"I sent a request TO 10.0.2.50."
"I got a response FROM 10.120.1.1."
"These don't match. This looks like an attack!"
Result: ❌ Connection dropped.
Their stateful firewall saw a response from an IP address it never sent a request to, flagged it as suspicious, and dropped the connection. It was like calling someone on the phone but having a complete stranger call you back with the answer.
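To make the firewall's view concrete, here is a toy connection-tracking check in Python. Real stateful firewalls also track ports, TCP state, and timeouts, but the core rule is the same: a reply must come from the address the request was sent to.

```python
# Toy connection tracker: illustrative only, not a real firewall.
outstanding = set()

def request_sent(src: str, dst: str) -> None:
    outstanding.add((src, dst))  # remember who we asked

def reply_allowed(reply_src: str, reply_dst: str) -> bool:
    # A reply is accepted only if it comes FROM the host we sent TO.
    return (reply_dst, reply_src) in outstanding

request_sent("10.0.1.200", "10.0.2.50")           # partner -> client ALB
print(reply_allowed("10.0.2.50", "10.0.1.200"))   # True: symmetric path
print(reply_allowed("10.120.1.1", "10.0.1.200"))  # False: the NAT'd reply gets dropped
```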
Other solutions we tried
Before finding the right fix, we explored other common approaches.
Attempt #1: The "just change everything" idea
The most "obvious" solution is to change the client's primary VPC CIDR to something unique. In a brand-new environment, this is the correct approach. But for a live, production system, this is a non-starter. It would mean re-configuring every single resource servers, databases, load balancers, security groups and would require massive downtime and carry an enormous risk. It was pragmatically impossible.
Attempt #2: The Network Load Balancer (NLB) approach
Our next idea was to place a Network Load Balancer (NLB) in front of our existing Application Load Balancer (ALB). The architecture was:
An NLB sits in the new 10.120.x.x subnet.
The NLB forwards all traffic to the existing ALB, which is configured as its target group.
The ALB uses IP-based rules: "If traffic comes from the NLB's IP range, route it to the backend services."
This seemed promising because it would solve the asymmetric routing problem. The request would come in via the NLB and the response would flow back through the exact same path. We manually configured these IP-based routing rules on the ALB, and it worked perfectly.
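The manual piece was an extra listener rule on the existing ALB that matched on source IP. With boto3 it looks roughly like this; both ARNs below are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Manually added listener rule: "if the request arrives from the partner-facing
# 10.120.x.x range, forward it to the backend target group." This is the rule
# the ingress controller later wiped out.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                "listener/app/client-alb/50dc6c495c0c9188/f2f7dc8efc522ab2",  # placeholder
    Priority=10,
    Conditions=[{
        "Field": "source-ip",
        "SourceIpConfig": {"Values": ["10.120.0.0/23"]},
    }],
    Actions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                          "targetgroup/backend-services/73e2d6bc24d8a067",  # placeholder
    }],
)
```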

The automation problem
But then disaster struck. The client's infrastructure uses a Kubernetes ALB Ingress Controller that automatically manages ALB rules based on ingress configurations. When developers deployed the next day, this automation did exactly what it was designed to do:
Scanned the ALB configuration.
Found our manual IP-based rules, which it didn't recognize as part of its managed state.
Reset the ALB to its expected configuration, deleting our custom rules.
Broke the connection again.
We tried adding the IP-based rules directly to the Kubernetes ingress configuration, but the ALB Ingress Controller primarily supports path-based (/api/users) or host-based (api.example.com) routing, not the IP-based rules we needed for this workaround.
The final solution: building a parallel system
Fighting automation is a losing battle, and it's a core DevOps principle to work with it, not against it. Instead of forcing a manual fix, the better move is to find a solution the automation can understand and maintain.
So, we pivoted. Instead of a complex manual fix, we spun up a new, dedicated Application Load Balancer (ALB) and placed it inside the new 10.120.x.x subnet.
What we built

Dedicated ALB in New Subnet: We created a new ALB specifically for partner traffic, placed in the 10.120.x.x subnet.
Partner-Specific Domain: We set up partner-api.client.com, which resolves to the new ALB.
Kubernetes-Managed Configuration: We updated the ingress rules to manage this new ALB (a sketch follows this list), so the automation understood and maintained the entire setup.
Clean Traffic Flow: All communication with the PARTNER On-Prem, both request and response, now happens exclusively through the 10.120.x.x range.
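The exact manifest depends on the controller version and the client's conventions, but a minimal sketch of that partner ingress, written with the official Kubernetes Python client, might look like the following. The namespace, backend service name, group name, and subnet IDs are placeholders, and the annotations shown are the commonly used ALB ingress controller ones for provisioning a separate internal ALB pinned to specific subnets.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in the cluster

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(
        name="partner-api",
        namespace="default",  # placeholder namespace
        annotations={
            # Provision a *separate*, internal ALB just for partner traffic.
            "alb.ingress.kubernetes.io/group.name": "partner",  # assumed group name
            "alb.ingress.kubernetes.io/scheme": "internal",
            "alb.ingress.kubernetes.io/target-type": "ip",
            # Pin the ALB to the subnets carved out of 10.120.0.0/23.
            "alb.ingress.kubernetes.io/subnets": "subnet-0aaa,subnet-0bbb",  # placeholders
        },
    ),
    spec=client.V1IngressSpec(
        ingress_class_name="alb",
        rules=[client.V1IngressRule(
            host="partner-api.client.com",
            http=client.V1HTTPIngressRuleValue(paths=[client.V1HTTPIngressPath(
                path="/",
                path_type="Prefix",
                backend=client.V1IngressBackend(
                    service=client.V1IngressServiceBackend(
                        name="reports-api",  # hypothetical backend service
                        port=client.V1ServiceBackendPort(number=80),
                    ),
                ),
            )]),
        )],
    ),
)

client.NetworkingV1Api().create_namespaced_ingress(namespace="default", body=ingress)
```

Because everything lives in the ingress object, the controller now owns the partner ALB end to end; the next deploy reconciles it instead of deleting it.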
We realized we didn’t need to fix the existing system; we needed a parallel one. Instead of renovating a busy highway, we built a dedicated express lane for VIP traffic. Regular traffic continued uninterrupted, while partner traffic got a clean, reliable path.
The solution was elegant because it offered:
Zero downtime for existing users.
Complete isolation of partner traffic from IP conflicts.
Automation-friendliness, with no manual configurations to be deleted.
Scalability for future partners.
Technically, this worked because partner traffic now enters and exits through the same IP range (10.120.x.x), eliminating asymmetric routing, and because the entire solution is defined in Kubernetes ingress configurations that our automation understands and maintains.
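One last wiring detail: partner-api.client.com is simply a DNS alias for the dedicated ALB. If that record were managed with boto3 rather than by something like external-dns, it would look roughly like this; the hosted zone ID and ALB DNS name are placeholders:

```python
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0PLACEHOLDER",  # placeholder: hosted zone for client.com
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "partner-api.client.com",
                "Type": "A",
                "AliasTarget": {
                    # Placeholder values for the dedicated partner ALB.
                    "DNSName": "internal-partner-alb-123456.us-east-1.elb.amazonaws.com",
                    "HostedZoneId": "Z35SXDOTRQ7X7K",  # the ALB's canonical zone ID (region-specific)
                    "EvaluateTargetHealth": False,
                },
            },
        }],
    },
)
```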
Key takeaways
This was a powerful reminder of a few core engineering truths:
Look beyond the "simple" fix: The most obvious solution often solves only part of the problem. Always trace the complete data flow to uncover hidden complexities like asymmetric routing.
Automation is king, so don't fight it: When a manual fix conflicts with automation, the fix is wrong, not the automation. The best solution always works with your automated systems, not against them.
Sometimes the best fix is a new path: Instead of trying to modify a complex, live system, creating a parallel, isolated solution can be simpler, safer, and more maintainable.
Pragmatism wins: We could have architected a far more complex networking solution. But the goal wasn't theoretical perfection; it was to solve the client's business problem in a way that was reliable, secure, and maintainable for their team.
Networking puzzles like these can be frustrating, but they're also incredibly satisfying to solve. By understanding the complete system and working with existing automation, we built a solution that just works.
If you're facing your own networking knots or challenges with scaling your infrastructure, we'd love to help. Reach out to us at One2N; we enjoy a good puzzle.