Sometimes, you encounter a problem that seems straightforward at a glance but quickly spirals into a complex challenge. This is one of those stories. It began with a common business need: connecting two different networks securely.
For one of our clients, their business operations depended on a seamless data exchange with a key partner. The workflow was critical:
The CLIENT AWS infrastructure needed to receive raw data from their PARTNER On-Prem systems through APIs.
This data was then processed by internal services to generate comprehensive reports.
Finally, the generated reports had to be sent back automatically to the partner's systems.
To achieve this level of integration while maintaining security and compliance, a Site-to-Site VPN connection was the clear solution.

Think of it as a secure, private tunnel between two office buildings. Employees in Building A can access resources in Building B as if they were in the same location, with all communication traveling through an encrypted channel over the internet.
The client's infrastructure was hosted on AWS, and we had successfully implemented similar connections before using AWS Site-to-Site VPN. The architecture was standard:
Virtual Private Gateway: The AWS side of the connection terminates at a Virtual Private Gateway attached to the client's VPC.
Customer Gateway: The partner's on-premises network connects through their gateway.
Encryption: All traffic flows through an encrypted IPsec tunnel managed by AWS.
The setup process was routine: configure the VPN gateways on both ends, exchange security credentials, establish the tunnel, and update the routing tables. But this time was different.
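The console flow is point-and-click, but the same routine can be sketched in a few boto3 calls. This is a minimal sketch only: the region, all resource IDs, and the partner's public endpoint are placeholders, and real setups also have to decide between BGP and static routing and tune tunnel options.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# 1. Virtual Private Gateway: the AWS side of the tunnel, attached to the client's VPC.
vgw = ec2.create_vpn_gateway(Type="ipsec.1")["VpnGateway"]
ec2.attach_vpn_gateway(VpcId="vpc-0123456789abcdef0", VpnGatewayId=vgw["VpnGatewayId"])

# 2. Customer Gateway: describes the partner's on-prem VPN device.
cgw = ec2.create_customer_gateway(
    Type="ipsec.1",
    PublicIp="203.0.113.10",  # placeholder for the partner's public VPN endpoint
    BgpAsn=65000,
)["CustomerGateway"]

# 3. The Site-to-Site VPN connection itself (static routing for simplicity).
vpn = ec2.create_vpn_connection(
    Type="ipsec.1",
    VpnGatewayId=vgw["VpnGatewayId"],
    CustomerGatewayId=cgw["CustomerGatewayId"],
    Options={"StaticRoutesOnly": True},
)["VpnConnection"]

# 4. Tell AWS which on-prem range sits behind the tunnel, and let the VGW
#    propagate routes into the VPC route table.
ec2.create_vpn_connection_route(
    VpnConnectionId=vpn["VpnConnectionId"],
    DestinationCidrBlock="10.0.0.0/22",  # the partner's range, which is where the trouble starts
)
ec2.enable_vgw_route_propagation(
    GatewayId=vgw["VpnGatewayId"],
    RouteTableId="rtb-0123456789abcdef0",  # placeholder route table
)
```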
The moment of truth, when everything failed
The connection was configured, credentials were exchanged, and the moment of truth arrived. They flipped the switch.
What we expected to see: traffic flowing cleanly in both directions between the two networks.
What actually happened: the VPN tunnel itself was established, and the monitoring dashboards showed it as "up." But when applications tried to communicate through it, every request failed with a timeout.
A classic networking nightmare
After diving into network logs and routing tables, we discovered the root cause. Both the CLIENT AWS VPC and the PARTNER On-Prem network were using the exact same private IP address range: 10.0.0.0/22.

Imagine telling a mail carrier to deliver a package to "123 Main Street" in a town where two different houses share that exact address. The carrier has no idea which house you mean, so the package goes nowhere.

In networking, when a router sees a packet destined for an IP address that exists on both sides of a connection, it defaults to the local route. This is a common issue in VPC peering and VPN setups that can completely halt communication.
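You can see the conflict in one line with Python's standard ipaddress module; a quick check like this before wiring up any cross-network link saves a lot of pain:

```python
import ipaddress

client_vpc = ipaddress.ip_network("10.0.0.0/22")   # CLIENT AWS VPC
partner_lan = ipaddress.ip_network("10.0.0.0/22")  # PARTNER On-Prem network

# True -> any packet destined for the partner also matches the VPC's own range,
# and the VPC's local route wins, so the traffic never enters the tunnel.
print(client_vpc.overlaps(partner_lan))
```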
A simple fix that wasn't
The PARTNER On-Prem team requested that the CLIENT AWS side change its CIDR range. The initial thought was straightforward: introduce a new, non-overlapping secondary CIDR block, 10.120.0.0/23, to the client's existing Virtual Private Cloud (VPC). The plan was to route all traffic destined for the partner through this new, unique address space.
We spun up a new subnet in this CIDR and set up a NAT (Network Address Translation) Gateway. A NAT Gateway acts like a receptionist for your network; it takes outgoing requests from the original IP range and makes them appear to come from the new IP range.
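Roughly, the AWS side of that workaround can be sketched with boto3 as below. All IDs are placeholders, a private NAT gateway keeps the traffic off the public internet, and the destination in the final route is deliberately a stand-in, because which partner addresses you can actually route to is precisely the crux of the overlap problem.

```python
import boto3

ec2 = boto3.client("ec2")
VPC_ID = "vpc-0123456789abcdef0"  # placeholder

# 1. Attach the non-overlapping secondary CIDR to the existing VPC.
ec2.associate_vpc_cidr_block(VpcId=VPC_ID, CidrBlock="10.120.0.0/23")

# 2. Carve a partner-facing subnet out of the new range.
subnet = ec2.create_subnet(VpcId=VPC_ID, CidrBlock="10.120.0.0/24")["Subnet"]

# 3. A *private* NAT gateway in that subnet translates 10.0.x.x sources
#    into 10.120.x.x addresses the partner can route back to.
natgw = ec2.create_nat_gateway(
    SubnetId=subnet["SubnetId"],
    ConnectivityType="private",
)["NatGateway"]

# 4. Steer partner-bound traffic from the app subnets through the NAT gateway.
#    The destination below is a placeholder: with both sides on 10.0.0.0/22,
#    deciding what "the partner's addresses" even are is the hard part.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",    # placeholder app route table
    DestinationCidrBlock="198.51.100.0/24",  # placeholder partner-side range
    NatGatewayId=natgw["NatGatewayId"],
)
```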
This worked for outbound traffic:
The CLIENT AWS app sends a request from its original 10.0.x.x address.
The NAT Gateway receives it and translates the source IP to a 10.120.x.x address.
The PARTNER On-Prem receives the request from 10.120.x.x (which it recognizes) and responds.
We had outbound communication working.

The real puzzle with inbound traffic
Success was short-lived. When the partner's system tried to access our client's services, we discovered a new, more subtle problem: asymmetric routing.
The partner sent their request to the client's domain, which pointed to an Application Load Balancer (ALB) in the original 10.0.x.x subnet. The application received the request and sent a response. However, the response going back to the partner took a different path: our new routing rules forced it through the NAT Gateway, which changed the source IP to the 10.120.x.x range.
Here’s what the partner’s stateful firewall saw:
Step 1: Request sent: PARTNER On-Prem (10.0.1.200) → CLIENT AWS Load Balancer (10.0.2.50)
Step 2: Response received: CLIENT AWS (10.120.1.1 via NAT) → PARTNER On-Prem (10.0.1.200)
From the partner's perspective, this was a security threat:
"I sent a request TO 10.0.2.50."
"I got a response FROM 10.120.1.1."
"These don't match. This looks like an attack!"
Result: ❌ Connection dropped.
Their stateful firewall saw a response from an IP address it never sent a request to, flagged it as suspicious, and dropped the connection. It was like calling someone on the phone but having a complete stranger call you back with the answer.
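To make the firewall's view concrete, here is a toy connection-tracking check in Python. Real stateful firewalls also track ports, TCP state, and timeouts, but the core rule is the same: a reply must come from the address the request was sent to.

```python
# Toy connection tracker: illustrative only, not a real firewall.
outstanding = set()

def request_sent(src: str, dst: str) -> None:
    outstanding.add((src, dst))  # remember who we asked

def reply_allowed(reply_src: str, reply_dst: str) -> bool:
    # A reply is accepted only if it comes FROM the host we sent TO.
    return (reply_dst, reply_src) in outstanding

request_sent("10.0.1.200", "10.0.2.50")           # partner -> client ALB
print(reply_allowed("10.0.2.50", "10.0.1.200"))   # True: symmetric path
print(reply_allowed("10.120.1.1", "10.0.1.200"))  # False: the NAT'd reply gets dropped
```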
Other solutions we tried
Before finding the right fix, we explored other common approaches.
Attempt #1: The "just change everything" idea
The most "obvious" solution is to change the client's primary VPC CIDR to something unique. In a brand-new environment, this is the correct approach. But for a live, production system, this is a non-starter. It would mean re-configuring every single resource servers, databases, load balancers, security groups and would require massive downtime and carry an enormous risk. It was pragmatically impossible.
Attempt #2: The Network Load Balancer (NLB) approach
Our next idea was to place a Network Load Balancer (NLB) in front of our existing Application Load Balancer (ALB). The architecture was:
An NLB sits in the new 10.120.x.x subnet.
The NLB forwards all traffic to the existing ALB, which is configured as its target group.
The ALB uses IP-based rules: "If traffic comes from the NLB's IP range, route it to the backend services."
This seemed promising because it would solve the asymmetric routing problem. The request would come in via the NLB and the response would flow back through the exact same path. We manually configured these IP-based routing rules on the ALB, and it worked perfectly.
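The manual piece was an extra listener rule on the existing ALB that matched on source IP. With boto3 it looks roughly like this; both ARNs below are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Manually added listener rule: "if the request arrives from the partner-facing
# 10.120.x.x range, forward it to the backend target group." This is the rule
# the ingress controller later wiped out.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                "listener/app/client-alb/50dc6c495c0c9188/f2f7dc8efc522ab2",  # placeholder
    Priority=10,
    Conditions=[{
        "Field": "source-ip",
        "SourceIpConfig": {"Values": ["10.120.0.0/23"]},
    }],
    Actions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                          "targetgroup/backend-services/73e2d6bc24d8a067",  # placeholder
    }],
)
```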

The automation problem
But then disaster struck. The client's infrastructure uses a Kubernetes ALB Ingress Controller that automatically manages ALB rules based on ingress configurations. When developers deployed the next day, this automation did exactly what it was designed to do:
Scanned the ALB configuration.
Found our manual IP-based rules, which it didn't recognize as part of its managed state.
Reset the ALB to its expected configuration, deleting our custom rules.
Broke the connection again.
We tried adding the IP-based rules directly to the Kubernetes ingress configuration, but the ALB Ingress Controller primarily supports path-based (/api/users) or host-based (api.example.com) routing, not the IP-based rules we needed for this workaround.
The final solution: building a parallel system
Fighting automation is a losing battle, and it's a core DevOps principle to work with it, not against it. Instead of forcing a manual fix, the better move is to find a solution the automation can understand and maintain.
So, we pivoted. Instead of a complex manual fix, we spun up a new, dedicated Application Load Balancer (ALB) and placed it inside the new 10.120.x.x subnet.
What we built

Dedicated ALB in New Subnet: We created a new ALB specifically for partner traffic, placed in the 10.120.x.x subnet.
Partner-Specific Domain: We set up partner-api.client.com, which resolves to the new ALB.
Kubernetes-Managed Configuration: We updated the ingress rules to manage this new ALB (a sketch follows this list), so the automation understood and maintained the entire setup.
Clean Traffic Flow: All communication with the PARTNER On-Prem, both request and response, now happens exclusively through the 10.120.x.x range.
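The exact manifest depends on the controller version and the client's conventions, but a minimal sketch of that partner ingress, written with the official Kubernetes Python client, might look like the following. The namespace, backend service name, group name, and subnet IDs are placeholders, and the annotations shown are the commonly used ALB ingress controller ones for provisioning a separate internal ALB pinned to specific subnets.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in the cluster

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(
        name="partner-api",
        namespace="default",  # placeholder namespace
        annotations={
            # Provision a *separate*, internal ALB just for partner traffic.
            "alb.ingress.kubernetes.io/group.name": "partner",  # assumed group name
            "alb.ingress.kubernetes.io/scheme": "internal",
            "alb.ingress.kubernetes.io/target-type": "ip",
            # Pin the ALB to the subnets carved out of 10.120.0.0/23.
            "alb.ingress.kubernetes.io/subnets": "subnet-0aaa,subnet-0bbb",  # placeholders
        },
    ),
    spec=client.V1IngressSpec(
        ingress_class_name="alb",
        rules=[client.V1IngressRule(
            host="partner-api.client.com",
            http=client.V1HTTPIngressRuleValue(paths=[client.V1HTTPIngressPath(
                path="/",
                path_type="Prefix",
                backend=client.V1IngressBackend(
                    service=client.V1IngressServiceBackend(
                        name="reports-api",  # hypothetical backend service
                        port=client.V1ServiceBackendPort(number=80),
                    ),
                ),
            )]),
        )],
    ),
)

client.NetworkingV1Api().create_namespaced_ingress(namespace="default", body=ingress)
```

Because everything lives in the ingress object, the controller now owns the partner ALB end to end; the next deploy reconciles it instead of deleting it.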
We realized we didn’t need to fix the existing system; we needed a parallel one. Instead of renovating a busy highway, we built a dedicated express lane for VIP traffic. Regular traffic continued uninterrupted, while partner traffic got a clean, reliable path.
The solution was elegant because it offered:
Zero downtime for existing users.
Complete isolation of partner traffic from IP conflicts.
Automation-friendliness, with no manual configurations to be deleted.
Scalability for future partners.
Technically, this worked because partner traffic now enters and exits through the same IP range (10.120.x.x), eliminating asymmetric routing, and because the entire solution is defined in Kubernetes ingress configurations that our automation understands and maintains.
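One last wiring detail: partner-api.client.com is simply a DNS alias for the dedicated ALB. If that record were managed with boto3 rather than by something like external-dns, it would look roughly like this; the hosted zone ID and ALB DNS name are placeholders:

```python
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0PLACEHOLDER",  # placeholder: hosted zone for client.com
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "partner-api.client.com",
                "Type": "A",
                "AliasTarget": {
                    # Placeholder values for the dedicated partner ALB.
                    "DNSName": "internal-partner-alb-123456.us-east-1.elb.amazonaws.com",
                    "HostedZoneId": "Z35SXDOTRQ7X7K",  # the ALB's canonical zone ID (region-specific)
                    "EvaluateTargetHealth": False,
                },
            },
        }],
    },
)
```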
Key takeaways
This was a powerful reminder of a few core engineering truths:
Look beyond the "simple" fix: The most obvious solution often solves only part of the problem. Always trace the complete data flow to uncover hidden complexities like asymmetric routing.
Automation is king, so don't fight it: When a manual fix conflicts with automation, the fix is wrong, not the automation. The best solution always works with your automated systems, not against them.
Sometimes the best fix is a new path: Instead of trying to modify a complex, live system, creating a parallel, isolated solution can be simpler, safer, and more maintainable.
Pragmatism wins: We could have architected a far more complex networking solution. But the goal wasn't theoretical perfection; it was to solve the client's business problem in a way that was reliable, secure, and maintainable for their team.
Networking puzzles like these can be frustrating, but they're also incredibly satisfying to solve. By understanding the complete system and working with existing automation, we built a solution that just works.
If you're facing your own networking knots or challenges with scaling your infrastructure, we'd love to help. Reach out to us at One2N; we enjoy a good puzzle.