Eliminating bot-tlenecks - spin machines from containers

SRE @One2N

Aug 19, 2024 | 5 min read

Learn how we improved our AI chatbot application using innovative solutions. Find out about the challenges we faced and the strategies we used to boost performance and scalability.

Reviewed and edited by Spandan Ghosh, Saurabh Hirani

Introduction

We recently embarked on an exciting project for one of our customers at One2N: building an AI chatbot application using the Pipecat-AI tool. The chatbot was designed to interact with users, providing both training and domain-specific information. For instance, the chatbot could train the sales team on details about medication for a particular therapy.

While the chatbot performed well in the development environment, it faced significant challenges during load testing when exposed to real-world scale. This post delves into the obstacles we encountered and the innovative solutions we explored to not only resolve our current issues but also lay the groundwork for future chatbots for the same customer.

Deployment Model - ECS

We used AWS ECS (Elastic Container Service) to host the application. Here’s a quick refresher on some ECS concepts that we will reference throughout this post:

  • ECS Task: Represents a unit of work running in the ECS cluster, typically a Docker container.

  • ECS Task Definition: Blueprint that specifies which Docker image to use, the amount of CPU and memory to allocate, environment variables, network settings, and other configuration details.

  • ECS Service: Responsible for launching the tasks defined by the task definition within the ECS cluster. Configured to automatically scale the number of tasks up or down based on demand.

  • ECS Cluster: Hosts the ECS tasks, each running a Docker container with the application.
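
To make these concepts concrete, here is a minimal sketch of how a task definition and a service map to API calls, using boto3. This is illustrative only and not taken from our actual setup; the cluster, service, image, and subnet names are placeholders.

    import boto3

    ecs = boto3.client("ecs")

    # Task definition: the blueprint (image, CPU/memory, port mappings, env vars).
    task_def = ecs.register_task_definition(
        family="chatbot",
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu="2048",     # 2 vCPU
        memory="4096",  # 4 GB
        containerDefinitions=[
            {
                "name": "chatbot",
                "image": "<account>.dkr.ecr.<region>.amazonaws.com/chatbot:latest",
                "portMappings": [{"containerPort": 7860}],
            }
        ],
    )

    # Service: keeps N copies of the task running in the cluster; this is
    # what auto scaling adjusts up or down.
    ecs.create_service(
        cluster="chatbot-cluster",
        serviceName="chatbot-service",
        taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
        desiredCount=2,
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-abc123"],
                "assignPublicIp": "ENABLED",
            }
        },
    )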

Deployment Diagram

Load Testing Observations

Since we were using a new framework, we wanted to subject it to load testing to understand its behavior under stress. During the load tests, we observed the following:

  1. With 2 vCPU and 4 GB of RAM, each task hit its resource limits after the creation of 10-15 chat rooms. CPU and memory usage spiked, causing the UI to become unresponsive due to delayed responses from the backend.

  2. New tasks were slow to start: the Docker image was large (around 6 GB), and it was built in a way that required downloading AI models from the internet at runtime.

  3. As a result, scaling up based on CloudWatch metrics lagged behind demand. Even when the scale-up was triggered on time, new containers took too long to serve their first request because of the slow startup.

To ensure low latency during both runtime and startup, we explored the option to overprovision capacity to provide sufficient headroom. However, the application's load was unpredictable, as it wasn't tied to any seasonal patterns. Additionally, since this was a new feature being launched for multiple teams eager to be trained by the agents, maintaining a good user experience was crucial.
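
For reference, the scale-out described in point 3 was driven by a standard ECS target-tracking policy on CloudWatch CPU metrics. A hedged sketch of that kind of policy looks like this; the cluster and service names, target value, and cooldowns are placeholders, not our exact settings.

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Register the ECS service's desired count as a scalable target.
    autoscaling.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId="service/chatbot-cluster/chatbot-service",
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=2,
        MaxCapacity=10,
    )

    # Scale out/in to keep average CPU around the target value.
    autoscaling.put_scaling_policy(
        PolicyName="chatbot-cpu-target-tracking",
        ServiceNamespace="ecs",
        ResourceId="service/chatbot-cluster/chatbot-service",
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 60.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 120,
        },
    )

Even when a policy like this fires on time, the large image and runtime model downloads meant a new task could take minutes before it was able to serve its first request.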

Root cause

Upon further investigation, we discovered that the following snippet was launching one subprocess per chat room:

# Inside the request handler: one bot subprocess is spawned per chat room.
proc = subprocess.Popen(
    [f"python3 -m bot -u {room.url} -t {token}"],
    shell=True,
    bufsize=1,
    cwd=os.path.dirname(os.path.abspath(__file__)),
)
# Each bot process is tracked by PID, alongside the room it serves.
bot_procs[proc.pid] = (proc, room.url)

The bottleneck was caused by spinning up a separate process for each chat room, which allocated a base amount of CPU and memory per process.

To address this issue, we needed to solve the following problems:

  1. Handling Multiple Parallel Chat Rooms: How can we manage multiple chat rooms concurrently without spinning up too many subprocesses?

  2. Controlled ECS Scaling: How can we ensure that the ECS service doesn’t scale out tasks in an unbounded manner, keeping costs under control?

This realization led us to question our development stack. However, we couldn’t leverage other modes of concurrency in Python because we needed to adhere to the approaches suggested by Pipecat-AI examples. Additionally, switching to a more performant language like Golang wasn't an option as this was going to be integrated into an existing Python codebase. As a result, we were constrained to finding a solution within our existing Python framework while minimizing OS-level forking.

Solution

We were exploring other chatbot solutions and came across this example, added by the good folks at fly.io:

https://github.com/pipecat-ai/pipecat/tree/main/examples/deployment/flyio-example

It was deployment-at-first-sight. The README opened with:

"This project modifies the bot_runner.py server to launch a new machine for each user session. This is a recommended approach for production vs. running shell processes as your deployment will quickly run out of system resources under load."

One machine per user session? Isn't that overkill?

We looked a little more at what fly.io had to offer.

Curiouser and curiouser. But would machines spin up in time, and what would we do with them when they sat idle?

Trying fly.io from the CLI

Before integrating their bot code, we decided to test Fly.io to see how it works. The steps were straightforward:

  1. Create an Account: Sign up on Fly.io.

  2. Authenticate via CLI: Run fly auth login to authenticate your terminal session.

  3. Generate API Tokens: To use the API, generate tokens with fly auth token.

  4. Create a Fly Configuration File: Set up a configuration file in TOML format. Here’s a sample:

app = 'pipecat-fly-example'
primary_region = 'sjc'

[build]

[env]
  FLY_APP_NAME = 'pipecat-fly-example'

[http_service]
  internal_port = 7860
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[vm]]
  memory = 512
  cpu_kind = 'shared'
  cpus = 1

  5. Deploy the App:

    • If you’re in the same directory as your target Dockerfile, simply run fly launch.

    • Fly.io handles authentication and authorization, builds the image, creates a container, and provides a URL.

      Visit your newly deployed app at: https://testurl.fly.dev/.

While this process was simple enough from the CLI, we then asked ourselves: how can we integrate this into a Python workflow?

Understanding fly.io from the Bot Runner Code

The CLI commands mentioned earlier have corresponding REST API endpoints, allowing you to perform the same actions programmatically. Here's how:

  1. Use Fly.io's REST API: Spawn a new machine using the appropriate REST API endpoints.

  2. Obtain the Application URL: Use the URL returned by Fly.io as your application's endpoint.

  3. Interact with Your Application: Make REST API calls against the application URL.

  4. Set Environment Variables:

    FLY_API_HOST=<your_fly_api_host>
    FLY_API_KEY=<your_fly_api_key>
    FLY_APP_NAME=<your_fly_app_name>

  5. Set Fly.io Headers:

    FLY_HEADERS = {
        'Authorization': f"Bearer {FLY_API_KEY}",
        'Content-Type': 'application/json'
    }

  6. Spawn a New Fly.io Machine:

    session.post(f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines", headers=FLY_HEADERS, json=worker_props)

  7. Connect to Your Bot:

    curl --location --request POST 'https://YOUR_FLY_APP_NAME/start_bot'

Typically, one would assume that the bot code is running inside a container. However, in this case, the code to create the machine is also part of the bot code. This approach requires understanding an additional external API, but it also shifts control over the compute environment to the developer.
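
Putting steps 4 through 6 together, here is a minimal, hedged sketch of the spawn-a-machine flow: create a Fly machine for a new room via the Machines REST API, then wait for it to reach the started state before handing the room back to the client. The image reference, environment variable names, and the spawn_bot_machine helper are illustrative placeholders rather than the exact code from the Pipecat example.

    import os
    import requests

    FLY_API_HOST = os.environ["FLY_API_HOST"]   # e.g. https://api.machines.dev/v1
    FLY_API_KEY = os.environ["FLY_API_KEY"]
    FLY_APP_NAME = os.environ["FLY_APP_NAME"]

    FLY_HEADERS = {
        "Authorization": f"Bearer {FLY_API_KEY}",
        "Content-Type": "application/json",
    }

    def spawn_bot_machine(room_url: str, token: str) -> str:
        """Create one Fly machine per chat room and return its machine id."""
        # Machine config: which image to run, env vars for the bot process,
        # and the VM size (mirrors the [[vm]] section of fly.toml).
        worker_props = {
            "config": {
                "image": f"registry.fly.io/{FLY_APP_NAME}:latest",  # illustrative image ref
                "env": {"ROOM_URL": room_url, "ROOM_TOKEN": token},
                "guest": {"cpu_kind": "shared", "cpus": 1, "memory_mb": 512},
                "auto_destroy": True,  # remove the machine once the bot process exits
            }
        }

        session = requests.Session()

        # Spawn a new machine for this room.
        resp = session.post(
            f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines",
            headers=FLY_HEADERS,
            json=worker_props,
        )
        resp.raise_for_status()
        machine_id = resp.json()["id"]

        # Block until the machine reports the "started" state.
        resp = session.get(
            f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines/{machine_id}/wait?state=started",
            headers=FLY_HEADERS,
        )
        resp.raise_for_status()

        return machine_id

The /start_bot request from step 7 would then invoke a helper like this and hand the room details back to the client.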

So did it work?

Yes. We observed the following improvements:

  1. Faster Machine Spin-Up Time: Spin-up time improved from 200-300 seconds to under 100 seconds, and machines for each additional room were created in under 10 seconds.

  2. Predictable Scaling: With each room running as a separate machine, we no longer had to worry about hitting a single container's resource limits.

Was It a Silver Bullet?

Nothing ever is. Offloading compute to an API incurs additional costs beyond what you pay for ECS. The key question is: Is the cost justifiable?

For us, it was manageable. We spent around $20 last month for our scale.

However, we are actively exploring ways to keep costs in check in case of an unexpected surge. We're also investigating options to receive advance notifications if we approach our plan limits, ensuring we can manage any potential overcharges effectively. Here’s the current pricing for fly.io.
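
One simple guardrail we are considering is a periodic check on how many machines the app is actually running, with an alert when the count drifts past what we have budgeted for. The sketch below is illustrative; the threshold and the alerting hook are placeholders, not Fly.io plan limits.

    import os
    import requests

    FLY_API_HOST = os.environ["FLY_API_HOST"]
    FLY_APP_NAME = os.environ["FLY_APP_NAME"]
    FLY_HEADERS = {"Authorization": f"Bearer {os.environ['FLY_API_KEY']}"}

    MAX_BUDGETED_MACHINES = 25  # illustrative budget, not a Fly.io plan limit

    def check_machine_budget() -> None:
        # List every machine (running or stopped) for the app.
        resp = requests.get(
            f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines",
            headers=FLY_HEADERS,
        )
        resp.raise_for_status()

        running = [m for m in resp.json() if m.get("state") == "started"]
        if len(running) > MAX_BUDGETED_MACHINES:
            # Wire this into whatever alerting is already in place (Slack, PagerDuty, ...).
            print(f"WARNING: {len(running)} machines running, budget is {MAX_BUDGETED_MACHINES}")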

Key Takeaways

  1. Use the Right Tool for the Job: Sometimes this means considering tools outside your current toolchain. Don’t be afraid to explore new options that better fit the task at hand.

  2. Conduct Load Testing: Load testing in preview environments allowed us to catch potential scaling issues early, giving us the necessary time to implement solutions before going live.

  3. Understand Tradeoffs: Every solution involves tradeoffs. While we offloaded compute to an external service, it’s crucial to fully understand the implications—such as imposed limits, when they might be hit, and how to respond when they are.

  4. Continuously Monitor and Optimize: Ensure that you're continually monitoring your application's performance and costs, and be proactive in optimizing both as your application scales.

By leveraging Fly.io, we were able to address the performance issues and ensure a more scalable and responsive chatbot application.

Are you facing similar challenges with your AI chatbot or LLM applications? Do you need expert help in optimizing your deployment on Fly.io or building customized conversational chatbots? Reach out to us at One2N for engineering assistance tailored to your needs. Our team of experienced engineers is here to help you navigate the complexities of LLMops and ensure your applications run smoothly and efficiently.
