Fly.io

Aug 19, 2024 | 5 min read

Eliminating bot-tlenecks - spin machines from containers

SRE @One2N

Learn how we improved our AI chatbot application using innovative solutions. Find out about the challenges we faced and the strategies we used to boost performance and scalability.

Reviewed and edited by Spandan Ghosh, Saurabh Hirani

Introduction

We recently embarked on an exciting project for one of our customers at One2N: building an AI chatbot application using the Pipecat-AI tool. The chatbot was designed to interact with users, providing both training and domain-specific information. For instance, the chatbot could train the sales team on details about medication for a particular therapy.

While the chatbot performed well in the development environment, it faced significant challenges during load testing when exposed to real-world scale. This post delves into the obstacles we encountered and the innovative solutions we explored to not only resolve our current issues but also lay the groundwork for future chatbots for the same customer.

Deployment Model - ECS

We used AWS ECS (Elastic Container Service) to host the application. Here’s a quick refresher on some ECS concepts that we will reference throughout this post:

  • ECS Task: Represents a unit of work running in the ECS cluster, typically a Docker container.

  • ECS Task Definition: Blueprint that specifies which Docker image to use, the amount of CPU and memory to allocate, environment variables, network settings, and other configuration details.

  • ECS Service: Responsible for launching the tasks defined by the task definition within the ECS cluster. Configured to automatically scale the number of tasks up or down based on demand.

  • ECS Cluster: Hosts the ECS tasks, each running a Docker container with the application.
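To make the task definition concept concrete, here is a minimal sketch of the kind of blueprint described above, sized at the 2 vCPU / 4 GB used in our load tests. The family name, container name, image URL, port, and the choice of the Fargate launch type are illustrative assumptions, not our actual configuration. The dictionary follows the shape accepted by ECS's RegisterTaskDefinition call.

```python
import json


def build_task_definition(image: str, vcpu: int, memory_gb: int) -> dict:
    """Build a dict in the shape ECS's RegisterTaskDefinition expects."""
    return {
        "family": "chatbot-bot-runner",          # hypothetical family name
        "requiresCompatibilities": ["FARGATE"],  # assumption: Fargate launch type
        "networkMode": "awsvpc",
        "cpu": str(vcpu * 1024),          # ECS CPU units: 1024 units = 1 vCPU
        "memory": str(memory_gb * 1024),  # task memory in MiB
        "containerDefinitions": [
            {
                "name": "bot-runner",     # hypothetical container name
                "image": image,
                "essential": True,
                # Assumed port; replace with whatever the app listens on.
                "portMappings": [{"containerPort": 7860, "protocol": "tcp"}],
            }
        ],
    }


task_def = build_task_definition("example.com/chatbot:latest", vcpu=2, memory_gb=4)
print(json.dumps(task_def, indent=2))
```

With boto3 installed and AWS credentials configured, a dictionary like this could be passed as keyword arguments to the ECS client's register_task_definition method.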

Deployment Diagram

Load Testing Observations

Since we were using a new framework, we wanted to load test it to understand its behavior under stress. During the load tests, we observed the following:

  1. With 2 vCPU and 4 GB of RAM, each task hit its resource limits after the creation of 10-15 chat rooms. CPU and memory usage spiked, causing the UI to become unresponsive due to delayed responses from the backend.

  2. New tasks were slow to start: the Docker image was large (around 6 GB), and the application also downloaded AI models from the internet at runtime.

  3. As a result, scaling up based on CloudWatch metrics was ineffective. Even though the scale-up was triggered on time, new containers took too long to start serving requests.
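The figures above allow a rough back-of-the-envelope estimate of per-room cost and of how many tasks a given load would need. The sketch below is only arithmetic on the numbers from our observations; the target of 100 concurrent rooms is a hypothetical value chosen for illustration, and the per-room memory figure is an upper bound, since CPU may have been the limiting resource.

```python
import math

TASK_MEMORY_MB = 4 * 1024        # 4 GB per ECS task, from the load test
ROOMS_PER_TASK_RANGE = (10, 15)  # rooms a task handled before saturating


def memory_per_room_mb(task_memory_mb: float, rooms: int) -> float:
    """Upper bound on memory available to each room when the task saturates."""
    return task_memory_mb / rooms


def tasks_needed(concurrent_rooms: int, rooms_per_task: int) -> int:
    """Number of tasks required to host the given number of rooms."""
    return math.ceil(concurrent_rooms / rooms_per_task)


TARGET_ROOMS = 100  # hypothetical load, for illustration only

for rooms in ROOMS_PER_TASK_RANGE:
    per_room = memory_per_room_mb(TASK_MEMORY_MB, rooms)
    needed = tasks_needed(TARGET_ROOMS, rooms)
    print(f"{rooms} rooms/task: ~{per_room:.0f} MB per room, "
          f"{needed} tasks for {TARGET_ROOMS} rooms")
```

Every one of those tasks would also pay the slow startup described in point 2, which is why reactive scaling alone did not help.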

To ensure low latency during both runtime and startup, we explored the option to overprovision capacity to provide sufficient headroom. However, the application's load was unpredictable, as it wasn't tied to any seasonal patterns. Additionally, since this was a new feature being launched for multiple teams eager to be trained by the agents, maintaining a good user experience was crucial.

Root cause

Upon further investigation, we discovered that the following snippet was launching one subprocess per chat room:

        proc = subprocess.Popen(
            [
                f"python3 -m bot -u {room.url} -t {token}"
            ],
            shell=True,
            bufsize=1,
            cwd=os.path.dirname(os.path.abspath(__file__))
        )
        bot_procs[proc.pid] = (proc, room.url)

The bottleneck was caused by spinning up a separate process for each chat room. Every subprocess is a full Python interpreter with its own baseline CPU and memory footprint, so 10-15 rooms were enough to exhaust a 2 vCPU / 4 GB task.

To address this issue, we needed to solve the following problems:

  1. Handling Multiple Parallel Chat Rooms: How can we manage multiple chat rooms concurrently without spinning up too many subprocesses?

  2. Controlled ECS Scaling: How can we ensure that the ECS service doesn’t scale out tasks in an unbounded manner, keeping costs under control?
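To illustrate the first problem, the sketch below caps the number of bot subprocesses a single task will run and rejects new rooms once the cap is reached. This is a hypothetical guard, not code from our application or from Pipecat, and it is not the approach we ended up taking. The class name and the stand-in command (a Python process that just sleeps) are assumptions made so the example runs anywhere.

```python
import subprocess
import sys
from typing import Optional


class BotProcessPool:
    """Tracks bot subprocesses and refuses to exceed a fixed limit."""

    def __init__(self, max_procs: int) -> None:
        self.max_procs = max_procs
        self._procs = {}  # pid -> (Popen object, room URL)

    def _reap(self) -> None:
        # Forget processes that have already exited.
        for pid, (proc, _room) in list(self._procs.items()):
            if proc.poll() is not None:
                del self._procs[pid]

    def active(self) -> int:
        self._reap()
        return len(self._procs)

    def try_start(self, args: list, room_url: str) -> Optional[int]:
        """Start a bot for the room, or return None if the task is full."""
        if self.active() >= self.max_procs:
            return None
        # An argument list (no shell=True) avoids shell-quoting problems.
        proc = subprocess.Popen(args)
        self._procs[proc.pid] = (proc, room_url)
        return proc.pid

    def shutdown(self) -> None:
        for proc, _room in self._procs.values():
            proc.terminate()
            proc.wait()
        self._procs.clear()


pool = BotProcessPool(max_procs=2)
stand_in = [sys.executable, "-c", "import time; time.sleep(30)"]  # stand-in bot
started = [pool.try_start(stand_in, f"room-{i}") for i in range(3)]
accepted = [pid for pid in started if pid is not None]
print(len(accepted), "accepted,", started.count(None), "rejected")
pool.shutdown()
```

A cap like this protects an individual task, but rejected rooms still need somewhere to go, which is exactly the scaling problem in the second question.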

This realization led us to question our development stack. However, we couldn’t leverage other modes of concurrency in Python because we needed to adhere to the approaches suggested by Pipecat-AI examples. Additionally, switching to a more performant language like Golang wasn't an option as this was going to be integrated into an existing Python codebase. As a result, we were constrained to finding a solution within our existing Python framework while minimizing OS-level forking.

Solution

While exploring other chatbot solutions, we came across this example contributed by the good folks at Fly.io:

https://github.com/pipecat-ai/pipecat/tree/main/examples/deployment/flyio-example

It was deployment-at-first-sight. The README opens with:

"This project modifies the bot_runner.py server to launch a new machine for each user session. This is a recommended approach for production vs. running shell processes as your deployment will quickly run out of system resources under load."

One machine per user session? Isn't that overkill?

We looked a little more closely at what Fly.io had to offer.

Curiouser and curiouser. But would the machines spin up in time, and what would happen to them when they were idle?

Trying fly.io from the CLI

Before integrating their bot code, we decided to test Fly.io to see how it works. The steps were straightforward:

  1. Create an Account: Sign up on Fly.io.

  2. Authenticate via CLI: Use fly auth login to authenticate (fly auth signup creates a new account from the CLI instead).

  3. Generate API Tokens: To use the API, generate tokens with fly auth token.

  4. Create a Fly Configuration File: Set up a configuration file in TOML format. Here’s a sample:

app = 'pipecat-fly-example'
primary_region = 'sjc'

[build]

[env]
  FLY_APP_NAME = 'pipecat-fly-example'

[http_service]
  internal_port = 7860
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[vm]]
  memory = 512
  cpu_kind = 'shared'
  cpus = 1

  5. Deploy the App:

    • If you’re in the same directory as your target Dockerfile, simply run fly launch.

    • Fly.io handles authentication and authorization, builds the image, starts a machine from it, and provides a URL:

      Visit your newly deployed app at: https://testurl.fly.dev/.

While this process was simple enough from the CLI, we then asked ourselves: how can we integrate this into a Python workflow?

Understanding fly.io from the Bot Runner Code

The commands mentioned earlier have corresponding REST API endpoint equivalents, allowing you to perform similar actions programmatically. Here's how you can do it:

  1. Use Fly.io's REST API: Spawn a new machine using the appropriate REST API endpoints.

  2. Obtain the Application URL: Use the URL returned by Fly.io as your application's endpoint.

  3. Interact with Your Application: Make REST API calls against the application URL.

  4. Set Environment Variables:

    FLY_API_HOST=<your_fly_api_host>
    FLY_API_KEY=<your_fly_api_key>
    FLY_APP_NAME=<your_fly_app_name>

  5. Set Fly.io Headers:

    FLY_HEADERS = {
        'Authorization': f"Bearer {FLY_API_KEY}",
        'Content-Type': 'application/json'
    }

  6. Spawn a New Fly.io Machine:

    session.post(f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines", headers=FLY_HEADERS, json=worker_props)

  7. Connect to Your Bot:

    curl --location --request POST 'https://YOUR_FLY_APP_NAME/start_bot'
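The steps above leave worker_props undefined, so here is a minimal sketch of how the request for step 6 might be assembled. It only builds the URL, headers, and JSON body and does not send anything. The default API host, the image name, the guest sizing, and the exact set of config fields are assumptions based on our reading of the Fly Machines API and the Pipecat example; check them against the current API reference before relying on them.

```python
import json

# Assumed default; in practice this comes from the FLY_API_HOST variable.
DEFAULT_API_HOST = "https://api.machines.dev/v1"


def build_machine_request(api_host, app_name, api_key, image, room_url, token):
    """Return (url, headers, body) for creating one machine per chat room."""
    url = f"{api_host}/apps/{app_name}/machines"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    worker_props = {
        "config": {
            "image": image,
            # Assumed: remove the machine once the bot process exits.
            "auto_destroy": True,
            # Same command the subprocess version ran, as an argument list.
            "init": {"cmd": ["python3", "-m", "bot", "-u", room_url, "-t", token]},
            "restart": {"policy": "no"},
            # Illustrative sizing, mirroring the sample fly.toml.
            "guest": {"cpu_kind": "shared", "cpus": 1, "memory_mb": 512},
        }
    }
    return url, headers, worker_props


url, headers, body = build_machine_request(
    DEFAULT_API_HOST,
    "pipecat-fly-example",
    "example-api-key",                        # placeholder, not a real key
    "registry.example.com/pipecat-bot:latest",  # hypothetical image name
    "https://example.com/room",               # hypothetical room URL
    "example-room-token",
)
print(url)
print(json.dumps(body, indent=2))
```

The returned values map directly onto the call in step 6: session.post(url, headers=headers, json=body). Passing the command as a list also sidesteps the shell quoting that the original subprocess call depended on.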

Typically, one would assume that the bot code simply runs inside a container that the platform provisions. In this case, however, the code that creates the machine is itself part of the bot runner. This approach requires understanding an additional external API, but it also shifts control over the compute environment to the developer.

So did it work?

Yes. We observed the following improvements:

  1. Faster Machine Spin-Up Time: Startup time improved from 200-300 seconds to under 100 seconds, and the machine for each additional room was created in under 10 seconds.

  2. Predictable Scaling: With each room running as a separate machine, we no longer had to worry about the current container's resource limitations.

Was It a Silver Bullet?

Nothing ever is. Offloading compute to an API incurs additional costs beyond what you pay for ECS. The key question is: Is the cost justifiable?

For us, it was manageable. We spent around $20 last month for our scale.

However, we are actively exploring ways to keep costs in check in case of an unexpected surge. We're also investigating options to receive advance notifications if we approach our plan limits, ensuring we can manage any potential overcharges effectively. Here’s the current pricing for fly.io.
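As a rough illustration of the kind of early warning we have in mind, the sketch below projects month-end spend from usage so far and flags when the projection crosses a fraction of the budget. Everything in it is hypothetical: the per-second rate is a placeholder rather than Fly.io's actual price, and the usage figures, budget, and threshold are made up for the example.

```python
def projected_monthly_cost(machine_seconds, price_per_second,
                           day_of_month, days_in_month=30):
    """Linearly extrapolate spend so far to the end of the month."""
    spent_so_far = machine_seconds * price_per_second
    return spent_so_far / day_of_month * days_in_month


def should_alert(projected_cost, budget, threshold=0.8):
    """Alert once projected spend reaches the given fraction of the budget."""
    return projected_cost >= budget * threshold


PLACEHOLDER_RATE = 0.000001  # USD per machine-second; NOT a real Fly.io price
BUDGET_USD = 20.0            # hypothetical monthly budget

normal = projected_monthly_cost(5_000_000, PLACEHOLDER_RATE, day_of_month=10)
surge = projected_monthly_cost(9_000_000, PLACEHOLDER_RATE, day_of_month=10)

print(f"normal month: ${normal:.2f}, alert={should_alert(normal, BUDGET_USD)}")
print(f"surge month:  ${surge:.2f}, alert={should_alert(surge, BUDGET_USD)}")
```

A linear projection is crude, since our load is not tied to any predictable pattern, but it is enough to turn a surprise bill into a mid-month notification.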

Key Takeaways

  1. Use the Right Tool for the Job: Sometimes this means considering tools outside your current toolchain. Don’t be afraid to explore new options that better fit the task at hand.

  2. Conduct Load Testing: Load testing in preview environments allowed us to catch potential scaling issues early, giving us the necessary time to implement solutions before going live.

  3. Understand Tradeoffs: Every solution involves tradeoffs. While we offloaded compute to an external service, it’s crucial to fully understand the implications—such as imposed limits, when they might be hit, and how to respond when they are.

  4. Continuously Monitor and Optimize: Ensure that you're continually monitoring your application's performance and costs, and be proactive in optimizing both as your application scales.

By leveraging Fly.io, we were able to address the performance issues and deliver a more scalable and responsive chatbot application.

Are you facing similar challenges with your AI chatbot or LLM applications? Do you need expert help in optimizing your deployment on Fly.io or building customized conversational chatbots? Reach out to us at One2N for engineering assistance tailored to your needs. Our team of experienced engineers is here to help you navigate the complexities of LLMOps and ensure your applications run smoothly and efficiently.
