ABC of LLMOps - What does it take to run self-hosted LLMs | Rootconf mini 2024

Feb 5, 2025

Running Self-Hosted LLMs in Production: An SRE’s Experiment

Lessons from deploying open-source models on Kubernetes GPUs, managing vector DBs & building RAG apps

🧠 Why We Did This

Most companies today rely on OpenAI APIs for GenAI workflows, but what happens when you need control over data privacy, costs, or custom models? As SREs managing backend systems at scale, we wanted answers to:

  • “How do you run LLMOps pipelines without depending on OpenAI?”

  • “What infrastructure gaps emerge when moving from prototypes to production?”

  • “Is owning GPU hardware better than cloud for steady-state workloads?”

This talk documents our 6-month experiment to learn LLMOps from first principles while building internal tools like a resume-filtering RAG app.

The Experiment

Phase 1: Learning First Principles

  • Models: Started with lightweight models (Phi3) → progressed to Llama3 & Mistral for complex tasks.

  • Toolchain: Tested LangChain → hit limitations → migrated core logic to LlamaIndex for production needs.

  • Vector DBs: Ran Qdrant locally → stress-tested embedding storage/retrieval latency at scale.
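At its core, what a vector DB like Qdrant does for a RAG app is nearest-neighbor search over embeddings. A minimal sketch of that idea, using toy 3-dimensional vectors in place of real model embeddings (the document names and values below are made up for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, store, k=2):
    # Rank stored (doc_id, vector) pairs by similarity to the query vector.
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

# Toy "embeddings" standing in for real embedding-model output.
store = {
    "resume_a": [0.9, 0.1, 0.0],
    "resume_b": [0.1, 0.9, 0.0],
    "resume_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, store))  # resume_a and resume_c rank highest
```

A production store adds the parts we stress-tested: approximate-nearest-neighbor indexes, filtering, and persistence, which is exactly where retrieval latency at scale becomes interesting.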

Phase 2: Building Infrastructure Muscle

  • GPUs on Kubernetes: Deployed Ray/KubeRay clusters → optimized GPU utilization vs cost tradeoffs.

  • Observability: Added metrics for prompt latency, token usage & DB query performance early (critical!).
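The observability point above can be sketched with a tiny in-memory metrics layer; in production these numbers would feed Prometheus or a similar backend, and `fake_llm_call` below is a hypothetical stand-in for a real model call:

```python
import time
from collections import defaultdict

# In-memory metric store; a real setup would export these to Prometheus.
metrics = defaultdict(list)

def observe(metric_name):
    # Decorator that records wall-clock latency of the wrapped call.
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            metrics[metric_name].append(time.perf_counter() - start)
            return result
        return inner
    return wrap

@observe("prompt_latency_seconds")
def fake_llm_call(prompt):
    # Stand-in for a model call; returns a naive token count as its "result".
    return len(prompt.split())

tokens = fake_llm_call("filter resumes mentioning kubernetes")
metrics["tokens_used"].append(tokens)
print(len(metrics["prompt_latency_seconds"]), tokens)
```

Wiring this in from day one is the "early (critical!)" part: prompt latency, token usage, and DB query timings are the first signals you need when something regresses.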

Phase 3: Shipping Real Workflows

  • Built a resume-filtering RAG app (dogfooded internally).

  • Lessons learned: Prompt engineering ≠ one-time effort; versioning embeddings matters; cold starts hurt UX.
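"Versioning embeddings matters" because vectors produced by different embedding models are not comparable: after a model upgrade, old vectors must be re-embedded, not searched alongside new ones. One way to make that visible is to store the producing model's name next to each vector (model names and entries below are illustrative, not from the talk):

```python
# Each stored vector carries the embedding model that produced it, so a model
# upgrade can trigger selective re-embedding instead of silent mixed-version search.
CURRENT_MODEL = "embed-model-v2"  # hypothetical current embedding model

vector_store = [
    {"id": "resume_a", "model": "embed-model-v2", "vector": [0.9, 0.1]},
    {"id": "resume_b", "model": "embed-model-v1", "vector": [0.1, 0.9]},
]

def stale_entries(entries, current_model):
    # Vectors from a different model are incomparable with freshly embedded queries.
    return [e["id"] for e in entries if e["model"] != current_model]

print(stale_entries(vector_store, CURRENT_MODEL))  # entries needing re-embedding
```

The same tagging idea extends to prompts: versioning prompt templates alongside embeddings makes "prompt engineering is not a one-time effort" operationally tractable.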

Key Takeaways

  1. Start Small but Think Production
    Going from toy apps → internal tools → customer-facing pipelines requires rethinking infra at each step (e.g., scaling vector DBs).

  2. Own Your Stack If…

    • Compliance/data privacy is non-negotiable

    • Steady-state inference demand justifies GPU capex

  3. Avoid Framework Lock-In
    LangChain is great for prototyping, but frameworks like LlamaIndex offer better control for SREs managing uptime/SLAs.
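One practical way to avoid lock-in, whichever framework wins: depend on a thin interface of your own rather than on framework classes directly. A sketch (the `Retriever` protocol and `KeywordRetriever` stand-in below are illustrative, not from the talk; a LangChain, LlamaIndex, or raw vector-DB adapter would each implement the same surface):

```python
from typing import List, Protocol

class Retriever(Protocol):
    # The only surface the app depends on; any framework can sit behind it.
    def retrieve(self, query: str, k: int) -> List[str]: ...

class KeywordRetriever:
    # Trivial stand-in implementation for illustration.
    def __init__(self, docs):
        self.docs = docs

    def retrieve(self, query, k=3):
        hits = [d for d in self.docs if query.lower() in d.lower()]
        return hits[:k]

def answer(question: str, retriever: Retriever) -> str:
    context = retriever.retrieve(question, k=2)
    # A real pipeline would pass context + question to the model here.
    return f"{len(context)} context docs retrieved"

docs = ["Kubernetes GPU scheduling notes", "Qdrant tuning guide"]
print(answer("gpu", KeywordRetriever(docs)))
```

Swapping frameworks then means writing one new adapter, not rewriting the application, which is the kind of control an SRE on the hook for uptime actually wants.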

Who Should Care?

This talk isn’t about AI theory; it’s a playbook for engineers tasked with operationalizing LLMs:

  • SREs/DevOps teams planning GPU clusters or hybrid cloud AI infra

  • Engineers struggling with OpenAI API costs/limitations

  • Teams building RAG apps that need vector DB + model tuning expertise

Continue watching


Kubernetes for Hybrid Cloud Environments - Harshwardhan Mehrotra - #60 Kubernetes Pune Meetup

Harshwardhan Mehrotra

SRE @One2N

Scaling Kubernetes workloads across 40+ data centers is hard – especially when you must meet strict data residency and latency requirements while still leveraging the public cloud. In this talk from the Kubernetes Pune Meetup, Harshwardhan Mehrotra (Site Reliability Engineer at One2N) walks through how his team designed and operated an EKS hybrid setup for a large betting platform with data centers across the US, UK, and Europe. You’ll see how they connected on‑prem worker nodes to an AWS EKS control plane, handled networking at scale, and kept the developer experience close to a “normal” EKS cluster.

What you’ll learn:

  • Why EKS hybrid was chosen over fully on‑prem or fully cloud, and how regulatory and latency constraints shaped the architecture.

  • How to design pod vs node networking, routable vs non‑routable pod networks, and when to bring in BGP.

  • How to connect 40+ data centers to AWS using Direct Connect / site‑to‑site VPN and Cilium/Calico CNIs.

  • How to expose apps using F5 and Istio/NGINX ingress when ALB is not an option.

  • Real‑world issues with DNS (CoreDNS, Route 53 limits, node‑local DNS) and traffic distribution, plus how they fixed them.

  • Lessons on egress control, firewall bottlenecks, add‑on placement (Argo CD, KEDA, Prometheus, etc.), and building repeatable playbooks for on‑prem nodes.

This talk is ideal for platform engineers, SREs, and architects running Kubernetes across data centers and cloud, or evaluating EKS hybrid for regulated workloads.
