
ABC of LLMOps - What does it take to run self-hosted LLMs? | Rootconf mini 2024
Feb 5, 2025
Running Self-Hosted LLMs in Production: An SRE’s Experiment
Lessons from deploying open-source models on GPU-backed Kubernetes clusters, managing vector DBs & building RAG apps
🧠 Why We Did This
Most companies today rely on OpenAI APIs for GenAI workflows, but what happens when you need control over data privacy, costs, or custom models? As SREs managing backend systems at scale, we wanted answers to:
“How do you run LLMOps pipelines without depending on OpenAI?”
“What infrastructure gaps emerge when moving from prototypes to production?”
“Is owning GPU hardware better than cloud for steady-state workloads?”
This talk documents our 6-month experiment to learn LLMOps from first principles while building internal tools like a resume-filtering RAG app.
The Experiment
Phase 1: Learning First Principles
Models: Started with lightweight models (Phi-3) → progressed to Llama 3 & Mistral for complex tasks.
Toolchain: Tested LangChain → hit limitations → migrated core logic to LlamaIndex for production needs.
Vector DBs: Ran Qdrant locally → stress-tested embedding storage/retrieval latency at scale (a minimal wiring sketch follows this list).
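For reference, here is a minimal sketch of the kind of wiring Phase 1 converged on: LlamaIndex indexing documents into a locally running Qdrant instance, with a local embedding model and an Ollama-served Llama 3 for generation. Package paths, model names, and the `resumes_v1` collection are illustrative assumptions (LlamaIndex import paths have shifted across versions), not the exact code from the talk.

```python
# Minimal sketch: LlamaIndex + local Qdrant + local models (illustrative only).
# Assumes: pip install llama-index llama-index-vector-stores-qdrant \
#   llama-index-embeddings-huggingface llama-index-llms-ollama qdrant-client
# plus a Qdrant container on localhost:6333 and an Ollama server with llama3 pulled.
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Keep embeddings and generation fully local.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=120.0)

# Point LlamaIndex's storage layer at the local Qdrant instance.
client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="resumes_v1")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Ingest documents (resumes in this example) and build the index.
documents = SimpleDirectoryReader("./resumes").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Retrieval-augmented query against the indexed resumes.
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("Which candidates list Kubernetes and GPU experience?"))
```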
Phase 2: Building Infrastructure Muscle
GPUs on Kubernetes: Deployed Ray/KubeRay clusters → tuned the GPU utilization vs cost tradeoff (see the serving sketch after this list).
Observability: Added metrics for prompt latency, token usage & DB query performance early (critical!); a metrics sketch follows below.
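As a rough illustration of the serving side, the sketch below shows a Ray Serve deployment that pins one GPU per replica and autoscales within a small range; on Kubernetes the same code runs inside a KubeRay-managed RayService. The Phi-3 model name, replica counts, and route are assumptions for illustration, not our exact production config.

```python
# Sketch: a GPU-pinned Ray Serve deployment (illustrative; run under KubeRay in k8s).
# Assumes: pip install "ray[serve]" transformers torch accelerate, and a visible GPU.
from ray import serve
from starlette.requests import Request
from transformers import pipeline


@serve.deployment(
    ray_actor_options={"num_gpus": 1},                        # one GPU per replica
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class Generator:
    def __init__(self) -> None:
        # Load a small open model onto the GPU assigned to this replica.
        # (Older transformers versions may also need trust_remote_code=True.)
        self.pipe = pipeline(
            "text-generation",
            model="microsoft/Phi-3-mini-4k-instruct",
            device_map="auto",
        )

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        out = self.pipe(body["prompt"], max_new_tokens=256)
        return {"completion": out[0]["generated_text"]}


# serve.run starts the HTTP proxy; POST JSON {"prompt": "..."} to /generate.
app = Generator.bind()
serve.run(app, route_prefix="/generate")
```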
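And a sketch of the "add observability early" point: exporting prompt latency, token counts, and vector-DB query latency as Prometheus metrics the existing SRE dashboards and alerts can consume. Metric names, buckets, and the helper functions are illustrative placeholders.

```python
# Sketch: Prometheus metrics for an LLM inference path (names/buckets illustrative).
# Assumes pip install prometheus-client; Prometheus scrapes the exporter on :9100.
import time

from prometheus_client import Counter, Histogram, start_http_server

PROMPT_LATENCY = Histogram(
    "llm_prompt_latency_seconds",
    "End-to-end latency per prompt",
    buckets=(0.5, 1, 2, 5, 10, 30, 60),
)
TOKENS_USED = Counter(
    "llm_tokens_total", "Tokens processed", labelnames=("direction",)
)
VECTOR_DB_LATENCY = Histogram(
    "vector_db_query_latency_seconds", "Vector DB query latency"
)


def count_tokens(text: str) -> int:
    # Placeholder tokenizer: real code would use the model's tokenizer.
    return len(text.split())


def retrieve_context(question: str) -> str:
    # Placeholder: real code queries Qdrant here.
    return ""


def generate(question: str, context: str) -> str:
    # Placeholder: real code calls the self-hosted model here.
    return "stub completion"


def answer(question: str) -> str:
    with PROMPT_LATENCY.time():                      # measures the whole request
        with VECTOR_DB_LATENCY.time():
            context = retrieve_context(question)
        completion = generate(question, context)
        TOKENS_USED.labels(direction="prompt").inc(count_tokens(question + context))
        TOKENS_USED.labels(direction="completion").inc(count_tokens(completion))
    return completion


if __name__ == "__main__":
    start_http_server(9100)   # expose /metrics for Prometheus to scrape
    print(answer("Which candidates know Kubernetes?"))
    time.sleep(60)            # keep the exporter alive long enough to be scraped
```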
Phase 3: Shipping Real Workflows
Built a resume-filtering RAG app (dogfooded internally).
Lessons learned: Prompt engineering ≠ one-time effort; versioning embeddings matters (see the aliasing sketch below); cold starts hurt UX.
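On the "versioning embeddings matters" lesson, one pattern that helps is to version Qdrant collections by embedding model and cut traffic over with a collection alias, so a re-embedding run can be validated before the switch. The sketch below is an assumption-laden illustration using qdrant-client (collection names, vector size, and alias name are made up), not the app's actual code.

```python
# Sketch: version embeddings by collection, switch atomically via a Qdrant alias.
# Assumes a recent qdrant-client and Qdrant on localhost:6333; names are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Each embedding model/version gets its own collection, e.g. resumes__bge_small__v2.
NEW_COLLECTION = "resumes__bge_small__v2"
client.create_collection(
    collection_name=NEW_COLLECTION,
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)

# ... re-embed all documents with the new model and upsert into NEW_COLLECTION ...

# Once validated, point the stable alias the app queries at the new collection.
client.update_collection_aliases(
    change_aliases_operations=[
        models.CreateAliasOperation(
            create_alias=models.CreateAlias(
                collection_name=NEW_COLLECTION,
                alias_name="resumes_prod",   # app code always queries this alias
            )
        )
    ]
)
```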
Key Takeaways
Start Small but Think Production
Moving from toy apps → internal tools → customer-facing pipelines requires rethinking infra at each step (e.g., scaling vector DBs).
Own Your Stack If…
Compliance/data privacy is non-negotiable
Steady-state inference demand justifies GPU capex (a rough break-even sketch follows this list)
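To make "steady-state demand justifies capex" concrete, a back-of-the-envelope comparison helps: amortized hardware plus power versus renting a comparable GPU by the hour. Every number below is a placeholder assumption to substitute with your own quotes and bills, not pricing data from the talk.

```python
# Back-of-the-envelope: owned GPU vs cloud GPU for steady-state inference.
# ALL numbers are placeholder assumptions; plug in your own figures.
GPU_PRICE = 30_000.0          # one-time hardware cost (currency units)
AMORTIZATION_MONTHS = 36      # write the card off over 3 years
POWER_AND_HOSTING = 250.0     # per month: electricity, rack space, spares
CLOUD_RATE_PER_HOUR = 2.50    # on-demand rate for a comparable cloud GPU
UTILIZATION_HOURS = 24 * 30   # steady-state: busy around the clock

owned_monthly = GPU_PRICE / AMORTIZATION_MONTHS + POWER_AND_HOSTING
cloud_monthly = CLOUD_RATE_PER_HOUR * UTILIZATION_HOURS

print(f"owned ≈ {owned_monthly:.0f}/month, cloud ≈ {cloud_monthly:.0f}/month")
# Break-even utilization: below this many GPU-hours/month, cloud is cheaper.
print(f"break-even ≈ {owned_monthly / CLOUD_RATE_PER_HOUR:.0f} GPU-hours/month")
```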
Avoid Framework Lock-In
LangChain is great for prototyping, but frameworks like LlamaIndex offer better control for SREs managing uptime/SLAs.
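One generic way to limit lock-in (not specific to either framework) is to hide retrieval behind a small interface that application code owns; swapping LangChain for LlamaIndex then touches one adapter. The Protocol and adapter below are illustrative names of ours, not part of either library's API.

```python
# Sketch: keep app code framework-agnostic behind a tiny retrieval interface.
# Retriever, LlamaIndexRetriever, filter_resumes are illustrative names, not library APIs.
from typing import List, Protocol


class Retriever(Protocol):
    """What the application actually needs from any RAG framework."""

    def retrieve(self, query: str, k: int = 5) -> List[str]: ...


class LlamaIndexRetriever:
    """Adapter over a LlamaIndex VectorStoreIndex."""

    def __init__(self, index) -> None:
        self._index = index

    def retrieve(self, query: str, k: int = 5) -> List[str]:
        nodes = self._index.as_retriever(similarity_top_k=k).retrieve(query)
        return [n.node.get_content() for n in nodes]


def filter_resumes(retriever: Retriever, requirement: str) -> List[str]:
    # Application logic depends only on the Protocol, never on a framework import,
    # so a LangChain-backed adapter can be swapped in without touching this code.
    return retriever.retrieve(requirement, k=10)
```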
Who Should Care?
This talk isn’t about AI theory; it’s a playbook for engineers tasked with operationalizing LLMs:
SREs/DevOps teams planning GPU clusters or hybrid cloud AI infra
Engineers struggling with OpenAI API costs/limitations
Teams building RAG apps that need vector DB + model tuning expertise