About the role:
We are looking for a Staff Site Reliability Engineer who can operate across multiple teams and clients. If you care about designing reliable platforms, influencing system architecture, and raising reliability standards across teams, you’ll enjoy working at One2N.
At One2N, you will work with our startup and enterprise clients, solving One-to-N scale problems where the proof of concept is already established and the focus is on scalability, maintainability, and long-term reliability. In this role, you will drive reliability, observability, and infrastructure architecture across systems by influencing design decisions, defining best practices, and guiding teams to build resilient, production-grade systems.
Key responsibilities:
Own and drive reliability and infrastructure strategy across multiple products or client engagements.
Design and evolve platform engineering and self-serve infrastructure patterns used by product engineering teams.
Lead architecture discussions around observability, scalability, availability, and cost efficiency.
Define and standardize monitoring, alerting, SLOs/SLIs, and incident management practices.
Build and review production-grade CI/CD and IaC systems used across teams.
Act as an escalation point for complex production issues and incident retrospectives.
Partner closely with engineering leads, product teams, and clients to influence system design decisions early.
Mentor junior engineers through design reviews, technical guidance, and best practices.
Improve Developer Experience (DX) by reducing cognitive load, toil, and operational friction.
Help teams mature their on-call processes, reliability culture, and operational ownership.
Stay ahead of trends in cloud-native infrastructure, observability, and platform engineering, and bring relevant ideas into practice.
About you:
8+ years of experience in SRE, DevOps, or software engineering roles.
Strong experience designing and operating Kubernetes-based systems on AWS at scale.
Deep hands-on expertise in observability and telemetry, including tools like OpenTelemetry, Datadog, Grafana, Prometheus, ELK, Honeycomb, or similar.
Proven experience with infrastructure as code (Terraform, Pulumi) and cloud architecture design.
Strong understanding of distributed systems, microservices, and containerized workloads.
Ability to write and review production-quality code (Golang, Python, Java, or similar).
Solid Linux fundamentals and experience debugging complex system-level issues.
Experience driving cross-team technical initiatives.
Excellent analytical and problem-solving skills, keen attention to detail, and a passion for continuous improvement.
Strong writing, communication, and collaboration skills, with the ability to work effectively in a fast-paced, agile environment.
Nice to have:
Experience working in consulting or multi-client environments.
Exposure to cost optimization or large-scale AWS account management.
Experience building internal platforms or shared infrastructure used by multiple teams.
Prior experience influencing or defining engineering standards across organizations.
Staff Site Reliability Engineer
Full-time, Location: Pune/Bangalore
Allows Remote
Looking for other roles?
Check out our Careers Page