Mar 5, 2026 | 4 min read

Why your Architecture should start with Questions, not boxes

Chinmay Naik

CEO @One2N

Recently, I was reviewing an architecture design document. The document had a diagram showing all the components - API Gateway feeding into three microservices, Kafka in the middle, separate databases for each service, Redis for caching, and Kubernetes as an orchestration layer.

What that doc did not have was questions or stated assumptions. It took a bunch of business context for granted without spelling any of it out.

I asked these questions and more:

What's the actual business problem we're solving?
What are the skill sets of the team that’s supposed to build this software?
How many users do we expect on day 0 and what does the user growth look like?

It turned out that the “architect” hadn’t thought about many of these during their system design. They proposed a design before really understanding the system's constraints. They never asked the questions that mattered. They just drew the boxes that looked "correct" based on what big tech companies do.

I spent the next couple of hours working with the architect to improve the document. I’ve seen this pattern so often that I wrote this post to explain why your architecture should start with questions, not boxes.

Why start with questions first

Before you draw a single box in the architecture document, you need to understand the constraints of the system you’re building. Not the tech and tools you’ll be using, but the constraints. Why? Because engineering is all about trade-offs, and there is no silver bullet. Here’s an example of how you can think questions-first, instead of architecture-first.

Say you’re designing a notification system. Your first instinct might be to reach for Kafka and background workers. But in my opinion, the right first instinct is to ask a bunch of questions like these:

What's the notification volume? Per day, per hour, per second?
What's the delivery SLA? Does it matter if a notification arrives in 1 second vs 10 seconds?
What's the budget for infrastructure?
What happens if a notification fails? Do we retry? If yes, how many times?
What's the expected growth over the next year?
What's the bare minimum V0 that delivers value?

These questions will help you discover the real requirements, the things that actually matter for the problem domain. Without them, you're designing in the dark. Knowing the answers to even some of these questions will help you pick the right tools and technology.
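To make the volume question concrete, here's a rough back-of-the-envelope sketch. Every number in it is hypothetical (1 million notifications per day, a 10x peak factor, 10% monthly growth), and the function names are mine for illustration. Plug in the answers you get from your own stakeholders.

```python
# Back-of-the-envelope sizing for a notification system.
# All inputs below are hypothetical placeholders, not real requirements.

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(notifications_per_day: float) -> float:
    """Average send rate if traffic were spread evenly across the day."""
    return notifications_per_day / SECONDS_PER_DAY


def peak_rate_per_second(notifications_per_day: float, peak_factor: float) -> float:
    """Estimated peak rate, assuming peaks are `peak_factor` times the average."""
    return average_rate_per_second(notifications_per_day) * peak_factor


def projected_daily_volume(
    notifications_per_day: float, monthly_growth: float, months: int
) -> float:
    """Daily volume after `months` of compounding growth (0.10 means 10% per month)."""
    return notifications_per_day * (1 + monthly_growth) ** months


if __name__ == "__main__":
    daily = 1_000_000  # hypothetical day-0 volume
    print(f"average: {average_rate_per_second(daily):.1f}/s")   # average: 11.6/s
    print(f"peak:    {peak_rate_per_second(daily, 10):.1f}/s")  # peak:    115.7/s
    print(f"in 12 months: {projected_daily_volume(daily, 0.10, 12):,.0f}/day")
```

With these assumed inputs, a million notifications a day is only about 12 per second on average. That is the kind of number that tells you whether you need a streaming platform at all.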

Why good architects ask dumb questions

Some of the best architects I have worked with share one trait. They ask questions that sound almost naive:

Why do we need this to be real-time?
What happens if we just don't build this feature?
Can we start with a cronjob and see if that works?

These questions annoy people who want to jump to solutions. But these are the questions that prevent you from over-engineering a system with a Kafka cluster when a simple database table would work fine.
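Here's a minimal sketch of what "a simple database table plus a cronjob" could look like for notifications. It is an illustration under assumptions, not a production design: the table layout, the `MAX_ATTEMPTS` limit, and the `send` callback are all hypothetical, it uses SQLite from Python's standard library, and it assumes only one worker runs at a time.

```python
import sqlite3

MAX_ATTEMPTS = 3  # hypothetical retry limit; pick yours from the failure-handling answers


def init_db(conn: sqlite3.Connection) -> None:
    """Create the notifications table that doubles as a queue."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS notifications (
            id        INTEGER PRIMARY KEY,
            recipient TEXT    NOT NULL,
            message   TEXT    NOT NULL,
            status    TEXT    NOT NULL DEFAULT 'pending',
            attempts  INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    conn.commit()


def enqueue(conn: sqlite3.Connection, recipient: str, message: str) -> None:
    """Queue a notification by inserting a row."""
    conn.execute(
        "INSERT INTO notifications (recipient, message) VALUES (?, ?)",
        (recipient, message),
    )
    conn.commit()


def process_pending(conn: sqlite3.Connection, send, batch_size: int = 100) -> int:
    """Send one batch of pending notifications. Meant to be run on a schedule.

    `send(recipient, message)` is a caller-supplied function that raises on failure.
    Returns the number of notifications sent in this run.
    """
    rows = conn.execute(
        "SELECT id, recipient, message, attempts FROM notifications "
        "WHERE status = 'pending' ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for notif_id, recipient, message, attempts in rows:
        try:
            send(recipient, message)
        except Exception:
            # Leave the row pending for the next run until the retry limit is hit.
            attempts += 1
            status = "failed" if attempts >= MAX_ATTEMPTS else "pending"
            conn.execute(
                "UPDATE notifications SET status = ?, attempts = ? WHERE id = ?",
                (status, attempts, notif_id),
            )
        else:
            conn.execute(
                "UPDATE notifications SET status = 'sent' WHERE id = ?", (notif_id,)
            )
            sent += 1
    conn.commit()
    return sent


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    enqueue(conn, "user@example.com", "Your order has shipped")
    print(process_pending(conn, lambda recipient, message: None))  # prints 1
```

If the volume and SLA answers say a run every minute is good enough, something this small may carry you a long way. When it stops being enough, you'll know exactly which constraint broke it.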

There’s also a knack for asking these questions. I tell engineers all the time:

There's a difference between curious questions and performance questions.

Curious questions sound like:
Help me understand why we chose this approach?
What problem were we originally trying to solve?
What happens if we don't do this at all?

Performance questions sound like:
Why didn't anyone think of this obvious solution?
Shouldn't we be using X instead?
Who decided this was a good idea?

Curious questions help discover hidden assumptions in the system and allow teams to see things in a different, collaborative way. They make it safe to say "I don't know yet, but let's figure it out".

Performance questions make teams defensive. They turn team conversations into a blame game.

If you want to design good systems, you have to create space for the dumb-sounding questions.

Production is where you learn which questions and decisions matter

Here's the thing about questions, and it applies equally to the decisions we make as software architects: you learn which ones matter by staying with systems long enough to see them fail. People who change jobs every 1.5 years over a 10-year career don’t accumulate 10 years of experience. They get 1.5 years of experience repeated about 7 times. You need to stick around with a company, and with the architecture you build, to see the effects of your decisions. This is one of the reasons we follow a “You build it, you run it” mentality at One2N.

Many so-called “architects” never ask about observability, security and other non-functional requirements. I used to be one of them. I’d design systems, ship them, and move on. Then I'd get a call at 2 AM because something was down, and I'd realize my beautiful architecture had zero visibility into what was actually happening.

Now "How will we know when this is broken?" is one of my first questions.

You learn this by getting into the water. You can't learn it from blogs or newsletters or whiteboard interviews. You have to build systems, watch them run in production, and see where your assumptions were wrong.

When you stay with a system for a couple of years, you start asking different questions:

How do we deploy this without downtime?
What happens when this database grows to 2TB?
How do we debug this when Kubernetes decides to restart a pod?
What's our plan when the upstream API starts rate-limiting us?

These aren't questions you think to ask when you're drawing boxes. They're questions you learn to ask after the boxes have failed you in production.

The constraint questions you should start asking

By now, hopefully you understand why you should start your architecture with questions instead of boxes. Here's a checklist I recommend as a starting point. Expand it as you discover more business context for your domain.

Volume and Scale:
What's the current data volume?
What's the expected growth rate?
What's the read/write ratio?

Performance:
What's the acceptable latency (P90, P99) for API calls?
What's the throughput requirement?
Are there any hard SLAs that the business has to commit to?

Reliability:
What's the cost of downtime?
What's the cost of data loss?
What's our actual uptime requirement?

Operational:
What's our team's expertise and who’s going to maintain the solution?
What are the budget constraints and unit economics?
What's already running in production that we can continue to use?

Business:
What's the bare minimum V0 that delivers value?
What can we defer to V1?
What's the actual problem we're solving?

Answer these before you draw anything. If you can't answer them, go find out. If no one can answer them, that's a red flag that you're solving the wrong problem.
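Two of the checklist items, the uptime requirement and the P90/P99 latency, are easier to discuss once they are turned into numbers. Here's a small sketch of that arithmetic. The 30-day window, the nearest-rank percentile method, and the sample latencies are my assumptions for illustration. Your monitoring tool may compute percentiles differently.

```python
import math


def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime an uptime target permits over a window of `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample that at least p% of samples do not exceed."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% uptime -> {allowed_downtime_minutes(target):.1f} min per 30 days")
    # 99.0% uptime -> 432.0 min per 30 days
    # 99.9% uptime -> 43.2 min per 30 days
    # 99.99% uptime -> 4.3 min per 30 days

    # Hypothetical API latencies in milliseconds.
    latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 900, 14]
    print(percentile(latencies_ms, 50))  # 14
    print(percentile(latencies_ms, 90))  # 250
    print(percentile(latencies_ms, 99))  # 900
```

Notice how the median looks healthy at 14 ms while P90 and P99 tell a very different story. That gap is why the checklist asks for percentiles instead of an average.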

If you're an architect who wants to stop drawing boxes and start asking questions, we're hiring. Check out our careers page.
