## The question that wasted everyone's time
A VP of Engineering at a Series B company submitted this question to their team's AI debate tool:
> Should we adopt a microservices architecture?
The output came back: a balanced analysis of microservices versus monoliths, references to Martin Fowler's writings, a recommendation to "consider a hybrid approach," and a note that "the right answer depends on your team's specific situation."
It was accurate. It was not useful. The team already knew microservices had tradeoffs. They were not asking whether microservices was generally a good idea. They were asking whether they should migrate their specific system, with their specific team, under their specific constraints — and none of that was in the question.
Two weeks later, after a frustrating planning session, someone rephrased it:
> We have a 140k-line Rails monolith, a team of 8 backend engineers (2 of whom have microservices experience), a target of 3 new product launches in the next 12 months, and an on-call rotation that already absorbs 20% of engineering time. Should we begin decomposing into microservices now, or extend the monolith for the next 18 months and reassess? The cost of starting now and reversing is roughly 6 engineer-months. The cost of waiting and needing to migrate later under growth pressure is unknown but probably higher.
The debate output was different in kind, not just in degree. It engaged the actual constraints. It identified that two of the six risk factors for microservices migration failure were present in this team's situation. It recommended waiting — with a specific monitoring condition (on-call load dropping below 12%) that would trigger reassessment.
The question did not just add detail. It changed what the system was able to evaluate.
## Why most decision questions fail
Most teams have never been taught to write decision questions. They write topics — areas of uncertainty that could generate discussion — and expect the AI to turn those into decisions. The gap between a topic and a decision question is where most of the value gets lost.
### The four most common failures
These patterns consistently produce low-quality debate output, regardless of how good the underlying AI system is.
**Too broad.** A topic masquerading as a question. "Should we expand internationally?" is a conversation starter, not a decision question. A decision question specifies what expansion means, where, under what constraints, and at what cost.

**No constraints.** "Is this the right vendor?" is unanswerable without knowing what "right" means in this context. Right at what price? Against which alternatives? On what timeline? With what integration requirements?

**Embedded conclusion.** "How should we best implement real-time notifications?" assumes real-time notifications are the right solution. A real decision question challenges the assumption: "Should we implement real-time notifications via WebSockets, or is polling every 30 seconds sufficient for our user base's actual needs?"

**Missing stakes.** The debate system performs substantially better when it knows why the decision matters. A question that includes the cost of being wrong — in time, money, or opportunity — anchors the agents toward useful analysis. Without stakes, you get a literature review, not a decision.
## The four-component framework
A good decision question for structured AI debate has four components. Every component makes the output more actionable.
| Component | What it does | Example |
|---|---|---|
| The choice | Defines what is actually being decided between | "WebSockets vs. 30-second polling" |
| The constraints | Sets the boundaries the answer must satisfy | "4-person team, 6-week deadline, no DevOps resource" |
| The success criteria | Defines what winning looks like | "Sub-2s latency for 95% of users; zero new on-call alerts" |
| The cost of being wrong | Anchors the analysis to actual stakes | "If WebSockets is wrong: 3 months of rework. If polling is wrong: 6 months of user complaints and churn." |
A question with all four components produces a verdict with explicit confidence, the strongest argument against the recommendation, and the conditions that would invalidate it. A question missing any component tends to produce hedged analysis that feels thorough but cannot be acted on.
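The four components amount to a checklist you can enforce before running a debate. A minimal sketch in Python — the `DecisionQuestion` structure and its field names are illustrative, not part of any real debate tool:

```python
from dataclasses import dataclass, fields

@dataclass
class DecisionQuestion:
    """The four components of a well-formed decision question."""
    choice: str               # what is actually being decided between
    constraints: str          # boundaries the answer must satisfy
    success_criteria: str     # what winning looks like
    cost_of_being_wrong: str  # stakes that anchor the analysis

def missing_components(q: DecisionQuestion) -> list[str]:
    """Return the names of any components left empty."""
    return [f.name for f in fields(q) if not getattr(q, f.name).strip()]

q = DecisionQuestion(
    choice="WebSockets vs. 30-second polling",
    constraints="4-person team, 6-week deadline, no DevOps resource",
    success_criteria="Sub-2s latency for 95% of users; zero new on-call alerts",
    cost_of_being_wrong="",  # left blank: the most commonly omitted component
)
print(missing_components(q))  # -> ['cost_of_being_wrong']
```

Running the check before submitting the question catches the "missing stakes" failure mode mechanically rather than after a disappointing debate.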
## Comparing bad and good questions
Here are side-by-side examples across different decision types. The bad versions are not hypothetical — they are composites of actual questions that produced low-value output.
| Bad question | Good question |
|---|---|
| Should we use Postgres or MongoDB? | We have 200k records, complex relational data, 3 engineers with strong SQL backgrounds, and a 90-day deadline. Postgres or MongoDB for the primary store? Migration cost after 12 months is estimated at 4 engineer-weeks. |
| Should we hire a head of sales? | We are at $800k ARR, 60% from inbound, CEO is closing all enterprise deals. Do we hire a VP Sales now at $180k+equity, or promote an existing AE and reassess at $1.5M ARR? Wrong hire costs 6+ months of runway. |
| Is this the right GTM strategy? | Given 14k free users, a 6-person sales team, and a Q2 target of 30 new Pro customers at $200/mo, which motion — self-serve conversion, outbound to mid-market, or content-led demand gen — gives us the highest probability of hitting the target? |
| Should we rebuild the mobile app? | Our React Native app has a 2.1-star rating, 40% of support tickets are app-related, and our main competitor launched a native app last quarter. Do we invest 4 months rebuilding in Swift/Kotlin, or fix the top 10 issues in React Native? Rebuild cost is 2 engineers for 4 months. Patch cost is 1 engineer for 6 weeks. |
The pattern is consistent: specific constraints, named alternatives, explicit success criteria, and a cost for getting it wrong.
## When to attach documents
For decisions involving existing material — contracts, technical specs, financial models, vendor proposals — attaching the source document substantially improves the quality of debate output.
Without the document, agents generate plausible assumptions about what the document probably says. With the document, they engage the specific clauses, numbers, and commitments.
### High-value cases for attachments
Document attachment matters most for:

- Vendor or procurement decisions (attach the proposals and RFP)
- Legal or compliance questions (attach the specific contract or regulation)
- Architecture decisions (attach the existing ADR or technical spec)
- Financial decisions (attach the model with assumptions visible)
A debate that engages "the vendor's contract has a 30-day termination clause with a 3-month wind-down period" is more useful than a debate that reasons from "typical SaaS contracts usually include...".
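Without assuming any particular tool's API, the mechanics can be as simple as inlining the document verbatim ahead of the question so agents argue from actual clauses. A hypothetical helper — the function name, labels, and sample text are mine:

```python
def question_with_document(question: str, document_text: str, label: str) -> str:
    """Prepend the source document verbatim so the debate engages its
    actual clauses instead of assumptions about 'typical' terms."""
    return (
        f"--- {label} (verbatim) ---\n"
        f"{document_text}\n"
        f"--- end {label} ---\n\n"
        f"{question}"
    )

prompt = question_with_document(
    "Should we renew this contract at the proposed 18% uplift?",
    "Section 9.2: Either party may terminate with 30 days' written notice, "
    "subject to a 3-month wind-down period.",
    "vendor contract",
)
```

The point is not the formatting; it is that the specific termination clause is now in front of the agents instead of being guessed at.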
## Writing the question: a template
If you are not sure where to start, this template covers the essential components:
> We are deciding between [Option A] and [Option B].
>
> Our constraints are: [list the actual constraints — budget, timeline, team size, technical environment].
>
> Success looks like: [what does the right decision enable?].
>
> If we choose [Option A] and it is wrong: [specific downside — time, money, relationship].
>
> If we choose [Option B] and it is wrong: [specific downside].
>
> Given these constraints, which option is more likely to produce [the success criterion]?

This is not the only format. But it forces you to articulate the constraints and costs before you run the debate — and that exercise alone frequently clarifies whether you actually have a decision to make or whether more information is needed first.
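For teams that submit questions programmatically, the template can be turned into a small builder that refuses to produce a question until every slot is filled. An illustrative sketch; the function and parameter names are mine, not any tool's API:

```python
def build_question(option_a: str, option_b: str, constraints: str,
                   success: str, wrong_a: str, wrong_b: str) -> str:
    """Assemble the decision-question template, failing loudly
    if any component is left blank."""
    slots = {"option_a": option_a, "option_b": option_b,
             "constraints": constraints, "success": success,
             "wrong_a": wrong_a, "wrong_b": wrong_b}
    blank = [name for name, value in slots.items() if not value.strip()]
    if blank:
        raise ValueError(f"fill these components first: {blank}")
    return (
        f"We are deciding between {option_a} and {option_b}.\n"
        f"Our constraints are: {constraints}.\n"
        f"Success looks like: {success}.\n"
        f"If we choose {option_a} and it is wrong: {wrong_a}.\n"
        f"If we choose {option_b} and it is wrong: {wrong_b}.\n"
        f"Given these constraints, which option is more likely to "
        f"produce {success}?"
    )
```

The `ValueError` is deliberate: an incomplete question should block the debate, not degrade it silently.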
## Reading the output
A well-formed verdict from a structured debate with a good question includes:
- A recommendation with explicit confidence — not just a direction but a degree of certainty
- The strongest argument against the recommendation — what the losing side got right
- Invalidation conditions — the specific scenarios under which the recommendation is wrong
- The next-best information — what data point, if obtained, would most change the confidence level
Items three and four are the most operationally useful. Invalidation conditions tell you what to monitor after the decision is made. Next-best information tells you whether it is worth delaying the decision to gather more data first.
A verdict that lacks these last two components usually means the question was underspecified. The system could not derive the conditions because the stakes were not clear enough in the input.
When that happens, the fix is almost always to add more constraints and more explicit costs to the question — not to run the debate again with different agents.
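That diagnostic can be applied mechanically: encode the verdict shape and flag any output missing the two operationally useful parts. The `Verdict` structure and field names below are illustrative, not a real tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    recommendation: str
    confidence: float                 # e.g. 0.7 = 70% confident
    strongest_counterargument: str
    invalidation_conditions: list[str] = field(default_factory=list)
    next_best_information: str = ""

def question_was_underspecified(v: Verdict) -> bool:
    """Heuristic from above: a verdict without invalidation conditions
    or next-best information usually means the question lacked
    constraints or explicit costs."""
    return not v.invalidation_conditions or not v.next_best_information.strip()
```

When the check fires, the remedy is the one stated above: rewrite the question with more constraints and costs, then rerun.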