The team and the problem
A platform engineering team at a growth-stage B2B company needed to choose a message queue architecture for a new internal event pipeline. The pipeline would handle user activity events, audit logs, and downstream workflow triggers. They had a team of six backend engineers, no dedicated infrastructure engineer, and a 90-day timeline to get something into production.
The choice was between two options that had become the focal points of a three-week engineering debate:
- Kafka, self-managed on EKS — preferred by two engineers with prior Kafka experience, valued for throughput ceiling and operational control
- SQS with EventBridge — preferred by three engineers without Kafka experience, valued for managed infrastructure and faster time to operational readiness
Both options were technically viable. This was not an evaluation where one side was wrong. It was a situation where engineers with different risk profiles and different operational intuitions were optimizing for different things — and neither side had made those tradeoffs explicit enough to resolve the disagreement.
After three weeks and no convergence, the lead decided to run a structured debate.
What was making the decision hard
The surface-level disagreement was about Kafka versus SQS. The underlying disagreement was about two competing risk assessments.
The infrastructure-leaning engineers were worried about scale risk: choosing a managed queue now and hitting a throughput ceiling in eighteen months, at which point migration would be painful and expensive. They had seen this happen before at other companies and trusted their intuition that event volume projections always underestimate growth.
The application-leaning engineers were worried about execution risk: committing to self-managed Kafka with a team that had limited Kafka operational experience, on a 90-day timeline, while simultaneously delivering three other product initiatives. They had seen Kafka deployments go badly wrong for exactly this kind of team profile.
Both risks were real. Neither side had quantified their risk assessment in a way that allowed direct comparison.
The real decision underneath
Architecture decisions that stall for more than a week are almost always hiding an unresolved assumption conflict. In this case, the conflict was about probability estimates for scale scenarios — not about which technology was better in the abstract.
The debate structure
The team submitted the decision as a structured debate with the following question:
Given our current team (6 backend engineers, no dedicated infra), a 90-day timeline to production, and an expected event volume of under 500k/day for the first year, should we build on Kafka (self-managed on EKS) or SQS with EventBridge? The cost of reversing this decision after 12 months of usage is approximately 3 engineer-months.
The debate was run with four agents: an advocate and a critic for each option. The cross-examination round forced each side to argue the other's strongest case.
The three outputs that mattered
The debate produced a 1,200-word synthesis document. Most of it was expected. Three parts were not.
Explicit assumption inventory
The Kafka advocate's case assumed throughput would reach 2M events/day within 18 months. The SQS advocate's case assumed 200k/day. These projections had never been placed side by side. When they were, it became immediately clear that the disagreement was not about which technology was better but about which volume scenario was more likely.
The engineering lead later said this was the moment the decision became tractable: "Once we saw the actual numbers next to each other, we realized we weren't arguing about Kafka vs SQS. We were arguing about whether we were going to grow like Stripe or grow like us."
Probability-weighted expected cost
The debate produced a cost comparison that neither side had attempted in three weeks of discussion:
| Scenario | SQS | Kafka |
|---|---|---|
| Volume stays under 500k/day at 12 months | ~$800/mo; no migration cost | ~$1,200/mo + 20% on-call overhead |
| Volume reaches 1M+/day at 12 months | ~$4,200/mo; migration cost ~3 eng-months | ~$1,800/mo; no migration needed |
| Kafka deployment fails timeline (est. 50% probability) | No impact | Delayed launch + 1–2 eng-months recovery |
At the team's internal estimate of 30% probability that volume would exceed 1M/day within twelve months, the expected-cost advantage of SQS was substantial — once the execution risk of the Kafka deployment was factored in.
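That conclusion is easy to reproduce. The sketch below recomputes the expected 12-month cost of each option from the table above. The dollar figures and probabilities are the ones from the debate; the fully loaded cost of an engineer-month and the decision to value the 20% on-call overhead as a fifth of one engineer in both volume scenarios are hypothetical fill-ins, since the team's actual staffing figures were not published.

```python
# Expected 12-month cost of each option, using the figures from the table
# above. ENG_MONTH_USD and the on-call valuation are hypothetical
# assumptions, not the team's numbers.

ENG_MONTH_USD = 20_000   # assumed fully loaded cost of one engineer-month
P_HIGH_VOLUME = 0.30     # team's estimate: volume exceeds 1M/day within 12 months
P_KAFKA_SLIP = 0.50      # debate's estimate: Kafka misses the 90-day timeline
MONTHS = 12

# SQS: cheap while volume stays under 500k/day; expensive, plus a ~3
# engineer-month migration, if volume reaches 1M+/day.
sqs_low = 800 * MONTHS
sqs_high = 4_200 * MONTHS + 3 * ENG_MONTH_USD
sqs_expected = (1 - P_HIGH_VOLUME) * sqs_low + P_HIGH_VOLUME * sqs_high

# Kafka: moderate infrastructure cost in either volume scenario, plus the
# 20% on-call overhead (valued as a fifth of one engineer, assumed to apply
# in both scenarios) and the expected cost of a failed 90-day deployment.
oncall = 0.20 * ENG_MONTH_USD * MONTHS
kafka_low = 1_200 * MONTHS + oncall
kafka_high = 1_800 * MONTHS + oncall
kafka_expected = (
    (1 - P_HIGH_VOLUME) * kafka_low
    + P_HIGH_VOLUME * kafka_high
    + P_KAFKA_SLIP * 1.5 * ENG_MONTH_USD  # midpoint of the 1-2 eng-month recovery
)

print(f"SQS expected 12-month cost:   ${sqs_expected:,.0f}")    # ~$39,840
print(f"Kafka expected 12-month cost: ${kafka_expected:,.0f}")  # ~$79,560
```

Under these assumptions SQS comes out at roughly half the expected first-year cost of Kafka. The exact numbers matter less than the structure: the comparison is now a function of a few explicit probabilities, which is what the three weeks of discussion had been missing.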
Invalidation conditions with monitoring thresholds
The debate output identified a specific threshold: if SQS usage costs exceeded $3,000/month before the 12-month mark, the financial case for migration to Kafka would become compelling regardless of throughput. The team agreed to add this as a monitoring alert.
What monitoring thresholds do
An invalidation condition with a specific monitoring threshold converts a one-time architecture decision into a managed bet. The team is not committing to SQS forever — they are committing to SQS until a defined condition is met, at which point reassessment is automatic. This resolves the "what if we're wrong?" anxiety that keeps architecture debates open long past their natural close.
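As an illustration of what this looks like in practice, the $3,000/month condition maps onto a CloudWatch billing alarm, and the 600k events/day trigger recorded in the team's decision document maps onto a queue-level metric alarm. The boto3 sketch below is hypothetical: the alarm names, queue name, and SNS topic ARN are placeholders, billing metrics are only published in us-east-1, and they require billing alerts to be enabled on the account.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder topic; in practice this would notify whoever owns the
# reassessment, not page on-call.
REASSESS_TOPIC = "arn:aws:sns:us-east-1:123456789012:arch-reassessment"

# Invalidation condition 1: SQS spend passes $3,000 in a calendar month.
# EstimatedCharges is the month-to-date figure, so a simple threshold works.
cloudwatch.put_metric_alarm(
    AlarmName="event-pipeline-sqs-spend-reassess",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[
        {"Name": "ServiceName", "Value": "AmazonSQS"},
        {"Name": "Currency", "Value": "USD"},
    ],
    Statistic="Maximum",
    Period=21_600,            # billing metrics update a few times a day
    EvaluationPeriods=1,
    Threshold=3_000.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[REASSESS_TOPIC],
)

# Invalidation condition 2: daily volume passes 600k events.
cloudwatch.put_metric_alarm(
    AlarmName="event-pipeline-sqs-volume-reassess",
    Namespace="AWS/SQS",
    MetricName="NumberOfMessagesSent",
    Dimensions=[{"Name": "QueueName", "Value": "event-pipeline"}],  # placeholder
    Statistic="Sum",
    Period=86_400,            # one-day window of sends
    EvaluationPeriods=1,
    Threshold=600_000.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[REASSESS_TOPIC],
)
```

Routing both alarms to one topic makes the reassessment an event that arrives on its own, rather than a review someone has to remember to schedule.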
The outcome
The team chose SQS with EventBridge. The decision was made in approximately one hour of reviewing the debate output, compared with the three weeks of inconclusive discussion that preceded it.
The final decision document included:
- The question as submitted to the debate
- The key assumption conflict (volume projections)
- The probability-weighted cost analysis
- The monitoring thresholds ($3,000/month in SQS spend or 600k events/day, either one triggering reassessment)
- The names of the engineers who had reviewed and signed off
The pipeline shipped within the 90-day timeline. At the twelve-month mark, volume was tracking at approximately 340k/day. The SQS cost was under $900/month.
What changed about how the team makes decisions
The architectural decision itself was the expected output. The unexpected output was a shift in how the team structured its technical debates.
The engineering lead introduced a lightweight version of the same format for subsequent decisions: any architectural discussion that had been unresolved for more than five business days was required to produce an explicit assumption inventory before the next meeting. Two questions were required: what volume or scale scenario is each option optimized for, and what is the estimated cost of migrating away from each option at twelve months?
This did not replace engineering discussion. It focused it. By the time teams arrived at a decision meeting, the disagreement had been reduced to its core — usually a small number of probability estimates for scenarios where the options diverged — and the conversation could start there.
The broader principle
Architecture decisions do not become easy when you use structured debate. They become tractable. The debate does not tell you which technology to choose; it tells you what you are actually disagreeing about, so that human judgment can be applied where it is needed.
Limitations the team observed
The debate assumed accurate inputs. The event volume projections fed into the cost model were the team's internal estimates — not worst-case numbers selected to support a preferred conclusion. The quality of the output depended on the quality of the inputs.
The team's post-decision review identified one data point they had underspecified: the staffing cost of Kafka operations. The debate modeled this as a percentage of on-call load, but the actual cost — in terms of specialized knowledge required, operational runbooks to be built, and incident response complexity — was higher than the model captured.
This did not change the outcome. But it was a reminder that structured debate is better at surfacing assumption conflicts between clearly quantified factors than at surfacing costs that are hard to quantify in advance.