The Arbiter Has Spoken
March 14, 2026. San Francisco. Twenty-three teams walked into NEBULA:FOG:SINGULARITY with 24 hours of code and a live demo slot. What they didn’t expect: an AI judge watching every second.
The Arbiter is an autonomous AI judging system built specifically for this event. It listens to each demo in real-time through the Gemini Live API, captures presenter audio and screen content, detects prompt injection attempts (yes, people tried to hack the judge), and delivers sharp, British-accented commentary to the audience the moment each demo ends. It does not wait. It does not deliberate politely. It tells you exactly what it thinks.
Under the hood, every demo is scored by a multi-model ensemble — Gemini, Claude, and Groq each independently evaluate the team, then their scores are aggregated with outlier detection to produce a single balanced verdict. Three criteria: Technical Execution (40%), Innovation (30%), and Demo Quality (30%), plus a track-specific bonus worth up to 10%.
After all 25 demos, The Arbiter ran a cross-team comparative deliberation — analyzing every team against every other team to produce the final rankings. Below are the full, unfiltered results: the scoreboard, per-team breakdowns with criterion-level scores and AI-generated justifications, The Arbiter’s deliberation narrative, and the themes that emerged. Click any team card to expand their detailed analysis.
Podium
AgentRange
DoYouKnowWhatYouBuiltLastSummer
Genomics
Full Scoreboard
| # | Team | Track | Tech | Innov | Demo | Total |
|---|---|---|---|---|---|---|
| 1 | AgentRange | ROGUE::AGENT | 7.9 | 7.0 | 7.5 | 8.3 |
| 2 | DoYouKnowWhatYouBuiltLastSummer | ROGUE::AGENT | 7.9 | 7.0 | 7.5 | 8.3 |
| 3 | Genomics | ROGUE::AGENT | 7.0 | 7.9 | 6.5 | 7.9 |
| 4 | Igor | ROGUE::AGENT | 6.8 | 7.7 | 5.8 | 7.5 |
| 5 | Starcraft | ROGUE::AGENT | 6.1 | 5.8 | 5.5 | 6.5 |
| 6 | TeamKickass | SENTINEL::MESH | 6.0 | 5.5 | 5.5 | 6.3 |
| 7 | Kavin | ROGUE::AGENT | 5.0 | 7.0 | 4.5 | 6.1 |
| 8 | GossipProblem | ZERO::PROOF | 5.0 | 6.5 | 4.0 | 5.7 |
| 9 | WeNeedaName | ROGUE::AGENT | 5.0 | 5.5 | 4.0 | 5.3 |
| 10 | KC2 | ROGUE::AGENT | 5.0 | 4.5 | 5.0 | 5.2 |
| 11 | Overwatch | ROGUE::AGENT | 3.6 | 5.0 | 3.1 | 4.2 |
| 12 | KC | ROGUE::AGENT | 3.2 | 3.6 | 3.7 | 3.8 |
| 13 | AgentTrustGateway | ROGUE::AGENT | 3.0 | 5.0 | 2.1 | 3.7 |
| 14 | Tobias | ROGUE::AGENT | 2.5 | 5.0 | 2.1 | 3.5 |
| 15 | Winston | ROGUE::AGENT | 3.1 | 2.6 | 3.6 | 3.3 |
| 16 | RickToday | ROGUE::AGENT | 2.5 | 4.5 | 1.6 | 3.2 |
| 17 | Gabo | ROGUE::AGENT | 2.1 | 3.5 | 2.1 | 2.7 |
| 18 | ThirdParty | ROGUE::AGENT | 1.6 | 2.1 | 1.6 | 1.9 |
| 19 | Ticket Security Incite | ROGUE::AGENT | 1.6 | 1.6 | 1.1 | 1.6 |
| 20 | TeamPie | SHADOW::VECTOR | 1.6 | 1.6 | 1.1 | 1.5 |
| 21 | MagicThing | ROGUE::AGENT | 1.6 | 1.1 | 1.1 | 1.4 |
| 22 | Rick | ROGUE::AGENT | 1.6 | 1.1 | 1.1 | 1.4 |
| 23 | Test | ROGUE::AGENT | 1.4 | 1.1 | 1.4 | 1.4 |
Team Breakdowns
Click any team to expand their detailed scores and Arbiter analysis.
The Arbiter’s Deliberation
Notable Themes
- Catastrophic presentation technical failures: 15+ teams experienced OBS placeholder screens, muted cameras, or complete video feed loss, suggesting systematic issues with streaming infrastructure or setup guidance
- Audio quality as the great differentiator: Teams with clear audio (Kavin, WeNeedaName, ThirdParty-shapor) could at least communicate ideas despite visual failures, while those with corrupted audio (Igor, Starcraft, Gabo) became incomprehensible regardless of demo length
- Supply chain security convergence: Multiple teams (WeNeedaName, ThirdParty, ThirdParty-shapor) independently tackled npm/dependency security, suggesting this is a recognized pain point, but with wildly divergent execution quality
- Track misalignment: Many ROGUE::AGENT teams built generic security tools (email phishing, social media content moderation) without clear connection to AI agent-specific threats, suggesting either track requirements were unclear or teams pivoted from original ideas
- The 'demo duration paradox': Longer demos (600s) didn't correlate with higher scores—Kavin and Starcraft used full time but scored 6.1-6.5, while Genomics achieved 7.9 in 282s, suggesting judges valued focus over comprehensiveness
- The missing middle: Very few teams scored in the 7.0-8.0 range (only Genomics and Igor), creating a gap between top tier (8.3) and upper-middle (6.5), suggesting judges saw clear quality breaks rather than gradual gradations
- Zero-score mystery: ThirdParty-shapor's 0.0 despite detailed technical architecture suggests either non-functional demos receive no credit regardless of design quality, or undocumented disqualification criteria are in play
- Team naming as signal: Generic names (MagicThing, WeNeedaName, Test) correlated with lower scores, while specific names (AgentTrustGateway, GossipProblem, Starcraft) suggested clearer product vision even when execution failed