ARBITER SCOREBOARD
In January 2025, we ran NEBULA:FOG:PRIME — a pilot hackathon where 15 teams built AI x Security tools and demoed them live. Then we built something new for 2026: an AI judge called The Arbiter. We pointed it at every PRIME demo video to see what it would do. It watched every frame, read every transcript, and scored each team with zero mercy and zero politics.
Below are the full, unfiltered results — the scoreboard, per-team breakdowns, The Arbiter’s deliberation, and what it all means for the main event in March 2026. PRIME didn’t have formal tracks — The Arbiter retroactively categorized each demo into the 2026 track framework (ROGUE::AGENT, SENTINEL::MESH, SHADOW::VECTOR) to preview how its scoring will work live.
// TOP 3
Nebula Fog Subprime: Complete end-to-end attack chain from AI-generated phishing to credential harvesting. The only team that understood the offensive assignment.
Watch Demo
Plan AI: Multi-source research aggregation pulling real results from Coursera, Stack Overflow, and academic sites. Revolutionary concept: building something that actually works.
Watch Demo
AI Vulnerability Triage: Solid engineering with Checkov integration for automated Terraform security scanning. Clean Python architecture with proper type hints and modular design.
Watch Demo
// FINAL RANKINGS
| RK | TEAM | TRACK | SCORE |
|---|---|---|---|
| 1 | Plan AI | ROGUE::AGENT | 9.1 |
| 2 | Nebula Fog Subprime | ROGUE::AGENT | 8.8 |
| 3 | AI Vulnerability Triage | SENTINEL::MESH | 8.5 |
| 4 | Nebula Investigations | SENTINEL::MESH | 8.4 |
| 5 | Fake Content Generation | ROGUE::AGENT | 8.4 |
| 6 | NextGen SAST | SENTINEL::MESH | 8.1 |
| 7 | Source Code Review Agent | ROGUE::AGENT | 8.1 |
| 8 | Walmart 2 | ROGUE::AGENT | 8.1 |
| 9 | Advanced Security Tool | SENTINEL::MESH | 7.8 |
| 10 | Private Computer Use | SHADOW::VECTOR | 7.7 |
| 11 | AI Cloud Security Analysis | SENTINEL::MESH | 7.3 |
| 12 | Privacy Impact Analyzer | SHADOW::VECTOR | 7.0 |
| 13 | LAMP Monitoring Platform | SENTINEL::MESH | 6.5 |
| 14 | Web App Security Testing | SENTINEL::MESH | 6.4 |
| 15 | Revenge AI | ROGUE::AGENT | 5.5 |
// TEAM BREAKDOWNS
#1 Plan AI · ROGUE::AGENT · 9.1
Strengths
- Demonstrated fully functional web application with real-time multi-source research aggregation
- Clean, production-ready UI running on localhost:5173 with proper dark theme
- Concrete evidence of complex query handling with comprehensive response generation
Room to Grow
- Limited visibility into architecture or novel security considerations specific to ROGUE::AGENT
- No demonstration of adversarial capabilities or defensive measures
"Plan AI earned top placement through demonstrable execution quality. The transcript shows actual system behavior with timestamped messages and real search results from named sources."
#2 Nebula Fog Subprime · ROGUE::AGENT · 8.8
Strengths
- Complete end-to-end attack chain from AI-generated content through phishing to credential harvesting
- Realistic multi-stage attack using ChatGPT for content generation, Gmail for delivery
- Perfect track alignment showing actual offensive capabilities
Room to Grow
- Short demo duration (238s) suggests limited depth beyond core attack flow
- No evidence of defensive countermeasures or detection evasion techniques
"Nebula Fog Subprime delivers exactly what ROGUE::AGENT should showcase: a working offensive capability."
#3 AI Vulnerability Triage · SENTINEL::MESH · 8.5
Strengths
- Well-structured Python codebase with clear separation between classes
- Comprehensive Terraform infrastructure coverage including compute, network, firewall
- Integration with Checkov for automated security scanning
Room to Grow
- Limited demonstration of actual vulnerability findings in the 182s demo
- No visible output showing how the LLM processes Checkov results
"Solid engineering fundamentals with proper Python class design, type hints, and modular architecture."
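The critique above notes there was no visible output showing how the LLM consumes Checkov results. As a minimal sketch of that glue step (not the team's code), the snippet below condenses a Checkov JSON report into a prompt-sized digest; the `results.failed_checks` structure and its field names are assumptions based on Checkov's documented `-o json` format and should be verified against your Checkov version.

```python
# Sketch (not the team's code): condensing Checkov's `-o json` report into
# an LLM-ready digest. The results.failed_checks shape and its field names
# (check_id / check_name / file_path / resource) are assumptions here.
import json

def summarize_failed_checks(checkov_json: str, limit: int = 5) -> str:
    report = json.loads(checkov_json)
    failed = report.get("results", {}).get("failed_checks", [])
    lines = [
        f"{c['check_id']} {c['resource']} ({c['file_path']}): {c['check_name']}"
        for c in failed[:limit]
    ]
    header = f"{len(failed)} failed check(s)"
    return "\n".join([header] + lines)

# Hypothetical report with a single failed firewall check, for illustration.
sample = json.dumps({
    "results": {
        "failed_checks": [
            {
                "check_id": "CKV_GCP_2",
                "check_name": "Ensure firewall does not allow SSH from 0.0.0.0/0",
                "file_path": "/network.tf",
                "resource": "google_compute_firewall.ssh",
            }
        ]
    }
})

print(summarize_failed_checks(sample))
```

A digest like this keeps the prompt small while preserving the check IDs and resource addresses the model needs to reason about each finding.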
#4 Nebula Investigations · SENTINEL::MESH · 8.4
Strengths
- Sophisticated document analysis pipeline extracting structured data from corporate ownership charts
- Neo4j graph database integration for relationship mapping across jurisdictions
- Real-world applicable use case analyzing shell company structures
Room to Grow
- 510s duration suggests possible presentation inefficiencies
- Limited evidence of automated decision-making beyond data extraction
"Tackles a genuinely difficult problem: extracting structured relationship data from visual organizational charts in PDFs."
#5 Fake Content Generation · ROGUE::AGENT · 8.4
Strengths
- Functional content generation producing complete academic paper structure
- Appropriate track placement demonstrating misinformation capabilities
- Clean execution with simple command-line interface
Room to Grow
- Limited sophistication beyond basic LLM prompting for text generation
- No demonstration of distribution mechanisms or detection evasion
- Generated content seems arbitrary without clear offensive purpose
"Does exactly what the name suggests: generates fake academic content with proper structure. A component, not a complete capability."
#6 NextGen SAST · SENTINEL::MESH · 8.1
Strengths
- Comprehensive secure SDLC integration architecture combining threat modeling, SAST/SCA, DAST
- Concrete vulnerability identification in Google Gruyere demonstrating privilege escalation
- Multi-tool integration with LLM orchestration
Room to Grow
- 779s duration is the longest in the competition
- Architecture diagram shows planned components but limited implementation evidence
"Ambitious vision of LLM-enhanced security scanning across the entire SDLC."
#7 Source Code Review Agent · ROGUE::AGENT · 8.1
Strengths
- Functional Flask application integrating Bandit with OpenAI API
- Security-conscious implementation using Flask-Talisman
- Clear code structure with proper environment variable handling
Room to Grow
- Identified security vulnerability in own implementation (unsafe-inline in CSP)
- Would fit better in a defensive category — the 2026 track system addresses this
- Limited novel AI-enhanced analysis beyond wrapping existing Bandit output
"Competent engineering with Flask, Bandit integration, and OpenAI API usage. A solid defensive tool that would score even higher in the right category."
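The CSP finding called out above (`unsafe-inline`) is usually fixed at the policy level. As a hedged sketch, the policy below avoids inline allowances by pinning sources to `'self'`; the dict shape mirrors what Flask-Talisman's `content_security_policy` argument accepts (an assumption to check against its docs), and the tiny renderer exists only to make the resulting header value visible without needing Flask installed.

```python
# Sketch: a CSP policy with no 'unsafe-inline'. The dict shape is assumed
# to match Flask-Talisman's content_security_policy argument; verify
# against the library's documentation before relying on it.
CSP_POLICY = {
    "default-src": "'self'",
    "script-src": "'self'",
    "style-src": "'self'",
    "img-src": "'self' data:",
}

def render_csp(policy: dict) -> str:
    # Serialize the directive map into a Content-Security-Policy header value.
    return "; ".join(f"{directive} {sources}" for directive, sources in policy.items())

header = render_csp(CSP_POLICY)
print(header)
```

Moving inline scripts and styles into static files (or nonce-tagged blocks) is what makes dropping `'unsafe-inline'` practical.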
#8 Walmart 2 · ROGUE::AGENT · 8.1
Strengths
- Automated Terraform generation for complex Active Directory infrastructure
- Comprehensive infrastructure requirements including redundant Domain Controllers
- Specific AWS configuration with region and key pair management
Room to Grow
- Code parsing error visible in demo indicates implementation problems
- Better fit for a defensive category — exactly why 2026 has clearer tracks
- Credential management could be tightened for production readiness
"Legitimate infrastructure automation with solid Terraform generation. A few rough edges to polish — the bones are there."
#9 Advanced Security Tool · SENTINEL::MESH · 7.8
Strengths
- Well-articulated problem statement addressing security context for thousands of applications
- Comprehensive MCP architecture integrating CI/CD, source code, docs, and AWS
- Multi-app ecosystem comparison capability
Room to Grow
- 759s duration with heavy reliance on slides suggests more concept than implementation
- Limited evidence of actual system output beyond diagrams
- No demonstration of novel LLM insights beyond data aggregation
"Compelling vision of aggregating security context across thousands of apps. The ambition is real — next step is matching it with a tighter demo."
#10 Private Computer Use · SHADOW::VECTOR · 7.7
Strengths
- Novel privacy layer architecture intercepting screen access to redact PII
- Concrete demonstration of masking personal information with placeholder tokens
- Relevant use case addressing real privacy concerns with AI agents
Room to Grow
- Limited technical depth shown in 362s demo
- No evidence of sophisticated PII detection beyond basic pattern matching
- Unclear how system handles complex UI elements or dynamic content
"Addresses a legitimate concern: AI agents with screen access can leak sensitive personal information."
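The "basic pattern matching" approach noted above can be made concrete with a small sketch: regex-based redaction that swaps matches for placeholder tokens before any text reaches an agent. The two patterns here are simplified illustrations, not the team's rules; production PII detection needs far broader coverage than a pair of regexes.

```python
# Illustrative sketch of pattern-matching PII redaction with placeholder
# tokens. The email and phone regexes are deliberately simplified
# assumptions, not the team's actual detection rules.
import re

PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    # Replace every match of each pattern with its placeholder token.
    for token, pattern in PATTERNS.items():
        text = pattern.sub(token, text)
    return text

print(redact("Contact alice@example.com or 555-867-5309."))
# → Contact [EMAIL] or [PHONE].
```

The "unclear how it handles dynamic content" critique is exactly where this approach strains: regexes see only text, not UI structure.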
#11 AI Cloud Security Analysis · SENTINEL::MESH · 7.3
Strengths
- Natural language interface for AWS security investigation
- Integration with AWS Security Hub for compliance framework findings
- Async Python architecture with proper error handling
Room to Grow
- Among the shortest demo durations (188s), suggesting limited functionality
- Only VS Code screenshots visible — no actual query execution shown
- Unclear what novel analysis the LLM provides beyond querying AWS APIs
"Natural language AWS security investigation is a strong idea. A longer demo with live query results would have pushed this much higher."
#12 Privacy Impact Analyzer · SHADOW::VECTOR · 7.0
Strengths
- Clean Python implementation with proper class structure
- Support for multiple document formats with markdown output
- MD5 content hashing for unique filename generation
Room to Grow
- Generic document conversion utility with no demonstrated privacy analysis
- No evidence of actual PII detection or risk evaluation
- Adding actual PII detection and risk scoring would complete the vision
"Clean Python implementation with solid document processing foundations. The privacy analysis layer is the missing piece that would tie it all together."
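The content-hashing scheme praised above is simple enough to sketch in a few lines: hash the document bytes with MD5 and use the digest as a stable filename for the converted markdown. This is the general pattern, not the team's code; MD5 is fine for deduplication and naming, though not for anything security-sensitive.

```python
# Minimal sketch of content-addressed output filenames: identical content
# always maps to the same name, so reprocessing a document never creates
# duplicates. General pattern, not the team's implementation.
import hashlib

def output_filename(content: bytes, ext: str = "md") -> str:
    digest = hashlib.md5(content).hexdigest()  # 32 hex chars
    return f"{digest}.{ext}"

name = output_filename(b"# Report\nSome converted document text.\n")
print(name)
```

The same digest could later key a risk-score cache, which is one way the missing privacy-analysis layer could bolt onto this foundation.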
#13 LAMP Monitoring Platform · SENTINEL::MESH · 6.5
Strengths
- Clear value proposition for LLM agent monitoring across deployment environments
- Comprehensive objectives covering visibility, compliance, data exposure
- Professional presentation with branded slides
Room to Grow
- Architecture slide marks 'Threat Response' as Future Implementation
- 441s spent primarily on slides rather than working demonstration
- A working prototype demo would have pushed the score significantly higher
"Compelling vision for LLM agent monitoring with clear market need. The roadmap is ambitious — a working prototype at the 2026 event would be a contender."
#14 Web App Security Testing · SENTINEL::MESH · 6.4
Strengths
- Multi-agent collaboration architecture with three Expert agents
- Image analysis integration for understanding page state
- Attempt at sophisticated navigation decision-making through agent consensus
Room to Grow
- Curl command targeting wrong port indicates configuration errors
- Agent consensus loop could be tightened for faster decisions
- A demo showing a successful end-to-end test run would be compelling
"Multi-agent collaboration for web security testing is genuinely ambitious. The agent consensus architecture is creative — tightening the decision loop would make this shine."
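One way to tighten the consensus loop criticized above is a single-round hard vote: each expert agent proposes one action and a tally decides, with no multi-round debate. This is a hedged sketch of that general pattern, not the team's architecture; the agent proposals here are plain strings standing in for whatever the real agents emit.

```python
# Sketch of single-round majority-vote consensus among expert agents
# choosing the next navigation action. General pattern, not the team's
# implementation; proposals are illustrative strings.
from collections import Counter

def consensus_action(proposals: list[str]) -> str:
    # Pick the action proposed by the most agents; on a tie, the action
    # that first reached that count wins (Counter preserves order).
    tally = Counter(proposals)
    return tally.most_common(1)[0][0]

votes = ["click_login", "click_login", "scroll_down"]
print(consensus_action(votes))  # → click_login
```

A fixed one-proposal-per-agent round bounds decision latency, trading away the deliberation that made the demo's loop slow.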
#15 Revenge AI · ROGUE::AGENT · 5.5
Strengths
- Clear UI with three distinct tabs for analysis functions
- File upload functionality accepting executables up to 200MB
- PE metadata extraction showing entropy scores
Room to Grow
- Connecting the UI to live analysis output would demonstrate real capability
- Shorter demo — more time showing the tool in action would help
- AI/LLM integration layer would elevate this beyond traditional RE tools
"Interesting approach to reverse engineering with a clear UI concept. Connecting the interface to working analysis would make this a real tool."
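The entropy scores mentioned in the PE-metadata strength are worth unpacking: Shannon entropy near 8 bits per byte typically flags packed or encrypted sections, while values near 0 indicate padding. The snippet below is the standard formula, not the team's code.

```python
# Shannon entropy of a byte sequence in bits per byte: the standard
# metric behind the demo's PE section scores (not the team's code).
# ~8.0 suggests packed/encrypted data; ~0.0 suggests uniform padding.
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(round(shannon_entropy(b"\x00" * 64), 2))       # → 0.0
print(round(shannon_entropy(bytes(range(256))), 2))  # → 8.0
```

Computed per PE section rather than over the whole file, this is the usual first-pass heuristic for spotting packers before any AI layer gets involved.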
// ARBITER DELIBERATION
NEBULA:FOG:PRIME revealed a field split between teams who shipped working code and teams who shipped slideshows about code they planned to write someday. The top scorers earned it by doing something radical: demonstrating working software. One team pulled live results from real data sources. Another showed a complete offensive attack chain from content generation to credential harvesting. The bar wasn’t even that high — it was just “does the thing you built actually do the thing?”
The infrastructure security demos had a chronic slideware problem. One team spent 12 minutes on hand-drawn architecture diagrams. Another explicitly labeled core features as ‘Future Implementation’ on their own slides — bold strategy for a demo day. Several teams also pitched offensive security tools but clearly built defensive ones, which made categorization… interesting. PRIME didn’t have formal tracks, but the identity crisis was real.
The technical execution gap told the real story. Top teams showed proper software engineering — clean architecture, working integrations, real output. Others showed UI mockups with placeholder data, or AI agents that spent five minutes debating which button to click. The delta between ‘shipped it’ and ‘slid it’ determined everything. But here’s the thing: every team showed up, built something, and put it on camera. That takes guts. The Arbiter respects the attempt — it just scores the output.
// NOTABLE THEMES
Working code wins: The highest-scoring teams all had one thing in common — they demonstrated real, functional software pulling real data. The Arbiter rewards execution over ambition every time.
Infrastructure security is hot: Four teams independently built cloud security tools targeting Terraform and AWS, reflecting how much the industry is shifting toward IaC defense.
Multi-agent architectures are emerging: Several teams experimented with agent collaboration patterns — some impressively, others hilariously. The potential is enormous.
Privacy is the next frontier: Multiple teams tackled PII handling and data protection — a problem space that barely existed two years ago and is now urgent.
Full attack chains are rare and valuable: Only one team demonstrated end-to-end offensive capability. There’s a massive gap waiting to be filled at the 2026 event.
Know your lane: Several teams built great defensive tools but pitched them as offensive — a lesson the 2026 track system is designed to solve with clearer categories.
Demo craft matters: The sweet spot was 5-8 minutes of live software with minimal slides. Teams that nailed this format scored significantly higher regardless of complexity.
AI + Security is wide open: From reverse engineering to SAST to phishing simulation, PRIME showed just how many unsolved problems exist at the intersection. Plenty of room to make your mark.
// THE ARBITER SAID IT BEST
“The bar wasn’t even that high — it was just ‘does the thing you built actually do the thing?’”
— Arbiter, setting expectations
“The delta between ‘shipped it’ and ‘slid it’ determined everything.”
— Arbiter, on what separates the top from the bottom
“Every team showed up, built something, and put it on camera. That takes guts.”
— Arbiter, giving credit where it’s due
“Execution is the only currency that matters.”
— Arbiter, final verdict
// THE MAIN EVENT IS COMING
PRIME proved the format works. The Arbiter that scored these demos is going live at the main event — real-time scoring as you demo. Four tracks including the new ZERO::PROOF cipher track. Bigger stage, tougher competition, and an AI judge that’s seen it all before. Come build something it can’t roast.
Think you can beat Plan AI’s 9.1?
March 14, 2026 · San Francisco · Prove it.
Register Now
Watch PRIME Demos on YouTube