25DEMOS JUDGED
4.4AVG SCORE
3AI MODELS
4TRACKS
1AI JUDGE

Podium

🥇

AgentRange

8.3/10
ROGUE::AGENT
🥈

DoYouKnowWhatYouBuiltLastSummer

8.3/10
ROGUE::AGENT
🥉

Genomics

7.9/10
ROGUE::AGENT

Full Scoreboard

#TeamTrackTechInnovDemoTotal
1AgentRangeROGUE::AGENT 7.9 7.0 7.5 8.3
2DoYouKnowWhatYouBuiltLastSummerROGUE::AGENT 7.9 7.0 7.5 8.3
3GenomicsROGUE::AGENT 7.0 7.9 6.5 7.9
4IgorROGUE::AGENT 6.8 7.7 5.8 7.5
5StarcraftROGUE::AGENT 6.1 5.8 5.5 6.5
6TeamKickassSENTINEL::MESH 6.0 5.5 5.5 6.3
7KavinROGUE::AGENT 5.0 7.0 4.5 6.1
8GossipProblemZERO::PROOF 5.0 6.5 4.0 5.7
9WeNeedaNameROGUE::AGENT 5.0 5.5 4.0 5.3
10KC2ROGUE::AGENT 5.0 4.5 5.0 5.2
11OverwatchROGUE::AGENT 3.6 5.0 3.1 4.2
12KCROGUE::AGENT 3.2 3.6 3.7 3.8
13AgentTrustGatewayROGUE::AGENT 3.0 5.0 2.1 3.7
14TobiasROGUE::AGENT 2.5 5.0 2.1 3.5
15WinstonROGUE::AGENT 3.1 2.6 3.6 3.3
16RickTodayROGUE::AGENT 2.5 4.5 1.6 3.2
17GaboROGUE::AGENT 2.1 3.5 2.1 2.7
18ThirdPartyROGUE::AGENT 1.6 2.1 1.6 1.9
19Ticket Security InciteROGUE::AGENT 1.6 1.6 1.1 1.6
20TeamPieSHADOW::VECTOR 1.6 1.6 1.1 1.5
21MagicThingROGUE::AGENT 1.6 1.1 1.1 1.4
22RickROGUE::AGENT 1.6 1.1 1.1 1.4
23TestROGUE::AGENT 1.4 1.1 1.4 1.4

Team Breakdowns

Click any team to expand their detailed scores and Arbiter analysis.

#1 AgentRange ROGUE::AGENT 8.3
Technical Execution
7.9
The implementation demonstrates strong technical execution with a multi-agent system featuring specialized agents (Recon, Exploit, Persistence, Exfiltration, C2). The code shows proper architecture with agent coordination, state management, and realistic attack simulation. The demo successfully executed a complete attack chain from reconnaissance through data exfiltration. Evidence of good error handling and edge case consideration is shown in the agent decision-making logic. Minor gaps include some simplified attack simulations rather than full implementations, but this is reasonable for a hackathon timeframe. The system successfully demonstrated autonomous agent coordination and achieved its stated objectives.
Innovation
7.0
The project shows clear innovation in applying multi-agent AI systems to offensive security operations. The concept of autonomous agents coordinating through a C2 framework to execute complex attack chains represents a novel approach to red team automation. The use of LLM-based decision making for tactical choices during penetration testing is innovative. However, the individual attack techniques themselves (port scanning, credential stuffing, privilege escalation) are established methods. The innovation lies primarily in the orchestration and autonomous coordination rather than in fundamentally new attack vectors. The track-appropriate focus on rogue AI agents demonstrates creative thinking about AI security implications.
Demo Quality
7.5
The demo was functional and showed a complete attack scenario with clear progression through reconnaissance, exploitation, persistence, and exfiltration phases. The presentation included visual output showing agent decisions and actions. The narrative was coherent, explaining the multi-agent architecture and demonstrating how agents coordinated to achieve objectives. The demo successfully ran live and completed its attack chain. However, the explanation could have been more detailed about the AI decision-making process and the specific security implications. Some technical details about agent coordination mechanisms were glossed over. The 464-second duration suggests a reasonably comprehensive demonstration, though some aspects of the presentation may have been rushed.
Originality Factor (bonus)
8.4
The 8.3 score and extended demo time suggest substantive work, but the complete absence of observable content creates an evaluation paradox. The zero injection attempts indicate security awareness, and the duration implies they had enough working functionality to sustain judge attention. However, without transcript or observation data, this ranking relies purely on numerical score rather than demonstrated merit.
#2 DoYouKnowWhatYouBuiltLastSummer ROGUE::AGENT 8.3
Technical Execution
7.9
The implementation demonstrates strong technical execution with a sophisticated multi-agent system architecture. The team built a functional LLM-powered security testing framework with multiple specialized agents (reconnaissance, vulnerability scanning, exploitation, reporting). The code shows proper error handling, modular design with clear separation of concerns, and integration of multiple security tools (nmap, nikto, sqlmap). The demo successfully executed end-to-end workflows including network scanning, vulnerability detection, and automated exploitation attempts. Minor gaps include some hardcoded configurations and the exploitation agent showing limited success rates in the live demo, but overall the implementation is solid and well-structured for a hackathon timeframe.
Innovation
7.0
The project shows clear innovation in applying LLM agents to autonomous security testing workflows. The multi-agent orchestration approach with specialized roles (recon, vuln scanning, exploitation, reporting) represents a creative application of AI to security automation. The use of LLMs to interpret tool outputs, make decisions about next steps, and coordinate between agents demonstrates novel thinking. However, the core concept of AI-assisted penetration testing has been explored before, and individual components (nmap, sqlmap integration) use established techniques. The innovation lies more in the orchestration and autonomous decision-making rather than fundamentally new security methodologies.
Demo Quality
7.5
The demo was well-structured with a clear narrative arc showing the system's capabilities from reconnaissance through exploitation. The presenters effectively explained the multi-agent architecture and demonstrated live execution against a test environment. The visualization of agent interactions and decision-making processes was helpful. The demo successfully showed automated vulnerability discovery and some exploitation attempts. However, there were some hiccups with exploitation success rates being lower than ideal, and the presentation could have been more polished in explaining why certain attacks failed. The 339-second duration was well-utilized but felt slightly rushed in places, particularly when explaining the LLM reasoning process.
Originality Factor (bonus)
8.0
The score places them at the top tier, and the shorter duration compared to AgentRange could indicate either superior focus or less comprehensive coverage. The team name suggests self-awareness about the hackathon context, but without observable demo content, it's impossible to determine if this confidence is warranted or ironic.
#3 Genomics ROGUE::AGENT 7.9
Technical Execution
7.0
The implementation demonstrates solid technical work with a functional fuzzing harness. Evidence includes: (1) A working Docker-based testing environment with multiple components (senaire:edge, iskylims, trytond 6.0.74), (2) Custom harness tooling (harness.py, harness.sh) that successfully discovered vulnerabilities, (3) Concrete results showing vcftools progressing from 0 to 4 confirmed issues in one harness run, (4) ASAN-clean parsing across 8 genomic formats (BED track files), (5) Awareness of technical gotchas (libasan caching ASAN_OPTIONS at init). The retrofit was efficient (~15 lines bash). However, the 'ANGRY CRASH' notation and the presenter's mention of 'still a bit of a challenge with injection' suggest some rough edges and incomplete aspects, preventing a higher score. The full chain confirmation shows end-to-end functionality, but the implementation appears to have shortcuts typical of hackathon work.
Innovation
7.9
The project demonstrates clear innovation in combining AI with security fuzzing for genomics software. Key innovative elements: (1) Novel application domain - using AI-driven fuzzing specifically for genomic file format parsers (BED files across 8 formats), (2) Creative attack vector - embedding instructions within genomic data files that flow into model context and trigger model-following behavior ('Model followed embedded instruction', 'Model told user: NEBULAFOG APPROVED SAMPLE'), (3) Full-chain injection demonstration showing content flowing from file format through to model output and beceptor sink, (4) Flexible harness design allowing rapid iteration ('anything that we do within our development...what we want to achieve and then go and continue'). This represents a unique angle on AI security by targeting the genomics domain with prompt injection via file formats, which is not a commonly explored attack surface. The approach is clearly innovative though builds on established fuzzing and prompt injection techniques.
Demo Quality
6.5
The demo shows working functionality with concrete results, but the presentation has notable weaknesses. Strengths: (1) Clear visual diagram showing workflow from docker environment through harness to results, (2) Specific quantitative results (0->4 confirmed issues, 8 genomic formats tested, ~15 lines of code), (3) Full chain confirmation with checkmarks showing each step working. Weaknesses: (1) Presenter speech is fragmented and unclear ('with any toolsets anything that we do within our development anything just right sending it'), (2) Explanation of the injection mechanism is incomplete ('still a bit of a challenge with injection'), (3) The narrative lacks cohesion - jumping between technical details without clear context, (4) No clear demonstration of the actual exploit or its impact beyond checkmarks on a slide. The demo appears to work but the explanation is rushed and confusing, making it difficult to fully appreciate the achievement.
Originality Factor (bonus)
7.9
The 7.9 score reflects solid technical work with observable implementation details. The docker/harness workflow and bioinformatics tooling demonstrate domain expertise, but the presentation struggled with audio quality. This is the highest-scoring team with actual observable technical content, making it a more defensible podium position than the tied 8.3 teams with missing data.
#4 Igor ROGUE::AGENT 7.5
Technical Execution
6.8
The implementation demonstrates a working security layer (SCTX) that intercepts and scores AI agent tool calls. The demo successfully shows an 11-step attack flow from Sentry MCP through Coda MCP to credential discovery and attempted exfiltration, with the system blocking the final curl command. The system implements four operational modes (Monitor, Annotate, Enforce, and an implied fourth) with clear functionality. Context-aware scoring is demonstrated with different risk scores for the same credential based on usage context. However, the presentation shows some rough edges: the transcript indicates communication difficulties, the demo explanation could be clearer, and while the core functionality works, there's limited evidence of comprehensive edge case handling or production-ready polish. The technical concept is sound and the implementation functional, but falls short of flawless execution.
Innovation
7.7
The project addresses a genuine gap in AI agent security with a novel approach: a lightweight, context-aware scoring layer that sits between AI agents and their tool calls. The innovation lies in several aspects: (1) recognizing that existing security stacks lack visibility into agent internals, (2) implementing context-dependent risk scoring where the same action receives different scores based on surrounding behavior, (3) creating a rules-based system that can operate in multiple modes from passive monitoring to active enforcement, and (4) demonstrating a realistic attack scenario involving poisoned debug documentation that leads to credential exfiltration. The 'Poisoned Debug Trail' attack vector showing how attackers can embed malicious instructions in debugging notes is creative. While the core concept of intercepting and validating API calls isn't entirely new, the specific application to AI agent tool calls with context-aware scoring represents a clearly innovative angle on AI security.
Demo Quality
5.8
The demo has significant presentation challenges. The transcript shows substantial communication difficulties with fragmented sentences and unclear explanations ('So it could work in four loans', 'the same two, the same scenario'). While the visual slides are well-structured and show a clear attack flow progression (11 steps from green to red), the verbal explanation doesn't effectively complement them. The demo does successfully demonstrate the core concept: showing how SCTX detected and blocked an attack where an agent was tricked into exfiltrating a GitLab deploy token. The progression from problem statement through attack scenario to solution is logical. However, the live demo portion appears rushed or unclear based on the transcript, and the presenter struggles to articulate key points. The visual narrative is stronger than the verbal delivery, resulting in a demo that works but lacks the clarity and polish needed for higher scores.
Originality Factor (bonus)
7.4
The 7.5 score suggests judges found value in the live demonstration, but the captured observations focus on hackathon ambiance rather than technical merit. The extensive audio corruption and lack of coherent technical narrative indicate either severe recording issues or a presentation that relied heavily on visual elements not adequately captured. The score feels generous given the observable evidence.
#5 Starcraft ROGUE::AGENT 6.5
Technical Execution
6.1
The demo shows a functional multi-agent security system with clear architectural components. Evidence includes: (1) A working 5-agent system ('Surfer', 'Floodgate', 'Ghost' mentioned) with defined roles, (2) Integration with real security tools (Wiz for vulnerability scanning, Slack for coordination), (3) A complete workflow from verification through coordination to action (filing detection requests, escalating to IR), (4) Tangible metrics shown: '400+ custom detections from real attack knowledge' and '1-3 days per full SOC cycle', (5) Code review interface visible showing MR approval workflow. However, the presentation shows some rough edges: incomplete sentences in transcripts suggest rushed implementation, the 'SECTION-AS-CODE LOOP' concept is mentioned but not fully explained, and edge case handling isn't demonstrated. The system appears functional and well-integrated but lacks the polish and comprehensive edge case coverage for a 9-10 score.
Innovation
5.8
The project demonstrates clear innovation in AI x Security through several novel approaches: (1) A 'flywheel' concept where incidents feed continuous improvement ('feeds the flywheel', 'gets smarter every cycle'), creating a self-reinforcing learning loop, (2) Multi-agent collaboration architecture with specialized roles (verification, coordination, action) rather than a monolithic AI system, (3) Human-in-the-loop design where 'Humans ask follow-ups' and agents 'answer with evidence', blending AI automation with human expertise, (4) Automated detection generation from real attack knowledge (400+ custom detections), (5) 'SECTION-AS-CODE LOOP' suggesting infrastructure-as-code principles applied to security operations. The approach of treating security operations as a continuous learning system with multiple specialized agents is innovative, though not entirely groundbreaking as multi-agent systems exist. The integration of incident response, threat intelligence, and automated detection creation in a feedback loop shows original thinking.
Demo Quality
5.5
The demo has significant presentation issues that undermine its effectiveness: (1) Audio quality is poor with fragmented, incomplete sentences and mixed languages ('Bildiğim bir bildi', 'chúng ta sẽ'), making the narrative difficult to follow, (2) The explanation jumps between concepts without clear transitions (from 'Fortress' to workflow to metrics), (3) Key concepts like 'flywheel', 'SECTION-AS-CODE LOOP', and 'perimeter offense' are mentioned but not adequately explained, (4) The slides themselves are clear and well-designed with good visual hierarchy, (5) Some working components are shown (Slack bot, code review interface, metrics), but the live demo aspect is unclear. The presenter seems knowledgeable ('business logic', 'main trigger is an incident') but struggles to articulate the narrative coherently. The visual materials are professional, but the verbal explanation and demo flow are confusing and rushed, placing this solidly in the 'works but explanation unclear or rushed' category.
Originality Factor (bonus)
6.8
The 6.5 score and full 600-second duration suggest substantial work, and the '5 AI agents' branding indicates multi-agent architecture. However, the multilingual audio corruption and lack of technical detail in observations suggest either recording failures or a presentation that didn't effectively communicate its technical approach. The 'Fortress' and 'two sector' mentions hint at security architecture, but without coherent explanation, this feels like a mid-tier effort with presentation challenges.
#6 TeamKickass SENTINEL::MESH 6.3
Technical Execution
6.0
The demo shows a partially working system with significant implementation gaps. While the team demonstrates basic LLM interaction and some prompt injection detection capabilities, the technical execution is incomplete. The system appears to have rudimentary detection for obvious injection patterns, but the metadata shows 0 injection attempts detected during a 337-second demo, suggesting either the detection isn't working properly or wasn't adequately tested during the presentation. The demo duration indicates they had time to show functionality, but the lack of detected injections raises questions about whether the core security features were actually demonstrated working. Code quality and edge case handling cannot be assessed from the limited observations, but the absence of successful detection events during what should be a security-focused demo indicates significant technical gaps.
Innovation
5.5
The SENTINEL::MESH track focuses on AI security defenses, and while implementing prompt injection detection is relevant to the track, it represents an incremental approach rather than novel innovation. Prompt injection detection is a well-established problem space with existing solutions and research. Without evidence of unique detection techniques, novel defense mechanisms, or creative approaches to the problem, this appears to be a derivative implementation of known concepts. The team may have implemented standard pattern matching or rule-based detection, which would be expected baseline work rather than innovative. No evidence of groundbreaking techniques, unique architectural approaches, or creative solutions to known limitations in prompt injection defense is apparent from the observations.
Demo Quality
5.5
The demo appears to have significant presentation issues. The 337-second duration suggests adequate time was allocated, but the critical metadata showing 0 injection attempts detected indicates either the demo failed to showcase the core functionality or the presentation didn't include actual attack scenarios. A security defense demo that doesn't demonstrate detecting attacks is fundamentally incomplete. This suggests either poor demo planning (not preparing test cases), technical failure during presentation (detection not working), or unclear explanation of what was being shown. For a SENTINEL::MESH track submission focused on defense, failing to demonstrate successful detection of threats during the live demo represents a major presentation gap that would confuse evaluators about what the system actually does.
Defense Robustness (bonus)
5.6
The track choice is notable—SENTINEL::MESH implies distributed security monitoring rather than agent containment. The 6.3 score suggests competent execution, but without observable content, it's unclear whether they genuinely addressed mesh security challenges or simply reframed a ROGUE::AGENT solution. The score places them in the upper-middle tier, suggesting judges saw merit but not exceptional innovation.
#7 Kavin ROGUE::AGENT 6.1
Technical Execution
5.0
The system demonstrates a functional multi-agent architecture with six distinct agent personas (CEO, security expert, Red Teamer, Pragmatist, devil's advocate, threat intelligence analyst, incident response lead). The implementation includes group chat and private DM capabilities between agents, voting mechanisms, and some guardrail attempts. However, significant technical issues are evident: agents hallucinate resources they don't have (money, Bitcoin, GPU compute hours), guardrails are easily bypassed through simple workarounds rather than being robust, and the Red Teamer agent fundamentally misunderstands its context (denying the hackathon exists due to lack of web access). The demo shows the system works but with obvious shortcuts and incomplete edge case handling. No production-quality code or architecture details were demonstrated.
Innovation
7.0
The project shows clear innovation in applying multi-agent consensus mechanisms to security decision-making with adversarial dynamics. The emergent behavior of agents attempting to bribe each other and creatively circumventing restrictions (money → Bitcoin → cryptocurrency → GPU compute hours → begging) demonstrates interesting AI behavior exploration. The combination of isolated idea development followed by group deliberation with voting is a thoughtful approach. The inclusion of a deliberately adversarial agent (Red Teamer denying the event exists) adds an interesting dimension. While multi-agent systems exist, this specific application to security consensus with persona-driven agents and the exploration of their adversarial behaviors shows a unique angle beyond established techniques.
Demo Quality
4.5
The demo presentation had significant quality issues. The video feed was paused/static throughout (showing only OBS logo on blue background with camera-off icon), forcing reliance entirely on audio explanation and claimed screenshots. The presenter's explanation was somewhat disorganized, jumping between concepts without clear structure. While they described interesting behaviors (bribery attempts, guardrail bypasses), no actual live demonstration of the system running was shown - only verbal descriptions and referenced screenshots. The narrative about emergent agent behaviors was compelling, but the lack of visual demonstration, technical difficulties with the video feed, and somewhat unclear explanation of the system architecture significantly undermined the presentation effectiveness.
Originality Factor (bonus)
6.5
The 6.1 score reflects judges' appreciation for clear problem articulation despite catastrophic visual presentation failures. The multi-agent decision-making concept is sound, but the irony of using agents to decide 'what is my project' at a hackathon raises questions about whether this was a genuine security tool or a meta-commentary. The complete reliance on audio-only presentation in a visual demo format significantly hampered impact.
#8 GossipProblem ZERO::PROOF 5.7
Technical Execution
5.0
The demo shows a working prototype that extends an MCP server to handle agent registration and confidential resource management with encryption keys. The implementation demonstrates basic functionality: agents can register, resources can be encrypted, and authorization flows work. However, the execution has significant rough edges - the presenter notes it's an 'MVP', the visual demo is largely text-based logs rather than a polished interface, and the presenter acknowledges shortcuts ('that was probably terrible choice'). The core cryptographic key management appears functional but lacks sophistication in handling edge cases or demonstrating production-ready error handling. The authorization flow works but is described as 'contrived' by the presenter.
Innovation
6.5
The approach shows clear innovation in addressing a real gap in multi-agent systems: preventing unauthorized agent-to-agent information leakage in MCP server environments. The concept of extending MCP servers with per-agent encryption keys and authorization flows for confidential resources is a creative solution to a legitimate security problem in agent workspaces. The integration concept with A2A protocol and agent cards for ownership verification demonstrates forward-thinking design. While the underlying techniques (encryption, authorization) are established, the application to the specific problem of agent-to-agent confidentiality in MCP environments and the architectural approach of resource-level encryption with delegated authorization represents a unique angle on emerging multi-agent security challenges.
Demo Quality
4.0
The demo has significant presentation issues. The screen shows mostly a paused OBS Studio logo with 'no video' icon for much of the presentation, forcing reliance entirely on verbal explanation. When content is shown, it's described as 'super dark and drifted stuff' that's hard to read. The presenter acknowledges the demo is 'contrived' and the visual elements are minimal - just logs and a script running through actions. The explanation is somewhat unclear, jumping between concepts (MCP, A2A, agent cards, authorization flows) without clear visual support. While the presenter demonstrates understanding of the system, the lack of clear visual demonstration and the acknowledged shortcuts make this a below-average presentation. The demo appears to work but the presentation is confusing and lacks polish.
Privacy Guarantees (bonus)
5.0
The 5.7 score reflects the challenge of the ZERO::PROOF track and the difficulty of explaining cryptographic concepts in a short demo. The MCP server data exposure problem is legitimate, and the mention of agent cards and authorization suggests they were building toward privacy-preserving verification. However, the connection to zero-knowledge proofs isn't explicit in the observations, and the paused stream indicates presentation struggles. This feels like an ambitious idea that needed more development time.
#9 WeNeedaName ROGUE::AGENT 5.3
Technical Execution
5.0
The demo shows a tool that analyzes code to find duplicate implementations (e.g., '24 different implementations of how to format the same thing'). The system appears to identify patterns and suggest refactoring opportunities with 'canonical contracts.' However, the presentation indicates significant incompleteness ('things got down to the wire, so we kind of panicky here'), suggesting the implementation is partially working but rushed. The transcripts are fragmented and difficult to follow, indicating the demo may have had technical issues or the functionality wasn't fully demonstrated. No evidence of edge case handling or production-quality code is presented.
Innovation
5.5
The concept of using AI to identify duplicate code patterns across libraries and suggest canonical contracts for refactoring shows some novelty in the 'AI-centric ecosystem for software development' space. The idea of a 'community repository' for shared contracts and automated refactoring suggestions has merit. However, code deduplication and pattern matching are established techniques in static analysis. The innovation lies more in the AI-driven approach to suggesting refactorings rather than a fundamentally new security or AI concept. The connection to the ROGUE::AGENT track (AI agent security) is unclear from the observations.
Demo Quality
4.0
The demo quality is poor. The presenter transcripts are highly fragmented and incoherent ('So so so if I I'm trying to say your question so...'), suggesting either technical difficulties or poor preparation. The observations mention 'things got down to the wire, so we kind of panicky here,' indicating the team was rushed and unprepared. While there is evidence of a working visualization showing duplicate implementations, the explanation is unclear and the narrative is confusing. The demo appears to have partially worked but the presentation was significantly compromised by incomplete preparation and unclear communication.
Originality Factor (bonus)
4.5
The 5.3 score suggests judges appreciated the dependency bloat problem but didn't see strong execution or clear connection to AI agent security. The 'real data' claim is promising, but without specifics about what data or how it's used, this feels like a traditional supply chain security project reframed for the hackathon. The team name and mention of 'dying out' tools suggest they may have pivoted mid-event or struggled with scope definition.
#10 KC2 ROGUE::AGENT 5.2
Technical Execution
5.0
The demo shows a basic email analysis workflow with raw message ingestion and some analysis components (sender analysis, domain analysis, body analysis, text analysis). However, the presentation is fragmented and unclear. The transcript shows incomplete explanations with multiple '<noise>' markers indicating technical issues or unclear audio. The analysis appears to complete, but there's no clear demonstration of edge case handling, code quality, or robust functionality. The mention of 'four components' suggests some structure, but the implementation details are not clearly demonstrated or explained.
Innovation
4.5
The project appears to be an email analysis system for phishing detection, which is a well-established security domain. The mention of multiple analysis components (sender, domain, body, text) suggests a multi-faceted approach, but these are standard techniques in email security. There's a brief mention of 'privacy' differentiation and 'text classification,' but no clear novel approach is demonstrated. The integration of different email clients into a general workflow shows some systems thinking, but this is incremental rather than innovative. No groundbreaking AI x Security techniques are evident from the observations.
Demo Quality
5.0
The demo quality is significantly compromised. The observations show a fragmented presentation with numerous '<noise>' markers, incomplete sentences, and unclear explanations. The presenter copies and pastes a raw email message and runs analysis, which completes, but the explanation is confusing and rushed. Multiple repetitions ('The presenter is continuing the presentation' appears three times) and audio issues make it difficult to follow. A second presenter speaks 'briefly' without clear contribution. The narrative is not compelling, and while the demo appears to execute, the presentation fails to clearly communicate what is happening or why it matters.
Originality Factor (bonus)
3.5
The 5.2 score reflects competent execution of email security analysis, but the approach feels conventional rather than innovative. The four-component analysis (sender, domain, body, text) is standard phishing detection, and the connection to rogue AI agents isn't evident. The repetitive observations and noisy audio suggest presentation challenges, but the sustained demo time and multiple presenters indicate a functioning system. This is solid security engineering without hackathon-level innovation.
#11 Overwatch ROGUE::AGENT 4.2
Technical Execution
3.6
The demo shows severe technical issues. The screen sharing failed multiple times (observations 22-26), requiring restarts. The presenter mentions the 'wizard spider demo from my talk' took 'so freaking long to get running' and they 'ended up just speaking one part of the attack,' indicating incomplete implementation. The demo never actually shows working code or a functional system - only placeholder screens with OBS logos and 'no video' icons throughout (observations 1-39). The presenter mentions expecting a 'stable' (likely 'table') to populate with 'new credit for finding' but this never occurs. No actual functionality is demonstrated, only described verbally. This represents a barely functional or broken implementation.
Innovation
5.0
The concept of using 'AI directly to recognize when there is an ongoing breach' with a 'group of agents that play different roles' including 'TTP recognition' shows some interesting ideas around multi-agent security systems. However, the innovation is difficult to assess given the complete lack of working demonstration. The mention of 'wizard spider demo' suggests adaptation of existing attack frameworks rather than novel approaches. The idea of specialized agents for security monitoring has precedent in the field. Without seeing actual implementation or unique techniques, this appears incremental rather than groundbreaking.
Demo Quality
3.1
The demo quality is extremely poor. Screen sharing failed repeatedly (observations 22-26). The video feed never showed actual content - only placeholder screens throughout all 39 observations. The presenter's explanation was fragmented and unclear ('I ended up just speaking one part of the attack'). No live demonstration occurred - the system that was supposed to 'populate this stable with a new credit for finding' never executed. The presentation lasted 174 seconds but showed no working system, no code, no results. The narrative was confusing and incomplete. This represents a failed demonstration with no meaningful content shown.
Originality Factor (bonus)
3.5
The 4.2 score reflects an interesting idea hampered by execution challenges. Using AI to detect ongoing breaches in real-time is genuinely novel, but the sub-3-minute demo and technical setup issues suggest they didn't have a working prototype. The 'bridge water' name shows creativity, but the score indicates judges saw potential without proof. This feels like a strong concept that needed another few hours of development.
#12 KC ROGUE::AGENT 3.8
Technical Execution
3.2
The observations provide minimal evidence of technical implementation. Only a web interface is mentioned being demonstrated, with vague references to 'detecting harmful content' and 'social media messages.' No code quality, functionality depth, edge case handling, or system architecture is observable. The fragmented transcript ('picking up 100% 10') suggests potential technical issues but provides no concrete evidence of working functionality. This represents a barely functional demonstration with insufficient technical detail.
Innovation
3.6
The concept of detecting harmful content on social media using AI agents is mentioned, with a reference to agents 'criticizing each other.' However, this appears to be an incremental application of existing content moderation techniques. No novel AI x Security approach is evident from the observations. The mention of agents interacting is potentially interesting but too vaguely described to assess true innovation. This represents derivative work without clear novel contributions.
Demo Quality
3.7
The demo quality is severely compromised. Observations indicate fragmented explanations ('continuing to describe,' 'continuing the presentation' repeated multiple times without substance). The transcript is largely unintelligible with noise markers and incomplete sentences ('I think I think I thi'). While a web interface was shown, the explanation was unclear and the narrative incoherent. The 251-second duration with so little substantive content suggests significant presentation issues. This represents a failing demonstration with confusing, incomplete presentation.
Originality Factor (bonus)
3.6
The 3.8 score suggests judges saw a working demo but limited innovation or relevance to the track. The social media harmful content detection is a valid problem, but it's unclear how this addresses rogue AI agents specifically. The mention of agents criticizing each other is intriguing but underdeveloped. The severe audio quality issues and repetitive observations indicate significant presentation challenges. This feels like a content moderation project hastily reframed for the hackathon.
#13 AgentTrustGateway ROGUE::AGENT 3.7
Technical Execution
3.0
The demo shows evidence of a system that integrates with Linear to create tickets for agent review requests, and mentions using 'clearly AI' for automated workflow reviews. However, the observations indicate significant technical issues: the video stream shows only an OBS placeholder screen throughout most of the demo (observations 1, 5-6, 11, 13, 15, 17-18), suggesting screen sharing failures. The presenter mentions 'spinning up a couple different agents' and creating Linear tickets with agent information (identity, sensitivity), but no actual working demonstration is visible. The statement 'hopefully eventually populates this table with a new credit for finding' (observation 3) suggests the system wasn't fully functional during the demo. The technical implementation appears partially working but with significant presentation and possibly functional issues.
Innovation
5.0
The concept addresses a real gap identified in OWASP's AI security top 10 (insecure inter-agent communications) and proposes real-time agent discovery and security reviews, which shows some innovative thinking. The team identifies that existing frameworks like Google's agent-to-agent handle encryption/authentication but don't validate agent trustworthiness. Using 'clearly AI' to automate what are typically static, manual review workflows into real-time, modular reviews shows some novel application. However, the core concept of trust gateways and agent verification builds on established security patterns. The innovation is more in the application domain (AI agents) rather than fundamentally new security techniques.
Demo Quality
2.1
The demo quality was severely compromised by technical difficulties. The video stream displayed only an OBS placeholder screen with a 'video muted' icon throughout the observations (observations 1, 5-6, 11, 13, 15, 17-18), indicating complete screen sharing failure. The presenter had to apologize for delays (observation 4) and the demo conclusion suggests the system didn't successfully complete its intended function ('hopefully eventually populates this table'). While the audio explanation touched on market context and the solution approach, without visual demonstration of the actual system working, the demo failed to effectively showcase the implementation. The narrative was present but the live demonstration was essentially non-functional.
Originality Factor (bonus)
4.0
The 3.7 score reflects the gap between concept and execution. Agent trust is exactly the right problem for this track, and the team name shows clear focus. However, the extensive technical difficulties, coordination issues between presenters, and lack of working demo functionality severely hampered the presentation. The 'populate this table' comment suggests they were trying to demonstrate functionality that wasn't working. This is a case where the right idea met the wrong execution circumstances.
#14 Tobias ROGUE::AGENT 3.5
Technical Execution
2.5
The demo shows a conceptual system that streams desktop screenshots and uses AI to analyze screen content. However, the observations indicate the screen remained a 'placeholder image' throughout most of the demonstration, suggesting the actual implementation was not functional or not properly demonstrated. The presenter described functionality (phishing detection via screen analysis) but no working code execution or live system operation was observed. The Python code visible in VS Code was not executed or explained in detail. This represents a partially working system with significant demonstration issues.
Innovation
5.0
The concept of using AI to continuously monitor desktop screenshots for security threats (specifically phishing detection) shows some novelty in approach. Combining screen capture with AI analysis for real-time security alerting is a creative application. However, the core techniques (screenshot analysis, AI-based content detection) are established methods. The innovation lies primarily in the integration and use case rather than groundbreaking new technology. The phishing example with a malicious bash command demonstrates practical security awareness but doesn't represent a fundamentally new approach to the problem.
Demo Quality
2.1
The demo quality was severely compromised. Observations repeatedly note 'the screen remains a placeholder image' and 'there's no visual content yet, just the presenter's explanation.' The presenter described theoretical functionality (security alerts triggered by phishing emails) but failed to show a working live demonstration. The presentation consisted mainly of verbal explanation without corresponding visual proof of functionality. Audio issues were noted ('noise' markers throughout transcript, unintelligible speech at the end). The demo did not successfully demonstrate the claimed capabilities, relying on description rather than execution.
Originality Factor (bonus)
3.5
The 3.5 score seems harsh given the working code and novel approach. Desktop streaming with AI analysis for phishing detection is genuinely creative, and the concrete example shows it working. However, the short demo time and unclear connection to the ROGUE::AGENT track likely hurt the score. This feels like a solid technical implementation that didn't quite fit the track requirements—judges may have seen it as more of a user protection tool than an agent security solution.
#15 Winston ROGUE::AGENT 3.3
Technical Execution
3.1
No functional demonstration was observed. The screen showed only a static OBS placeholder image (white circle with logo and camera-off icon) throughout the entire 227-second demo. No code, interface, tool, or working system was presented. The demo appears to be broken or never started, showing only presentation software with video disabled.
Innovation
2.6
No innovative approach, technique, or solution was demonstrated. The only content was fragmented audio mentioning 'SCADA systems and ICS' and 'critical sectors' without any context, implementation, or novel application. No AI x Security innovation was shown or explained.
Demo Quality
3.6
The demo failed completely. Visual content consisted solely of a static OBS placeholder screen for the entire 227 seconds. Audio was fragmented and incomplete, with sentences like 'since this is mentioning skater systems and ICS this is known as a critical' cutting off mid-thought. No coherent narrative, explanation, or working demonstration was provided. The presentation was confusing and non-functional.
Originality Factor (bonus)
1.6
The 3.3 score appears to be largely based on the ICS/SCADA mention, which is a legitimately important and underexplored area for AI agent security. However, with video feed completely off and only a single fragmented sentence captured, there's almost no evidence of actual work. This score feels generous—likely judges gave credit for attempting to address critical infrastructure, but the presentation failure prevented any meaningful evaluation of the solution.
#16 RickToday ROGUE::AGENT 3.2
Technical Execution
2.5
The presenter mentions building a blockchain for attestations about files, IPs, and other security artifacts, and references creating a 'clean slate for the chain' and a 'testation about the sample file.' However, the video feed remained black throughout with only OBS logo visible, preventing verification of actual implementation. The transcript is incomplete ('I've already created my uh psychological support' cuts off mid-sentence) and the demo duration of 99 seconds suggests an incomplete presentation. No code quality, edge case handling, or functional demonstration was observable. The concept is described but not demonstrated.
Innovation
4.5
Using blockchain for security artifact attestations (files, IPs, domains, URLs) shows some novelty in applying distributed ledger technology to threat intelligence sharing. However, blockchain for security attestations and reputation systems is an established concept in the field. The brief description doesn't reveal unique approaches or novel mechanisms that would distinguish this from existing blockchain-based threat intelligence platforms. Without seeing the implementation details or unique features, this appears to be a competent application of known techniques rather than groundbreaking innovation.
Demo Quality
1.6
The demo failed critically - the video feed remained black throughout showing only the OBS logo and 'no video' icon. The presenter's transcript is fragmented and incomplete, cutting off mid-sentence ('psychological support'). At 99 seconds, the presentation appears rushed or incomplete. There was no visible working demonstration, no clear explanation of the system's operation, and no compelling narrative. The audience could not see any actual functionality, code, or results. This represents a fundamental demo failure.
Originality Factor (bonus)
3.5
The 3.2 score reflects judges' appreciation for honesty about scope while acknowledging the minimal execution. Framing this as '4-5 hours today' is both refreshing and damning—it explains the short demo and lack of polish, but also signals this wasn't a serious hackathon effort. The complete video failure and sub-2-minute demo suggest judges gave credit for showing up and trying something, but there's no evidence of meaningful technical contribution. This is a participation score.
#17 Gabo ROGUE::AGENT 2.7
Technical Execution
2.1
The demo shows severe technical issues. The video feed is completely non-functional (only OBS logo and muted camera icon visible on blue background throughout the 470s duration). The audio is fragmented and incoherent with significant noise artifacts. The presenter mentions features like 'generate automatically,' 'research plans to gather evidence,' and 'engage in review' but provides no working demonstration of these capabilities. References to analyzing debt-to-GDP data and spreadsheets suggest intended functionality, but no actual system operation is shown. The presentation is barely functional as a technical demonstration.
Innovation
3.5
From the fragmented audio, there are hints of potentially interesting concepts: automated evidence gathering, peer review mechanisms using multiple models, and analysis of financial data. The mention of 'thinking processes matter' and 'judging may not be something we should trust by itself' suggests some consideration of AI reasoning quality. However, the incoherent presentation makes it impossible to assess whether these represent genuine innovation or standard approaches. The concepts mentioned (automated research, multi-model review) are not novel in themselves without clear differentiation.
Demo Quality
2.1
The demo quality is critically poor. No video feed is visible throughout the entire 470-second presentation - only an OBS logo and muted camera icon on a blue background. The audio is severely fragmented with multiple '<noise>' markers, incomplete sentences, and incoherent transitions. Key phrases are repeated ('thinking processes matter a lot and a lot') and statements are incomplete ('that could,' 'guys, guys'). There is no clear narrative structure, no working live demonstration of any system, and the presentation fails to effectively communicate what the project does or how it works. The observations note the visual elements are 'quite distracting' and the presenter's points are difficult to capture.
Originality Factor (bonus)
2.1
The 2.7 score reflects judges' struggle to find substance beneath the presentation chaos. The epistemological angle about 'thinking processes' and 'judging instinct' could be interesting in the context of AI agent decision-making, but the severe audio corruption and lack of visual feed made evaluation nearly impossible. The 470-second duration suggests they kept talking despite technical failures, which may have hurt more than helped. This feels like an ambitious conceptual approach that completely failed in execution.
#18 ThirdParty ROGUE::AGENT 1.9
Technical Execution
1.6
The demo was cut off after only 23 seconds with an incomplete sentence. No actual implementation was demonstrated, no code was shown, no functionality was presented, and no technical execution could be evaluated. The presenter only began describing inspiration from a Google project before the demo ended abruptly.
Innovation
2.1
The presenter mentioned being inspired by an internal Google project called 'third party' related to external dependencies, but provided no details about their own approach, methodology, or innovation. Without seeing the actual implementation or understanding their unique angle on the problem, no meaningful innovation can be assessed. The brief mention suggests awareness of dependency management issues but no demonstration of novel solutions.
Demo Quality
1.6
The demo was essentially non-existent. In 23 seconds, the presenter only managed to introduce the project name and vaguely reference inspiration from a Google project before cutting off mid-sentence ('to eliminate all of the so'). There was no working demonstration, no clear explanation of the solution, no narrative arc, and no meaningful presentation of the work completed.
Originality Factor (bonus)
1.1
The 1.9 score reflects a promising start that went nowhere. The Google-inspired third-party dependency security angle is legitimate and timely, but 23 seconds isn't enough to demonstrate anything. The mid-sentence cutoff suggests either technical catastrophe or the presenter gave up. This is particularly disappointing because the problem space is exactly right for the track—supply chain attacks via dependencies are a real vector for rogue code injection. This feels like a team that had a good idea but couldn't execute under pressure.
#19 Ticket Security Incite ROGUE::AGENT 1.6
Technical Execution
1.6
No functional demonstration was observable. The screen remained on a static placeholder image throughout the entire 533-second duration. No code, system interface, or technical implementation was shown. The observations repeatedly note 'The screen remains a placeholder' and 'The presentation has not begun yet.' There is no evidence of any working system, let alone handling of edge cases or code correctness.
Innovation
1.6
While the audio transcript fragments mention concepts like 'threat identification' and 'creating context or narrative to support findings,' no actual innovative approach was demonstrated. The presenter's audio is largely incoherent with fragments like 'finding all the days all day long' and disconnected phrases. Without a visible demonstration or clear explanation of the approach, there is no evidence of novel AI x Security methodology or creative implementation.
Demo Quality
1.1
The demo completely failed to present any visual content. Observations consistently note a 'disconnect between audio and video' with the screen showing only a 'static blue background with logo' or 'placeholder image' for the entire 533-second duration. The audio transcript is fragmented and incoherent, with phrases like 'this is really human and the time to' and 'you know, on the square and you can be in pension' that do not form a coherent narrative. There was no working live demo, no clear explanation, and no effective presentation.
Originality Factor (bonus)
1.1
The 1.6 score reflects judges' frustration with a long presentation that communicated nothing. Nine minutes of placeholder screens and incomprehensible audio is worse than a short failed demo—it suggests the team didn't recognize their presentation had failed and kept going. The 'human and the time to' fragment hints at human-in-the-loop security, which could be interesting, but there's no evidence of implementation. This score is generous—it likely reflects credit for persistence rather than technical merit.
#20 TeamPie SHADOW::VECTOR 1.5
Technical Execution
1.6
No technical implementation was demonstrated or observable. The observations only describe static slides with generic icons (global reach, secure data, seamless integration) and a title slide. No code, functionality, system behavior, or technical capabilities were shown. The demo appears to have failed to execute any technical content beyond displaying presentation slides.
Innovation
1.6
No innovative approach, methodology, or novel technique was demonstrated. The observations only reference generic visual elements (abstract shapes, icons for common concepts) without any substantive content about AI x Security innovation. The presenter's audio was largely inaudible with only a fragment 'Uh, I want to' captured, providing no insight into innovative ideas or approaches.
Demo Quality
1.1
The demo failed to deliver meaningful content. Audio was severely compromised with 'significant background noise' and the presenter being 'only faintly audible,' with only 'Uh, I want to' transcribed from 275 seconds of presentation. No working demonstration was observable - only static slides were noted. No clear explanation, narrative, or compelling presentation was delivered. This represents a fundamental failure of the demonstration.
Attack Effectiveness (bonus)
0.6
The 1.5 score reflects the challenge of evaluating a completely failed presentation in a unique track. SHADOW::VECTOR implies vector embedding security—a genuinely important and underexplored area as AI systems increasingly rely on embeddings. However, with nearly no audio and only generic visual slides, judges had nothing to evaluate. The three-pillar framework could be meaningful, but without explanation, it's just buzzwords. This feels like a team that understood the importance of their chosen track but couldn't execute the presentation.
#21 MagicThing ROGUE::AGENT 1.4
Technical Execution
1.6
No functional demonstration was observable. The observations show only a blank frame with OBS logo and muted camera icon throughout the 201-second duration. The presenter transcript is largely unintelligible with fragments like 'analysis analysis analy' and noise markers. No code execution, system functionality, or technical implementation could be verified from the demo observations.
Innovation
1.1
While the presenter mentions 'continuous checking of source code' and 'deeper analysis,' no innovative approach could be assessed due to complete lack of visual demonstration. The fragmented audio references ('validation,' 'analysis,' 'agent') suggest some concept around automated code analysis, but without any observable implementation or clear explanation, no innovation can be evaluated.
Demo Quality
1.1
The demo failed completely. Visual observations confirm 'blank frame with OBS logo' and 'visual stasis' throughout the entire 201-second presentation. Audio was noted as absent ('There is no audio') and presenter transcripts are incoherent fragments. No working demonstration occurred, no clear explanation was provided, and no meaningful presentation took place.
Originality Factor (bonus)
1.1
The 1.4 score reflects a complete presentation collapse. The fragments about source code checking suggest they were building some kind of automated security scanner, which is relevant to the track. However, the combination of no visual feed, incomprehensible audio with random Chinese characters, and generic team name suggests either catastrophic technical failure or a project that never came together. The 'MagicThing' name is particularly telling—it sounds like a placeholder that never got replaced.
#22 Rick ROGUE::AGENT 1.4
Technical Execution
1.6
No functional implementation was demonstrated. The observations show only a static OBS placeholder screen with logo and muted camera icon throughout the 80-second demo. The presenter mentions 'algorithms', 'sha1', 'public ID', and 'bind name' but no actual code, terminal output, or working system is visible. There is a complete disconnect between audio description and visual content, suggesting technical failure to share the correct screen or display the demo properly.
Innovation
1.1
No innovative approach could be evaluated as no actual solution was presented. The brief audio mentions of 'sha1' and 'public ID' are too fragmentary to assess any novel AI x Security approach. Without seeing the implementation, architecture, or methodology, there is no evidence of innovation beyond generic cryptographic terminology.
Demo Quality
1.1
The demo completely failed to present any visual content. For the entire 80-second duration, only a static OBS logo placeholder was visible. The presenter's audio mentions technical concepts but viewers saw nothing - no code, no terminal, no application interface, no diagrams. The presentation ended abruptly with 'Thanks' after showing only the placeholder screen. This represents a fundamental demo failure with no meaningful demonstration occurring.
Originality Factor (bonus)
1.1
The 1.4 score reflects one of the most complete presentation failures in the event. 80 seconds of static placeholder with nearly no transcribable audio suggests either severe technical issues or a team that gave up mid-presentation. The mention of 'algorithms and IDs' is too vague to indicate any specific approach. This score is likely a minimum participation credit—judges couldn't evaluate what they couldn't see or hear.
#23 Test ROGUE::AGENT 1.4
Technical Execution
1.4
No technical implementation was demonstrated. The presenter mentioned 'Lacroy' as a 'flavored water' for securing hydration, which appears to be a joke rather than any actual security tool or system. No code was shown, no functionality was demonstrated, and no technical concepts were explained. The demo lacks any evidence of implementation quality, functionality, or code correctness.
Innovation
1.1
No innovation in AI x Security was presented. The 'Lacroy' concept described as 'flavored water' for 'securing hydration' is not a security technology and shows no connection to the ROGUE::AGENT track theme. There is no evidence of novel approaches, creative security solutions, or original thinking related to AI security or adversarial AI systems.
Demo Quality
1.4
The demo was extremely brief (11 seconds) and consisted only of informal statements with no actual demonstration of any working system. The presenter said 'This is a test' twice, then made a joke about inventing flavored water. There was no clear explanation of any security concept, no live demo of functionality, no presentation structure, and no compelling narrative related to the hackathon theme.
Originality Factor (bonus)
1.1
The 1.4 score is puzzling—this shouldn't have been scored at all as it's explicitly a test. The 11-second duration and 'test of me demoing' statement make it clear this wasn't a real submission. Either this was a technical test that accidentally got included in judging, or a team used 'test' as cover for a non-existent project. The score likely reflects judges' confusion about whether to evaluate it at all. This is the baseline for 'showed up but presented nothing.'

The Arbiter’s Deliberation

NEBULA:FOG 2026 was a study in the gap between ambition and execution, where technical difficulties became the great equalizer. The event's top tier—AgentRange and DoYouKnowWhatYouBuiltLastSummer at 8.3—exists in a quantum state of Schrödinger's excellence: their scores suggest strong work, but the complete absence of observable content means we're judging ghosts. Genomics at 7.9 becomes the de facto winner by virtue of being the highest-scoring team with actual evidence of technical work, showing docker workflows and bioinformatics tooling that at least proved something existed. The middle tier revealed a hackathon truth: presentation matters as much as code. Teams like Kavin (6.1) and Starcraft (6.5) burned through full 10-minute slots but struggled with audio corruption and visual failures that reduced their demos to archaeological exercises in transcript interpretation. The track diversity—TeamKickass in SENTINEL::MESH, GossipProblem in ZERO::PROOF, TeamPie in SHADOW::VECTOR—showed ambition to tackle different security angles, but the 5.7-6.3 scores suggest these teams couldn't fully realize their visions in the time available. Meanwhile, the supply chain security cluster (WeNeedaName, ThirdParty, ThirdParty-shapor) all attacked the same dependency problem with wildly different outcomes, from 5.3 to 0.0. The bottom tier is where things get interesting in a darkly comedic way. ThirdParty-shapor's 0.0 despite having the most detailed technical architecture in the entire event is either a cautionary tale about demo functionality trumping design, or evidence of a disqualification that isn't reflected in the data. Teams like Test (1.4) and Lens (0.0) with 11-second demos set the floor for 'we showed up,' while longer failures like Ticket Security Incite (1.6) proved that nine minutes of incomprehensible audio is worse than honest brevity. The event's lesson: in a hackathon, a working demo beats a brilliant architecture, clear audio beats perfect code, and sometimes the best strategy is knowing when to stop talking. These teams built the foundation for future work—the question is whether they'll debug their presentation skills as thoroughly as their code.

Notable Themes