MDASH: Microsoft is betting on the system, not the model.

Good morning, security frontrunners

In this week’s cyber AI breakdown, we take a look at Microsoft’s new MDASH tool to learn:

How MDASH is like Mythos, but different
How it uses agents to debate and prove its way to confirming findings
Why Microsoft is betting on the system, not the model
MDASH’s performance results
How you can start using MDASH

DEEP-DIVE

Vulnerability tools are moving from "looks risky" to "prove it".

Most security tools still have a trust problem.

They can say "this code looks dangerous".

But is it reachable?
Can an attacker control the input?
Can it crash something?
Can it become a real vulnerability?

Mythos recently led the news for its ability to discover vulnerabilities AND THEN prove they are legit by exploiting them.

Microsoft's MDASH does a similar thing... but in a different way.

MDASH = Multi-model Agentic Scanning Harness

“Harness” being the key word.

Unlike Mythos, MDASH is not a model. It's an orchestration layer that runs an ensemble of models across 100+ specialised agents (together, the harness), where the models are swappable inputs.

Put simply…

The model is one input. The system is the product.

The MDASH Pipeline

So how exactly does MDASH work? It follows this pipeline when assessing a target:

1. Prepare:

Ingests the target, builds language-aware indices, and reconstructs the attack surface + threat model by mining commit history.

This isn't just AST indexing - it's using past commits to infer weak points.

2. Scan:

“Auditor agents” run over candidate paths and emit findings with hypotheses and evidence attached.

Each finding is a “claim”, not a certain vulnerability yet.

3. Validate:

That’s where a separate cohort of agents, known as “debater agents”, comes in.

These agents argue for and against each claim made by the auditor agents based on reachability and exploitability.

When an auditor claims something and the debater can't refute it, the finding's posterior credibility goes up.

They're running an adversarial confidence estimate instead of a static severity label.
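To make "posterior credibility" concrete, here is a minimal sketch of one way a debate round could move a finding's score, using a plain Bayes update. Microsoft has not published MDASH's scoring, so the function name and both probabilities below are illustrative assumptions, not the real mechanism.

```python
def update_credibility(prior: float, refuted: bool,
                       p_survive_if_real: float = 0.9,
                       p_survive_if_false: float = 0.3) -> float:
    """Bayes update of a finding's credibility after one debate round.

    Both survival probabilities are made-up illustration values:
    - p_survive_if_real: chance a real bug survives the debater's attack
    - p_survive_if_false: chance a false alarm survives it anyway
    """
    if refuted:
        like_real = 1.0 - p_survive_if_real
        like_false = 1.0 - p_survive_if_false
    else:
        like_real = p_survive_if_real
        like_false = p_survive_if_false
    numerator = like_real * prior
    return numerator / (numerator + like_false * (1.0 - prior))


# A claim starts at 50/50 and survives two debate rounds unrefuted.
credibility = 0.5
for refuted in (False, False):
    credibility = update_credibility(credibility, refuted)
```

Under these assumed numbers, each round the debater fails to refute pushes the score up, and a successful refutation pushes it down. That is the difference between a moving confidence estimate and a fixed severity label.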

4. Dedupe:

Collapses semantically equivalent findings (e.g. patch-based grouping), so you're not triaging 50 reports of the same root cause.
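A rough sketch of what patch-based grouping could look like: if two findings would be fixed by the same change, treat them as one. The `Finding` class and its `patch_site` field are hypothetical, invented here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    finding_id: str
    patch_site: tuple[str, str]  # hypothetical: (file, function) a fix would touch
    credibility: float


def dedupe(findings: list[Finding]) -> list[Finding]:
    """Keep one representative (the most credible) per patch site."""
    groups: dict[tuple[str, str], list[Finding]] = defaultdict(list)
    for finding in findings:
        groups[finding.patch_site].append(finding)
    return [max(group, key=lambda f: f.credibility) for group in groups.values()]


reports = [
    Finding("F1", ("parser.c", "read_header"), 0.6),
    Finding("F2", ("parser.c", "read_header"), 0.9),
    Finding("F3", ("net.c", "recv_packet"), 0.7),
]
unique = dedupe(reports)  # F2 and F3 remain; F1 folds into F2
```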

5. Prove:

“Prover agents” then construct and execute triggering inputs where the bug class allows it, dynamically validating preconditions and formulating the actual PoC.

This allows MDASH to prove that vulns are actually exploitable.
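Put together, the five stages behave like a chain where each stage takes the run state and hands a narrower, better-evidenced set of claims to the next. The sketch below is a toy skeleton only: every class, function, and hard-coded claim is an assumption standing in for real agents.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Claim:
    location: str
    hypothesis: str
    status: str = "claimed"  # claimed -> validated -> proven


@dataclass
class RunState:
    target: str
    claims: list[Claim] = field(default_factory=list)
    log: list[str] = field(default_factory=list)


Stage = Callable[[RunState], RunState]


def run_pipeline(target: str, stages: list[tuple[str, Stage]]) -> RunState:
    """Run each stage in order and record how many claims survive it."""
    state = RunState(target=target)
    for name, stage in stages:
        state = stage(state)
        state.log.append(f"{name}: {len(state.claims)} claim(s)")
    return state


def prepare(state: RunState) -> RunState:
    return state  # stand-in for indexing and threat modelling


def scan(state: RunState) -> RunState:
    # Stand-in for auditor agents: emit claims, including a duplicate.
    state.claims = [
        Claim("parse_header", "length field is not bounds-checked"),
        Claim("parse_header", "length field is not bounds-checked"),
        Claim("log_message", "format string is attacker-controlled"),
    ]
    return state


def validate(state: RunState) -> RunState:
    # Stand-in for debaters: pretend the log_message claim was refuted.
    state.claims = [c for c in state.claims if c.location != "log_message"]
    for claim in state.claims:
        claim.status = "validated"
    return state


def dedupe(state: RunState) -> RunState:
    unique: dict[tuple[str, str], Claim] = {}
    for claim in state.claims:
        unique.setdefault((claim.location, claim.hypothesis), claim)
    state.claims = list(unique.values())
    return state


def prove(state: RunState) -> RunState:
    # Stand-in for provers: a real prover would execute a triggering input.
    for claim in state.claims:
        claim.status = "proven"
    return state


STAGES = [("prepare", prepare), ("scan", scan), ("validate", validate),
          ("dedupe", dedupe), ("prove", prove)]
result = run_pipeline("driver.c", STAGES)
```

In this toy run, three claims go in at the scan stage and one proven finding comes out the other end.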

The architecture

Three architectural properties make this process work:

1. Mixed model panel.

No one model is best for every step.

MDASH uses a flexible group of models - one top model for the hard reasoning, smaller distilled models for cheaper high-volume debate, and another independent top model to challenge the first one.

It's a variance-reduction mechanism.

2. Role-specialised agents.

An auditor agent doesn't reason like a debater agent, which doesn't reason like a prover agent.

Each stage gets its own prompt regime, tools, and stop criteria.
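One way to picture "own prompt regime, tools, and stop criteria" is a small per-role config. Everything in this sketch (field names, prompts, tool names, step budgets) is hypothetical and only illustrates the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    system_prompt: str
    tools: tuple[str, ...]
    max_steps: int  # a simple stop criterion: give up after this many steps


# Hypothetical role definitions; none of these names come from Microsoft.
ROLES = {
    "auditor": RoleConfig(
        system_prompt="Report suspicious code paths as claims with evidence.",
        tools=("code_search", "call_graph"),
        max_steps=40,
    ),
    "debater": RoleConfig(
        system_prompt="Argue for or against a claim's reachability and exploitability.",
        tools=("code_search",),
        max_steps=10,
    ),
    "prover": RoleConfig(
        system_prompt="Build and run an input that triggers the bug.",
        tools=("build", "run_target", "debugger"),
        max_steps=80,
    ),
}
```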

3. Extensible plugins.

The pipeline has a clear structure, but it can be extended.

Experts can add plugins that give the models extra knowledge they would not know by default, such as kernel rules, IRP behavior, or security boundaries between processes.
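As a hedged illustration of the plugin idea, here is a tiny registry where each plugin inspects source text and returns domain hints an agent could be given as extra context. The registry, the decorator, and the heuristic are all assumptions; the check itself is a deliberately naive string match, not a real kernel rule engine.

```python
from typing import Callable

# Hypothetical registry: plugin name -> function returning hints for the agents.
PLUGINS: dict[str, Callable[[str], list[str]]] = {}


def plugin(name: str):
    """Decorator that registers a domain-knowledge plugin under a name."""
    def register(fn: Callable[[str], list[str]]):
        PLUGINS[name] = fn
        return fn
    return register


@plugin("kernel_rules")
def kernel_rules(source: str) -> list[str]:
    # Toy heuristic: flag user-buffer access with no ProbeForRead in sight.
    hints = []
    if "UserBuffer" in source and "ProbeForRead" not in source:
        hints.append("User buffer is read without a ProbeForRead check.")
    return hints


def gather_context(source: str) -> list[str]:
    """Collect hints from every registered plugin."""
    return [hint for fn in PLUGINS.values() for hint in fn(source)]


hints = gather_context("memcpy(dst, Irp->UserBuffer, len);")
```

The point of the design is that the expert writes the rule once, and every model on the panel benefits from it.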

These architectural properties all allow for portability across model generations.

The five-step pipeline above is model-agnostic by construction.

New model drops? A/B it against the current panel with one config flip.

Your scope files, plugins, and calibrations carry forward. You ride the frontier without re-tooling.
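Here is a minimal sketch of what a "one config flip" could look like, assuming the panel is just a mapping of roles to model names. The config shape and the model names are placeholders, not MDASH's actual configuration format.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PanelConfig:
    reasoner: str    # top model for the hard reasoning
    debater: str     # smaller distilled model for high-volume debate
    challenger: str  # independent top model that challenges the reasoner


def changed_roles(old: PanelConfig, new: PanelConfig) -> dict[str, tuple[str, str]]:
    """Return the roles whose model differs between two panels."""
    return {
        f.name: (getattr(old, f.name), getattr(new, f.name))
        for f in fields(PanelConfig)
        if getattr(old, f.name) != getattr(new, f.name)
    }


# Placeholder model names, purely illustrative.
current = PanelConfig(reasoner="model-a", debater="small-distilled", challenger="model-b")
candidate = replace(current, reasoner="model-c")  # the single "flip"

diff = changed_roles(current, candidate)
```

Both panels would then run against the same scope files and benchmarks, so any change in results can be attributed to the one swapped model.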

Okay, let’s have a look at the numbers.

16 Windows CVEs found and shipped, including 4 Critical RCEs.
Found 21/21 planted vulns, with zero false positives on StorageDrive - a private interview-grade driver with deliberately injected issues, gaps and errors.
The key thing here is that these issues, gaps and errors have never been published, so they're not in any model's training corpus, i.e. MDASH is reasoning, not just pattern-matching memorised answers.
96% recall against five years of confirmed MSRC cases in clfs.sys; 100% in tcpip.sys.
CyberGym: 88.45% at launch (May 12) - top of the public leaderboard by ~5 points on 1,507 real-world reproduction tasks across 188 OSS-Fuzz-sourced projects.

Each task hands the system a pre-patch codebase + vuln description and requires a PoC that fires on the vulnerable build but not the patched one.
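That pass condition is a differential check, and it is easy to sketch. The two "builds" below are toy Python parsers invented for illustration, not CyberGym code: the vulnerable one trusts a declared length, the patched one rejects it cleanly.

```python
def parse_vulnerable(data: bytes) -> bytes:
    """Toy pre-patch build: trusts the length byte and overruns a 4-byte buffer."""
    length, payload = data[0], data[1:]
    buf = bytearray(4)
    for i in range(length):
        buf[i] = payload[i]
    return bytes(buf)


def parse_patched(data: bytes) -> bytes:
    """Toy patched build: validates the length before copying."""
    length, payload = data[0], data[1:]
    if length > 4 or length > len(payload):
        raise ValueError("declared length rejected")
    buf = bytearray(4)
    buf[:length] = payload[:length]
    return bytes(buf)


def crashes(build, data: bytes) -> bool:
    """Treat a clean ValueError as a safe rejection, anything else as a crash."""
    try:
        build(data)
    except ValueError:
        return False
    except Exception:
        return True
    return False


def poc_is_valid(poc: bytes, vulnerable, patched) -> bool:
    """A PoC counts only if it fires on the vulnerable build and not the patched one."""
    return crashes(vulnerable, poc) and not crashes(patched, poc)


poc = bytes([8]) + b"A" * 8   # declares 8 bytes for a 4-byte buffer
benign = bytes([2]) + b"hi"   # well-formed input
```

An input that breaks both builds, or neither, proves nothing about the specific bug. That is why the check has two halves.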

Then the kicker: by Build 2026 (June 2), CyberGym was at 96.55% - roughly +8 points in three weeks, purely from model-panel refinement.

Mythos vs MDASH

Mythos bets on the model. Maximise raw single-model capability; the model is the moat.

MDASH bets on the system. Assume any given model is transient; the harness is the moat. Microsoft's argument: "The right question isn't which model does it use? but what does it do with the model, and what survives when the next model arrives?"

A single-model approach has to re-validate its entire value prop every model generation.

A model-agnostic harness benefits from each new generation instead, and lets you mix a cheap distilled debater against an expensive reasoner to control cost-per-finding at scale.

With that said, both Mythos and MDASH have converged on the same end-state insight: the win condition is verified exploitability, not finding count.

A tool that floods your queue with ambiguous vulnerabilities makes you slower. MDASH's prove stage and Mythos's PoC chains are both attempts to ship owner-ready, reproducible findings instead of speculative detections.

Can I use MDASH yet?

MDASH is now in expanded preview for eligible orgs. See here.

Findings route into the Defender Portal via a native Defender ↔ GitHub Code Security connector, enriched with production risk signals (e.g. internet exposure), so you can prioritise by real blast radius instead of raw CVSS.
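To show why that enrichment matters, here is a toy ranking where internet exposure outranks raw CVSS. This is not Defender's actual prioritisation logic; the `Finding` fields and the sort rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    title: str
    cvss: float
    internet_exposed: bool  # hypothetical production risk signal


def prioritise(findings: list[Finding]) -> list[Finding]:
    """Rank exposed findings first, then break ties by CVSS (highest first)."""
    return sorted(findings, key=lambda f: (f.internet_exposed, f.cvss), reverse=True)


queue = prioritise([
    Finding("Overflow in internal batch tool", cvss=9.8, internet_exposed=False),
    Finding("Auth bypass on public API", cvss=7.5, internet_exposed=True),
    Finding("Info leak on public site", cvss=5.3, internet_exposed=True),
])
```

Sorting on CVSS alone would put the internal 9.8 at the top. With the exposure signal, the reachable 7.5 gets looked at first.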

Remediation closes the loop through GitHub Copilot Autofix and the Copilot cloud agent. If you're already a Defender + GitHub shop, the friction to pilot is low.

That’s it for this week!

See you next Sunday 🙂

Zac S from The Cyber Breakdown