We introduce mavn-1, a 20.9B parameter Mixture-of-Experts cybersecurity agent model — 3.6B active per token — trained to autonomously plan, execute tools, self-correct, and report findings across 15 security domains. On CyberBench, mavn-1 matches general-purpose frontier models on security tasks at a fraction of the cost and latency, running on a single consumer GPU.
Apache 2.0. Open weights. Full training methodology published.
General-purpose frontier models can perform security tasks — vulnerability analysis, threat classification, incident triage — but they do so inefficiently. They carry hundreds of billions of general-knowledge parameters into every inference call, resulting in significant cost and latency for tasks that require deep domain-specific reasoning rather than broad capability.
The core challenge is agentic. A useful cybersecurity model must decompose a vague alert into structured hypotheses, iteratively query logs and codebases, select and sequence the right security tools, correlate findings across disparate sources, self-correct when an approach fails, and produce actionable remediation — all autonomously. General models approximate this through prompt chaining. mavn-1 was trained end-to-end to do it natively.
mavn-1 builds on GPT-OSS-20B, a 20.9B parameter Mixture-of-Experts model with 3.6B active parameters per token. The MoE architecture — 32 experts with top-4 routing across 24 layers — gives mavn-1 frontier-class reasoning at consumer-GPU inference cost: the full checkpoint is 12.8 GB and runs on 16 GB VRAM.
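The top-4-of-32 routing described above can be pictured with a small sketch. This is a generic top-k gating illustration, not the GPT-OSS router code; the function name and shapes are made up for the example.

```python
import numpy as np

def top_k_route(router_logits: np.ndarray, k: int = 4):
    """Pick the top-k experts per token and renormalize their gate weights.

    router_logits: (num_tokens, num_experts) scores from the router.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    # Indices of the k largest logits per token (unsorted within the k).
    idx = np.argpartition(router_logits, -k, axis=-1)[:, -k:]
    top = np.take_along_axis(router_logits, idx, axis=-1)
    # Softmax over only the selected k logits, as in standard top-k gating.
    top = top - top.max(axis=-1, keepdims=True)
    w = np.exp(top)
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w

# With 32 experts and top-4 routing, each token touches 4/32 of the expert
# parameters -- which is how a 20.9B-parameter model activates only ~3.6B.
logits = np.random.default_rng(0).normal(size=(8, 32))  # 8 tokens, 32 experts
indices, weights = top_k_route(logits, k=4)
```

Because only the selected experts' weights participate in each token's forward pass, compute and memory bandwidth scale with the 3.6B active parameters rather than the full 20.9B.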
The base model scores 96% on AIME 2024 and 60.7% on SWE-Bench Verified, demonstrating strong reasoning and code capabilities. Its factual recall is weak (6.7% SimpleQA), a non-issue for mavn-1, which grounds every claim through RAG against seven knowledge stores: CVE details are retrieved at inference time, never recalled from weights, which removes the main opening for hallucinated specifics.
Four research areas converge in the mavn-1 training pipeline. Each addresses a specific limitation of applying general-purpose models to cybersecurity agentic workflows.
We build on a Mixture-of-Experts backbone where 32 experts with top-4 routing activate only 3.6B of 20.9B total parameters per token. The alternating banded-window + dense attention pattern enables 128K native context — critical for analyzing full network captures, large codebases, and extended log files — while keeping inference cost proportional to active parameters, not total.
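The alternating attention pattern can be sketched as a per-layer mask. Which parity gets the banded pattern and the band width used here are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def layer_mask(seq_len: int, layer: int, window: int = 128) -> np.ndarray:
    """Causal attention mask for one layer: even layers use a banded
    (sliding-window) pattern, odd layers attend densely to all prior tokens.
    True = attention allowed. The parity convention and `window` size are
    illustrative, not the real model's settings."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    if layer % 2 == 0:                       # banded layer
        return causal & (q - k < window)
    return causal                            # dense layer

banded = layer_mask(256, layer=0, window=128)
dense = layer_mask(256, layer=1)
```

The banded layers keep per-token attention cost constant in sequence length, while the interleaved dense layers preserve long-range information flow across the full 128K context.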
Real-world security data is scarce, sensitive, and imbalanced. We generate 5,500–8,000 high-fidelity agentic trajectories — complete multi-step workflows of an expert analyst triaging a CVE, auditing a codebase, or investigating an incident — using frontier models with extraction-based quality verification. 5% negative examples train the model to recognize impossible tasks and scope boundaries.
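The corpus mixing described above can be sketched as follows. The record fields and helper are hypothetical stand-ins, not the actual data pipeline's schema.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """One synthetic agentic workflow (field names are illustrative)."""
    domain: str
    steps: list = field(default_factory=list)  # tool calls + observations
    negative: bool = False                     # impossible / out-of-scope task

def build_corpus(n_total: int, domains: list, negative_rate: float = 0.05,
                 seed: int = 0) -> list:
    """Mix ~5% negative examples into the corpus so the model learns to
    recognize impossible tasks and scope boundaries."""
    rng = random.Random(seed)
    return [
        Trajectory(domain=domains[i % len(domains)],
                   negative=rng.random() < negative_rate)
        for i in range(n_total)
    ]

corpus = build_corpus(8000, ["web", "cloud", "forensics"], negative_rate=0.05)
neg_share = sum(t.negative for t in corpus) / len(corpus)
```

Sampling negatives at a fixed rate, rather than segregating them, means the model cannot learn a superficial cue that distinguishes solvable from impossible tasks.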
A three-stage alignment pipeline: QLoRA supervised fine-tuning on expert security workflows (0.3–0.5% of parameters on a single A100), then RL with CISPO — Clipped Importance-Sampled Policy Optimization — which prevents the entropy collapse that standard GRPO causes on this architecture. The model learns to select, sequence, and interpret 35 security tools through interaction.
Security tasks require structured multi-hop reasoning: correlating a CVE description with affected code paths, tracing lateral movement across log sources, or chaining partial indicators into a threat narrative. We train mavn-1 through a 4-phase curriculum — from single-tool foundation tasks through full adversarial campaigns — with a multi-signal reward function that scores task completion, process quality, efficiency, safety compliance, and self-correction.
The training recipe follows the CISPO methodology validated by Chroma's Context-1 on the same base model, de-risking the approach before training began.
1. Synthetic data generation: generate 5,500–8,000 agentic trajectories across 15 domains using frontier models, with LLM-judge quality filtering and 5% negative examples.
2. Supervised fine-tuning: QLoRA on expert security workflows; rank 16, alpha 32, router weights frozen, ~0.3–0.5% of parameters trainable, on a single A100.
3. Reinforcement learning: CISPO (Clipped Importance-Sampled Policy Optimization), 64 queries/step, 4 rollouts/query. Multi-signal reward: task completion, process quality, efficiency, safety, self-correction.
4. Curriculum: 4 phases, Foundation (C1–C2, 5 domains) → Intermediate (C2–C3, all 15) → Advanced (C3–C4, cross-domain) → Expert (C4–C5, adversarial campaigns).
5. Release: open-source weights (MXFP4, GGUF, AWQ), the training methodology, and the CyberBench evaluation harness, all under Apache 2.0.
Standard GRPO causes entropy collapse on the GPT-OSS-20B architecture. CISPO resolves this through four modifications: removing response length normalization (introduces length bias), removing advantage standard deviation normalization (amplifies noise), using the k1 KL estimator instead of k3 (prevents variance blow-up), and clipping importance weights. The result is stable training that converges around step 200–300 with consistent improvement across all reward signals.
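The modifications above can be made concrete with a minimal numpy sketch of a CISPO-style objective. This is an illustration of the idea, not mavn-1's training code; the clipping bounds and KL coefficient are invented values, and the stop-gradient treatment of the clipped weight is only described in a comment since numpy has no autograd.

```python
import numpy as np

def cispo_loss(logp_new, logp_old, advantages,
               clip_lo=0.8, clip_hi=1.2, kl_coef=0.01):
    """Sketch of a CISPO-style objective:
      * the importance weight r = exp(logp_new - logp_old) is CLIPPED and
        (conceptually) stop-gradiented, acting as a bounded coefficient on
        the policy-gradient term rather than zeroing tokens outright;
      * plain means -- no response-length or advantage-std normalization;
      * k1 KL estimator (simple log-ratio) instead of the higher-variance k3.
    """
    r = np.exp(logp_new - logp_old)
    w = np.clip(r, clip_lo, clip_hi)            # clipped importance weight
    pg = -(w * advantages * logp_new).mean()    # raw mean, no length norm
    kl = (logp_new - logp_old).mean()           # k1 estimator
    return pg + kl_coef * kl

logp_old = np.array([-1.0, -2.0, -0.5, -1.5])
logp_new = np.array([-0.9, -2.2, -0.4, -1.4])
adv = np.array([1.0, -0.5, 0.5, 0.2])
loss = cispo_loss(logp_new, logp_old, adv)
```

Because the clipped weight multiplies the gradient instead of gating the objective, every token keeps a bounded, nonzero learning signal, which is what keeps policy entropy from collapsing.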
Five reward signals shape mavn-1's behavior. Their weights anneal across the four curriculum phases — process quality dominates early training while self-correction increases as tasks grow harder.
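The annealing of per-signal weights can be sketched as interpolation between the documented endpoint values. Linear interpolation is an assumption (the source gives only endpoints), and task completion is omitted because its exact schedule is not stated, only that it is the primary signal.

```python
# Documented (start, end) weights over the 4-phase curriculum; the pairing
# follows the stated trend (process quality declining, self-correction rising).
SCHEDULE = {
    "process_quality": (0.30, 0.05),
    "efficiency":      (0.10, 0.15),
    "safety":          (0.20, 0.15),
    "self_correction": (0.10, 0.20),
}

def weights_at(phase: int, n_phases: int = 4) -> dict:
    """Linearly interpolate each signal's weight from phase 1 to n_phases.
    (The linear schedule is an illustrative assumption.)"""
    t = (phase - 1) / (n_phases - 1)
    return {k: (1 - t) * a + t * b for k, (a, b) in SCHEDULE.items()}

w1, w4 = weights_at(1), weights_at(4)
```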
Task completion (primary signal): flag captured, vulnerability confirmed, exploit succeeded. Recall-weighted F1 scoring with 8:1 → 2:1 annealing.
Process quality (0.30 → 0.05): recon before exploitation (+0.2), correct tool selection (+0.15), RAG grounding before claims (+0.15), evidence documentation (+0.1), trying alternatives after failure (+0.15).
Efficiency (0.10 → 0.15): penalties for redundant tool calls (-0.15 each), a nonlinear length penalty, and turn-count penalties scaled by task complexity.
Safety compliance (0.20 → 0.15): out-of-scope actions (-0.5), destructive actions (-0.3). Hard termination (reward = 0) for attacking outside scope or disabling logging.
Self-correction (0.10 → 0.20): error detection and strategy change (+0.2), recognizing a wrong path (+0.15), acknowledging uncertainty (+0.1).

We evaluate on CyberBench, a suite of 750 held-out tasks across all 15 domains, stratified by complexity (C1–C5). Sources include CTF challenges (150), CVE-based tasks (100), detection rule writing (75), log analysis (75), code review (75), compliance (50), OSINT (50), cloud security (50), and negative/impossible examples (75). 60% automated verification, 40% LLM judge.
We compare against raw GPT-OSS-20B (zero-shot), each previous training checkpoint, and Claude Sonnet 4 as a frontier general-purpose baseline. Evaluation runs in three modes: simulated (fast iteration), sandbox (real tool execution in Firecracker microVMs), and ablation (diagnostic, isolating individual reward signals).
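Stratified reporting of this kind can be sketched with a small aggregator. The record format here is a stand-in and does not claim to match the real harness.

```python
from collections import defaultdict

def stratified_pass_rate(results):
    """Aggregate CyberBench-style results by complexity tier.
    `results` is a list of (tier, passed) pairs -- an illustrative format,
    not the harness's actual record schema."""
    by_tier = defaultdict(lambda: [0, 0])
    for tier, passed in results:
        by_tier[tier][0] += int(passed)
        by_tier[tier][1] += 1
    return {t: p / n for t, (p, n) in sorted(by_tier.items())}

demo = [("C1", True), ("C1", True), ("C2", True), ("C2", False), ("C3", False)]
rates = stratified_pass_rate(demo)  # pass rate per tier, C1 through C3
```

Reporting per-tier rates rather than a single aggregate makes regressions on hard (C4–C5) tasks visible even when easy tasks dominate the count.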
mavn-1 is a cybersecurity operator, not a general-purpose assistant. It excels at structured security tasks — offensive and defensive — but is not designed for open-ended conversation, creative writing, or tasks outside the security domain. This is a deliberate scope constraint.
Agentic evaluation remains an open problem. CyberBench measures task completion on standardized scenarios, but real-world security workflows are messier, more ambiguous, and more adversarial than any benchmark captures. V1 may ship with 10–12 fully evaluated domains rather than all 15. We welcome collaboration on evaluation methodology.
We release the mavn-1 weights under a permissive Apache 2.0 license. We also release the full training codebase, the synthetic data generation pipeline, and the CyberBench evaluation harness to support reproducibility and further research.
We work with security teams, red teamers, and research labs building on mavn-1 or contributing to the training pipeline. If you are deploying agentic security workflows or developing evaluation methodology for security AI, we want to hear from you.