Research

mavn-1: A frontier agentic model for cybersecurity

We introduce mavn-1, a 20.9B parameter Mixture-of-Experts cybersecurity agent model — 3.6B active per token — trained to autonomously plan, execute tools, self-correct, and report findings across 15 security domains. On CyberBench, mavn-1 matches general-purpose frontier models on security tasks at a fraction of the cost and latency, running on a single consumer GPU.

Apache 2.0. Open weights. Full training methodology published.

20.9B total params · 3.6B active / token · 128K context · 12.8 GB checkpoint · 15 security domains · 35 agent tools
Motivation

Cybersecurity needs its own model

General-purpose frontier models can perform security tasks — vulnerability analysis, threat classification, incident triage — but they do so inefficiently. They carry hundreds of billions of general-knowledge parameters into every inference call, resulting in significant cost and latency for tasks that require deep domain-specific reasoning rather than broad capability.

The core challenge is agentic. A useful cybersecurity model must decompose a vague alert into structured hypotheses, iteratively query logs and codebases, select and sequence the right security tools, correlate findings across disparate sources, self-correct when an approach fails, and produce actionable remediation — all autonomously. General models approximate this through prompt chaining. mavn-1 was trained end-to-end to do it natively.

Base Model

Architecture

mavn-1 builds on GPT-OSS-20B, a 20.9B parameter Mixture-of-Experts model with 3.6B active parameters per token. The MoE architecture — 32 experts with top-4 routing across 24 layers — gives mavn-1 frontier-class reasoning at consumer-GPU inference cost: the full checkpoint is 12.8 GB and runs on 16 GB VRAM.

Specification                Value
Total parameters             20.9B
Active parameters / token    3.6B
Architecture                 Mixture-of-Experts (MoE)
Layers                       24
Experts per block            32 (top-4 routing)
Attention                    GQA — 64 query heads, 8 KV heads
Context window               131,072 tokens (128K native)
Weight format                MXFP4 (4.25 bits/param, trained natively)
Checkpoint size              12.8 GB
Minimum VRAM                 ~16 GB
Attention pattern            Alternating banded (128-token) + dense
License                      Apache 2.0
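The routing arithmetic behind these numbers can be illustrated with a toy sketch. This is a minimal NumPy illustration of top-4-of-32 expert selection; the dimensions are made up for the example and are not mavn-1's real hidden sizes.

```python
import numpy as np

# Toy top-4-of-32 expert routing, showing why only a fraction of the total
# parameters are active per token. All dimensions here are illustrative.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 32, 4

router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_layer(x):
    """Route one token vector through its top-4 experts only."""
    logits = x @ router_w                      # (n_experts,) routing scores
    top = np.argsort(logits)[-top_k:]          # indices of the 4 best-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over the selected experts only
    # Only 4 of the 32 expert matrices are touched for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.standard_normal(d_model))
```

Per-token compute scales with the 4 selected experts, not the full expert bank, which is why a 20.9B checkpoint can behave like a ~3.6B model at inference time.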

The base model scores 96% on AIME 2024 and 60.7% on SWE-Bench Verified, demonstrating strong reasoning and code capabilities. Its factual recall is weak (6.7% SimpleQA) — a manageable trade-off for mavn-1, which grounds every claim through RAG against seven knowledge stores: CVE details are retrieved at inference time rather than recalled from weights, leaving little room for hallucination.

Approach

How we built mavn-1

Four research areas converge in the mavn-1 training pipeline. Each addresses a specific limitation of applying general-purpose models to cybersecurity agentic workflows.

01

Model Architecture

We build on a Mixture-of-Experts backbone where 32 experts with top-4 routing activate only 3.6B of 20.9B total parameters per token. The alternating banded-window + dense attention pattern enables 128K native context — critical for analyzing full network captures, large codebases, and extended log files — while keeping inference cost proportional to active parameters, not total.

MoE 32×top-4 · 128K Native · 3.6B Active
02

Data Synthesis

Real-world security data is scarce, sensitive, and imbalanced. We generate 5,500–8,000 high-fidelity agentic trajectories — complete multi-step workflows of an expert analyst triaging a CVE, auditing a codebase, or investigating an incident — using frontier models with extraction-based quality verification. 5% negative examples train the model to recognize impossible tasks and scope boundaries.

Agentic Traces · LLM-Judge QA · Negative Examples
03

Fine-tuning

A three-stage alignment pipeline: QLoRA supervised fine-tuning on expert security workflows (0.3–0.5% of parameters on a single A100), then RL with CISPO — Clipped Importance-Sampled Policy Optimization — which prevents the entropy collapse that standard GRPO causes on this architecture. The model learns to select, sequence, and interpret 35 security tools through interaction.

QLoRA SFT · CISPO RL · Tool-use RL
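The LoRA mechanics of the SFT stage can be sketched as follows. This is a minimal illustration with made-up dimensions; real QLoRA additionally quantizes the frozen base weights to 4-bit.

```python
import numpy as np

# Minimal LoRA adapter: a frozen weight W plus a trainable low-rank bypass
# scaled by alpha / r (here r=16, alpha=32, matching the recipe above).
rng = np.random.default_rng(1)
d, r, alpha = 256, 16, 32

W = rng.standard_normal((d, d))            # frozen base weight (4-bit in QLoRA)
A = rng.standard_normal((r, d)) * 0.01     # trainable down-projection
B = np.zeros((d, r))                       # trainable up-projection, zero-init

def lora_forward(x):
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal(d)
# Zero-initialised B makes the adapter a no-op before training starts,
# so fine-tuning begins exactly at the base model's behavior.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only A and B are updated, which is where the ~0.3–0.5% trainable-parameter figure comes from; the MoE router weights stay frozen so expert routing remains stable during fine-tuning.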
04

Reasoning

Security tasks require structured multi-hop reasoning: correlating a CVE description with affected code paths, tracing lateral movement across log sources, or chaining partial indicators into a threat narrative. We train mavn-1 through a 4-phase curriculum — from single-tool foundation tasks through full adversarial campaigns — with a multi-signal reward function that scores task completion, process quality, efficiency, safety compliance, and self-correction.

4-Phase Curriculum · 5-Signal Reward · Self-Correction
Training

Pipeline

The training recipe follows the CISPO methodology validated by Chroma's Context-1 on the same base model, de-risking the approach before training begins.

01

Synthesize

Generate 5,500–8,000 agentic trajectories across 15 domains using frontier models. LLM-judge quality filtering. 5% negative examples.

02

SFT

QLoRA fine-tuning on expert security workflows. Rank 16, alpha 32, router weights frozen. ~0.3–0.5% of parameters trainable. Single A100.

03

CISPO RL

Clipped Importance-Sampled Policy Optimization. 64 queries/step, 4 rollouts/query. Multi-signal reward: task completion, process quality, efficiency, safety, self-correction.

04

Curriculum

4 phases: Foundation (C1–C2, 5 domains) → Intermediate (C2–C3, all 15) → Advanced (C3–C4, cross-domain) → Expert (C4–C5, adversarial campaigns).

05

Release

Open-source weights (MXFP4, GGUF, AWQ). Publish training methodology and CyberBench evaluation harness. Apache 2.0.

Why CISPO, not GRPO

Standard GRPO causes entropy collapse on the GPT-OSS-20B architecture. CISPO resolves this through four modifications: removing response-length normalization (which introduces a length bias), removing advantage standard-deviation normalization (which amplifies noise), using the k1 KL estimator instead of k3 (which prevents variance blow-up), and clipping the importance weights themselves. The result is stable training that converges around steps 200–300 with consistent improvement across all reward signals.
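A toy rendering of that objective is below. It only illustrates the four modifications named above; the token-level details of the real implementation are assumptions.

```python
import numpy as np

def cispo_loss(logp_new, logp_old, rewards, eps=0.2, kl_coef=0.01):
    """logp_*: (n_rollouts, n_tokens) token log-probs; rewards: (n_rollouts,)."""
    # Group-mean baseline only: no division by rewards.std(), which amplifies noise.
    adv = rewards - rewards.mean()
    # Clip the importance weights themselves.
    iw = np.clip(np.exp(logp_new - logp_old), 1 - eps, 1 + eps)
    # Plain sum over tokens: no per-response length normalization (length bias).
    pg = -(iw * adv[:, None]).sum()
    # k1 KL estimator, mean of (logp_new - logp_old), instead of k3.
    kl = (logp_new - logp_old).mean()
    return pg + kl_coef * kl
```

With identical old and new policies the advantages cancel and the loss is zero, as expected for an on-policy update with a group-mean baseline.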

Reward Design

Multi-signal reward function

Five reward signals shape mavn-1's behavior. Their weights anneal across the four curriculum phases — process quality dominates early training while self-correction increases as tasks grow harder.

Task Completion (weight: primary)

Flag captured, vulnerability confirmed, exploit succeeded. Recall-weighted F1 scoring with 8:1 → 2:1 annealing.

Process Quality (weight: 0.30 → 0.05)

Recon before exploitation (+0.2), correct tool selection (+0.15), RAG grounding before claims (+0.15), evidence documentation (+0.1), trying alternatives after failure (+0.15).

Efficiency (weight: 0.10 → 0.15)

Penalizes redundant tool calls (-0.15 each), a nonlinear length penalty, and turn-count penalties scaled by task complexity.

Safety & Scope (weight: 0.20 → 0.15)

Out-of-scope actions (-0.5), destructive actions (-0.3). Hard termination (reward = 0) for attacking outside scope or disabling logging.

Self-Correction (weight: 0.10 → 0.20)

Error detection and strategy change (+0.2), recognizing a wrong path (+0.15), acknowledging uncertainty (+0.1).
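One hedged sketch of how these signals could combine is shown below. The endpoint weights come from the values quoted above; the linear anneal across phases and the assumption that task completion receives the remaining weight mass are illustrative, not the published recipe.

```python
# Phase-1 -> phase-4 weight endpoints from the reward description above.
# Task completion (the primary signal) is assumed to get the remaining mass.
WEIGHT_SCHEDULE = {
    "process_quality": (0.30, 0.05),
    "efficiency":      (0.10, 0.15),
    "safety":          (0.20, 0.15),
    "self_correction": (0.10, 0.20),
}

def anneal(start, end, phase, n_phases=4):
    """Linear interpolation across the 4 curriculum phases (assumed schedule)."""
    t = (phase - 1) / (n_phases - 1)
    return start + t * (end - start)

def reward(signals, phase, hard_violation=False):
    """signals: per-signal scores in [0, 1]; hard violations zero the reward."""
    if hard_violation:                       # e.g. attacking outside scope
        return 0.0
    w = {k: anneal(lo, hi, phase) for k, (lo, hi) in WEIGHT_SCHEDULE.items()}
    w["task_completion"] = 1.0 - sum(w.values())
    return sum(w[k] * signals.get(k, 0.0) for k in w)
```

Under this scheme a perfect episode scores 1.0 in every phase, while the relative pull of each signal shifts as the curriculum advances.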
Evaluation

CyberBench

We evaluate on CyberBench, a suite of 750 held-out tasks across all 15 domains, stratified by complexity (C1–C5). Sources include CTF challenges (150), CVE-based tasks (100), detection rule writing (75), log analysis (75), code review (75), compliance (50), OSINT (50), cloud security (50), and negative/impossible examples (75). 60% automated verification, 40% LLM judge.

Post-RL Go / No-Go Targets
Metric                        Target
Task completion (overall)     > 65%
Task completion (C1–C3)       > 80%
Finding accuracy              > 85%
Finding recall                > 75%
Tool call validity            > 95%
RAG grounding rate            > 90%
Self-correction rate          > 40%
Scope compliance              100%
RL improvement over SFT       > +15 pts
Baselines

We compare against raw GPT-OSS-20B (zero-shot), each previous training checkpoint, and Claude Sonnet 4 as a frontier general-purpose baseline. Evaluation runs in three modes: simulated (fast iteration), sandbox (real tool execution in Firecracker microVMs), and ablation (diagnostic, isolating individual reward signals).

Limitations & Future Work

What mavn-1 does not do

mavn-1 is a cybersecurity operator, not a general-purpose assistant. It excels at structured security tasks — offensive and defensive — but is not designed for open-ended conversation, creative writing, or tasks outside the security domain. This is a deliberate scope constraint.

Agentic evaluation remains an open problem. CyberBench measures task completion on standardized scenarios, but real-world security workflows are messier, more ambiguous, and more adversarial than any benchmark captures. V1 may ship with 10–12 fully evaluated domains rather than all 15. We welcome collaboration on evaluation methodology.

Open Source

Weights

We release the mavn-1 weights under a permissive Apache 2.0 license. We also release the full training codebase, the synthetic data generation pipeline, and the CyberBench evaluation harness to support reproducibility and further research.

mavn-1 — model weights: MXFP4 (12.8 GB), GGUF Q4_K_M (8 GB), AWQ 4-bit. Apache 2.0.
cyberbench — evaluation harness: 750 tasks, 15 domains, C1–C5 complexity, simulated + sandbox modes. MIT.
Collaborate

Work with us

We work with security teams, red teamers, and research labs building on mavn-1 or contributing to the training pipeline. If you are deploying agentic security workflows or developing evaluation methodology for security AI, we want to hear from you.