Autonomous Pentesting
Benchmark Report 2026
We ran an autonomous pentest on a live app, then measured it against the field. One public target, independent ground truth, every result backed by 31,400 logged telemetry events and independently validated.
One live target. Independent validation. Full run telemetry.
Results at a glance
01 · What we ran
One target. Independent ground truth.
Over June 10 to 11, 2026, Strobes AI ran a fully autonomous assessment of Fider v0.33.0, a production-grade open-source feedback platform with real authentication, file uploads, webhooks, OAuth, and an admin console. The security firm Doyensec independently assessed the same application and published validated results for two commercial AI security platforms. That shared, third-party reference is what makes the comparison checkable.
Every platform was evaluated against the same target and ground truth, within the same testing window and the same evaluation standard. Findings were independently validated, false positives independently reviewed, and results deduplicated before comparison.
01
A real, public target
Fider is a production-grade, multi-tenant SaaS application, not a CTF box or a deliberately vulnerable training app. It is open source, so anyone can stand up the same instance and reproduce the work.
02
Independent third party
Doyensec assessed the same application and published validated results for two commercial AI platforms. The AI platform figures are Doyensec's, not ours.
03
Every result from telemetry
Costs are tracked per layer: every agent run, tool call, reasoning step, and credit was logged, 31,400 events in total. The figures are computed directly from that trace.
02 · The marquee result
189s to verified admin takeover
A confirmed multi-step attack chain, executed end to end with zero human intervention. The combined six-scanner field confirmed zero exploitable findings on the same target.
OTP endpoint, no rate limit
Recon surfaces an unprotected one-time-password endpoint — the entry into the admin account.
Code brute-forced
With no rate limit in the way, the one-time code is brute-forced and an admin session is captured in 189 seconds.
Session verified
The captured session is confirmed to be a genuine admin session and is replayed.
Pivot to webhooks
The run uses its admin access to pivot into the webhook functionality.
SSRF
Blind SSRF → AWS IMDS. Outbound webhook reaches the cloud metadata service, exposing instance credentials.
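The chain above depends on two missing controls: unlimited guesses against the one-time code, and a webhook that may call link-local addresses such as the metadata service at 169.254.169.254. The following is a minimal defensive sketch in Python, not Fider's actual code. `AttemptLimiter`, `is_safe_webhook_target`, and the thresholds are hypothetical names and values chosen for illustration.

```python
import ipaddress
import time
from collections import defaultdict, deque


class AttemptLimiter:
    """Sliding-window cap on OTP attempts per account (hypothetical helper)."""

    def __init__(self, max_attempts=5, window_seconds=300.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # account -> recent attempt times

    def allow(self, account, now=None):
        """Record an attempt and return True if it is within the limit."""
        now = time.monotonic() if now is None else now
        recent = self._attempts[account]
        # Drop attempts that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_attempts:
            return False
        recent.append(now)
        return True


def is_safe_webhook_target(resolved_ip):
    """Reject webhook destinations that resolve to internal address space.

    Takes the IP address the hostname resolved to, because checking the
    hostname string alone misses DNS names that point at internal hosts.
    """
    ip = ipaddress.ip_address(resolved_ip)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

Either check alone would likely break the chain as described: the limiter stops the brute force at step two, and the address check stops the webhook from reaching the metadata service at step five. A production version would also need to pin the resolved address for the actual request, since a second DNS lookup can return a different answer.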
03 · The comparison
Measured against six scanners and two AI pentesting platforms
All figures are post-validation and deduplicated. The AI platform numbers come from Doyensec's independent study of the same target. Lower false-positive count is better.
| Source | Validated | False positives | Exploitable |
|---|---|---|---|
| Six scanners (deduplicated) | 13 | 14 | 3 |
| AI platform · Aikido | 17 | 4 | — |
| AI platform · XBOW | 26 | 1 | — |
| Strobes AI | 45 | 0 | 37 |
After overlapping findings were deduplicated, the six-scanner field produced 13 validated findings and 14 false positives, and confirmed 3 as exploitable. The two AI platforms flagged probable issues but did not confirm exploitability.
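One way to read the table is as precision: the share of reported findings that survived validation. The sketch below computes it from the table's figures. It assumes that each source reported exactly its validated findings plus its false positives, which the report does not state, and `precision` is a derived metric rather than one the report publishes.

```python
# Validated findings and false positives, copied from the table above.
RESULTS = {
    "Six scanners (deduplicated)": (13, 14),
    "AI platform · Aikido": (17, 4),
    "AI platform · XBOW": (26, 1),
    "Strobes AI": (45, 0),
}


def precision(validated, false_positives):
    """Fraction of reported findings that survived validation."""
    reported = validated + false_positives
    return validated / reported if reported else 0.0


for source, (validated, false_positives) in RESULTS.items():
    print(f"{source}: {precision(validated, false_positives):.1%}")
```

Under that assumption the six-scanner field comes out at about 48%, Aikido at about 81%, XBOW at about 96%, and Strobes AI at 100%.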
04 · The economics
70 to 75% lower cost per validated finding
The figures below are per validated finding. Scanner pricing reflects public list rates; the Strobes figure is measured AI-credit consumption on the same target. The cost bases differ, so read the comparison as directional even though the gap is large.
13 findings · $4,000 / scan
24 findings · $4,000 / scan
45 findings · ~$1.1k credits
Based on public scanner pricing and measured Strobes AI credit consumption on the same target.
Why the gap is structural, not tuning
A signature scanner emits fixed candidates. Strobes AI authenticates, holds session state, chains weaknesses into a single exploit path, reasons about business logic, and routes each result to a confirm-or-discard step: every candidate runs through a separate validation agent that must confirm it before it is recorded. That validation step is the mechanism behind the zero false positives.
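The confirm-or-discard step can be pictured as a gate between candidate generation and the findings list. This is a minimal sketch of the pattern, not Strobes AI's implementation: `Candidate`, `validate_candidates`, the endpoints, and the stand-in validator are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    title: str
    endpoint: str


def validate_candidates(
    candidates: Iterable[Candidate],
    confirm: Callable[[Candidate], bool],
) -> List[Candidate]:
    """Record only the candidates that an independent check confirms."""
    recorded = []
    for candidate in candidates:
        # Confirm-or-discard: an unconfirmed candidate is dropped, never recorded.
        if confirm(candidate):
            recorded.append(candidate)
    return recorded


# Usage with a stand-in validator; a real one would re-run the exploit.
candidates = [
    Candidate("OTP endpoint lacks rate limit", "/hypothetical/otp"),
    Candidate("Reflected XSS (unconfirmed guess)", "/hypothetical/search"),
]
confirmed_endpoints = {"/hypothetical/otp"}
findings = validate_candidates(candidates, lambda c: c.endpoint in confirmed_endpoints)
print([f.title for f in findings])
```

Because nothing reaches the findings list without passing `confirm`, the false-positive rate of the output depends on the validator rather than on how many speculative candidates the generator produces.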
From the benchmark to your stack
From quarterly pentest to continuous validation
If autonomous assessment costs what this benchmark shows, your release cadence becomes the only limit. Talk to us about running this against your own target — network, API, cloud, or source-level review in one workspace.