Back to Blog
Agentic Pentesting with Strobes AI — 32 tasks, 21 WSTG phases, 42 confirmed vulnerabilities, fully autonomous

Agentic Pentesting with Strobes AI

Prakash Ashok · March 25, 2026 · 9 min read

Ask any pentester what kills an engagement, and it's rarely the technical difficulty. It's the clock. You scope the target, fire up Burp, start crawling, find something interesting on endpoint #3, spend forty minutes confirming it, write it up, then realize you've burned half a day and haven't touched authentication testing yet.

We've all been there. The coverage gaps aren't from lack of skill; they're from the simple fact that one person can only type one command at a time.

That's the problem our Research & Engineering team set out to solve with Strobes AI. Not by replacing pentesters, but by giving them something they've never had: persistent, multi-agent workspaces that can run an entire OWASP WSTG assessment autonomously while you focus on the parts that actually need a human brain.

How Does Strobes Agentic AI Make Pentests Better?

The term gets thrown around a lot, so let's be specific. When we say agentic pentesting, we mean AI agents that don't just answer your questions about CVEs or suggest payloads. They actually execute. They run the scan, read the response, decide what to try next, confirm the vulnerability with a working proof-of-concept, and file the finding. No hand-holding required.

The key design decision — and this was deliberate from day one — was to engineer the AI the way an expert pentester thinks. Strobes AI loads default skills and methodologies upfront, breaks the assessment into sub-agent tasks, crawls the target, and operates on what we internally call a "PoC approach": confirm it's exploitable, document the evidence, move on. No wasting cycles on theoretical findings that can't be demonstrated.

Here's how Strobes AI is set up to solve the pentest frictions:

  • Agents — Specialized AI personas, each built for a specific job. Web pentesting, network pentesting, API testing, code review, login & auth handling. They don't share a single monolithic prompt — each one has targeted capabilities and tooling.
  • Workspaces — Think of these as your engagement folder with a set of pentest projects, but persistent and queryable. Every asset, credential, file, shared table, finding, and task lives here. Day 3 of an engagement has full context from day 1.
  • Skills — Modular instruction sets following the open SKILL.md standard. They teach agents how to use specific tools or follow specific methodologies. You can write your own or use the built-in library.
  • Learnings — Knowledge extracted from prior workspace activity. If the agent mapped your auth flow on Monday, it remembers that on Wednesday without being told again.
  • Human in the Loop (HITL) — A governed approval layer. The agents are autonomous, not unsupervised. More on this later.
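To make the Skills concept concrete, here is a purely hypothetical sketch of what a skill file could look like. The blog only tells us that skills follow the open SKILL.md pattern of front-matter metadata plus plain-language instructions; the name, fields, and tool identifiers below are invented for illustration, not the actual Strobes schema.

```markdown
---
name: union-sqli-confirmation
description: Confirm UNION-based SQL injection on a suspected parameter
tools: [http_client, sqlmap]
---

# UNION-based SQLi confirmation

1. Capture a baseline response for the target parameter.
2. Probe the column count with incremental `ORDER BY` payloads.
3. Attempt a `UNION SELECT` that injects a unique marker string.
4. If the marker is reflected, extract one row as evidence, file the
   finding, and stop -- no theoretical write-ups without a working PoC.
```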
Strobes AI overview screen — SMART mode enabled, HITL toggle
The chat interface where it all starts. Quick actions surface common workflows, so you're not typing prompts from scratch.

The Real Test: A Full Web App Pentest

To demonstrate what this looks like in practice, the security research & engineering team pointed Strobes AI at a real-world hosted app — a common benchmark target that makes results comparable across tools and testers.

Here's the important part: the team gave minimal instructions and no hand-crafted prompts. No step-by-step playbooks fed into the chat. They selected the Web App Pentest workflow template and let the agents figure out the rest. The whole point was to test whether the platform could operate the way an expert pentester would — load the right skills, break the assessment into phases, design test cases, execute them, and report findings — without someone babysitting every step.

It ran 32 independent agent tasks across 21 structured web app pentesting phases, pre-loaded according to industry standards and the security skills built into Strobes, all fully autonomous.

  • 32 — Tasks Completed
  • 21 — Workflow Phases
  • 42 — Vulnerabilities
  • 41 — Evidence Files
  • 4 — Shared Tables

Workspace overview — all 21 phases completed, 42 findings, 41 files. 6.8 AI credits consumed for the entire engagement.


The Web App Pentest workflow template — this is what gets loaded when you select the workflow. The agents take it from here.

How the Phases Played Out

The workflow maps directly to OWASP WSTG v4.2 categories & the Strobes security knowledge base. Each phase feeds into the next — the output of reconnaissance becomes the input for test case design, which becomes the task list for execution.
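The hand-off described above can be sketched in a few lines. This is an illustrative model, not Strobes code: the function names and data shapes are invented, and the real platform obviously does far more per phase. The point is the data flow, where each phase's output is the next phase's input.

```python
# Hypothetical sketch of the phase hand-off: recon output feeds test-case
# design, and designed test cases become the executable task list.

def recon(target: str) -> dict:
    # Stand-in for crawling/fingerprinting; returns discovered endpoints.
    return {"target": target, "endpoints": ["/login.php", "/search.php"]}

def design_tests(recon_out: dict) -> list[dict]:
    # One test case per endpoint per relevant WSTG category (simplified).
    return [{"endpoint": e, "category": "INPV"} for e in recon_out["endpoints"]]

def execute(test_cases: list[dict]) -> list[dict]:
    # Stand-in for execution; real agents attach evidence and status here.
    return [{"task": tc, "status": "done"} for tc in test_cases]

def run_engagement(target: str) -> list[dict]:
    # The pipeline: each phase consumes the previous phase's output.
    return execute(design_tests(recon(target)))

results = run_engagement("http://example.test")
```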

| Phase | Description | Time |
| --- | --- | --- |
| Phase 0 — Scope & Auth | Define scope, authenticate & understand target | 12m 48s |
| Phase 1 — Info Gathering | Tech stack analysis & fingerprinting | 5m 4s |
| Phase 2 — Dynamic Crawling | Endpoint discovery & attack surface mapping | 5m 26s |
| Phase 3a — Attack Surface | Endpoint categorization & credential sweep | 5m 4s |
| Phase 3b — WSTG Design | 11 WSTG categories designed in parallel | ~15 min |
| Phase 3c — Test Plan Merge | Merge test cases & create workspace tasks | 2s |
| Phases 6–17 — Full WSTG | Config, auth, session, injection, client-side, crypto, API, business logic | Parallel |
| Phases 18–20 — Wrap-up | Finding validation, submission, and the pentest report | 14s total |

Phase 3b is where it gets interesting. The platform spun up 11 concurrent sub-agents, one for each WSTG test category — CONF, IDNT, ATHN, SESS, ATHZ, INPV, CLNT, CRYP, ERRH, BUSL, APIT — all designing test cases in parallel. A human pentester would work through these one at a time. That's not a minor speedup; it's a different operating model.
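The fan-out pattern is easy to picture with a thread pool. This is a minimal sketch under stated assumptions: `design_category` stands in for a sub-agent, and the real platform runs full agent processes rather than threads, but the shape of the parallelism is the same.

```python
from concurrent.futures import ThreadPoolExecutor

# The 11 WSTG categories from Phase 3b, designed concurrently.
WSTG_CATEGORIES = ["CONF", "IDNT", "ATHN", "SESS", "ATHZ", "INPV",
                   "CLNT", "CRYP", "ERRH", "BUSL", "APIT"]

def design_category(category: str) -> dict:
    # Stand-in for a sub-agent designing test cases for one category.
    return {"category": category,
            "test_cases": [f"{category}-01", f"{category}-02"]}

# Fan out: one worker per category, all designing in parallel.
with ThreadPoolExecutor(max_workers=len(WSTG_CATEGORIES)) as pool:
    plans = list(pool.map(design_category, WSTG_CATEGORIES))

# Phase 3c: merge the parallel outputs into one workspace task list.
merged = [tc for plan in plans for tc in plan["test_cases"]]
```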


Workflow execution — phase progression with individual timings. Every phase supports restart if you need to re-run.


WSTG test execution running across multiple categories simultaneously.


Configuration, authentication, and session management testing phases.


Input validation, client-side testing, and cryptography checks.


Business logic, API testing, and the final reporting phase — 14 seconds to validate, submit, and generate the report.


Inside a task — the agent's actual tool invocations and decision chain during execution.


Structured findings output with WSTG test IDs and evidence.


Vulnerability detail view — request/response pairs, working payloads, severity classification.

42 Vulnerabilities. With Working Payloads.

Not theoretical findings. Not "possible" vulnerabilities flagged by a scanner with a confidence score. Every one of these 42 findings came with a working payload, request/response evidence, an OWASP WSTG test ID, and a severity classification. The "PoC" approach in action.
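The platform's internals aren't public, so as a minimal illustration only: the PoC-first rule amounts to gating every finding on demonstrated behavior rather than inference. For a UNION-based SQLi, that could be a differential check like the sketch below (the marker string and function are hypothetical).

```python
def confirm_union_sqli(baseline_body: str, injected_body: str,
                       marker: str) -> bool:
    """File a finding only if the injected marker actually appears in the
    response AND was absent from the baseline -- demonstrable, not inferred."""
    return marker in injected_body and marker not in baseline_body

# Simulated responses (no live target needed for the sketch):
baseline = "<td>Artist: r4w8173</td>"
injected = "<td>Artist: M4RK3R_8841</td>"  # UNION SELECT reflected the marker

confirmed = confirm_union_sqli(baseline, injected, "M4RK3R_8841")
```

A scanner's confidence score answers "does this look injectable?"; the differential check answers "did the injected query actually execute?", which is the distinction the paragraph above draws.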

| Severity | Count | Examples |
| --- | --- | --- |
| Critical | 22 | UNION SQLi on /artists.php, /listproducts.php, /categories.php, /cart.php, /userinfo.php, /guestbook.php, /search.php; Error-based SQLi on /AJAX/; Auth Bypass on /login.php; Plaintext Cookie Forgery |
| High | 8 | IDOR on /userinfo.php (Horizontal Privilege Escalation); Stored XSS on /guestbook.php; Path Traversal/LFI on /showimage.php; Admin directory listing exposing /admin/create.sql |
| Medium | 12 | Reflected XSS on /search.php, /artists.php, /listproducts.php; CSRF on guestbook & login forms; Missing HttpOnly/Secure/SameSite cookie flags |

22 criticals is a big number, but keep in mind this is a deliberately vulnerable app. The real takeaway isn't the count — it's that the agents found UNION-based SQLi across 7 different endpoints, each confirmed with extracted data. That's the kind of thoroughness that usually requires a pentester to manually test each parameter individually.


The findings list — 42 documented vulnerabilities with severity, WSTG mapping, and status tracking.


Individual finding detail — payload, evidence, remediation guidance, all filed directly into the CTEM pipeline.

What Gets Left Behind (In a Good Way)

A pentest that finds vulnerabilities but doesn't produce clean evidence is only half done. Here's what the workspace contained when the agents finished:

  • 41 files organized across /access, /auth, /discovery, /docs, /phase2-crawl, /scope, and /test_cases — with markdown summaries for each phase
  • 4 shared tables: auth_tokens, Pentest Findings, Auth Testing, Attack Surface — Endpoints — all queryable by any agent in the workspace
  • 2 Learnings the platform extracted automatically: "Authentication Flow" (form-based login, no MFA, no CSRF protection) and "Attack Surface Map" (19 endpoints across 11 WSTG categories, with top SQLi candidates flagged)
  • A full pentest report generated in the Dashboard INSIGHT widget — executive summary, metrics, remediation guidance. The kind of deliverable that usually takes a day to write after the engagement ends.

Workspace file tree — 41 evidence files organized into structured folders.


Shared tables — structured data that persists across the engagement and is accessible to all agents.


The auto-generated pentest report in the Dashboard INSIGHT widget.

Why This Matters (Beyond the Demo)

It's easy to be impressed by a demo against a deliberately vulnerable app. We get that. But the architecture underneath is what matters for real-world use.

Minimal instructions, maximum coverage. The Strobes team didn't write a 500-line prompt. They selected a workflow template and the platform did what an expert pentester would do: loaded the right skills, broke the work into phases, designed test cases at the sub-agent level, and executed. The "PoC" philosophy means the agents don't waste cycles on theoretical findings; they confirm exploitability and move on.

Parallel execution changes the math. When 11 WSTG categories get tested simultaneously instead of sequentially, you're not saving 10% of the time. You're compressing what would be a multi-day manual effort into minutes. And unlike a human context-switching between test categories, each sub-agent has full focus on its specific domain.

Persistent memory across the engagement. Every crawled endpoint, every tested parameter, and every credential attempt is stored in shared tables and learnings. When a sub-agent picks up a new task on day 3, it has the full context from day 1 without anyone re-briefing it.
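A toy model of the shared-table idea, using SQLite purely as an illustration (the table name and columns are invented; Strobes's actual storage layer is not documented here): any agent writes a crawled endpoint once, and any later task queries it without re-briefing.

```python
import sqlite3

# Hypothetical workspace shared table: persistent, queryable by any agent.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE attack_surface (
    endpoint TEXT PRIMARY KEY,
    category TEXT,
    tested   INTEGER DEFAULT 0)""")

# Day 1: the crawler agent records what it found.
db.executemany(
    "INSERT INTO attack_surface (endpoint, category) VALUES (?, ?)",
    [("/userinfo.php", "ATHZ"), ("/search.php", "INPV")])

# Day 3: an execution agent picks up only the untested work.
todo = db.execute(
    "SELECT endpoint FROM attack_surface WHERE tested = 0").fetchall()
```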

The evidence package writes itself. No more spending the day after an engagement assembling screenshots and writing the report. The workspace contains organized files, structured findings, and a generated report by the time the agents mark the engagement complete.

Knowledge-based & skill-based testing. Strobes AI is engineered to run efficiently by default, meeting industry pentest standards out of the box, while staying customizable: you can add agent skills, knowledge bases, and learnings over the course of an engagement to push efficiency further.

Where This Is Going

What we showed here is a sample web app pentest. The same architecture powers network pentesting, API testing, cloud security reviews, and code analysis, all within the same workspace model, and all with the same PoC-first, evidence-driven approach built on inherited pentester skills.

For pentesters: this doesn't replace what you do. It removes the ceiling on how much of your expertise you can apply at once. You're still the one who decides what's in scope, reviews edge cases, and makes the judgment calls that require experience. The agents handle the breadth. You handle the depth.

For security teams: continuous offensive validation stops being something you budget for quarterly and starts being something that runs alongside your CI/CD pipeline. Same methodology, same rigor, without the scheduling bottleneck.

Want to go deeper on the architecture powering this? Read how we built the AI harness for offensive security — the orchestration, tooling, and validation layers that make agentic pentesting reliable at production scale. And if you're curious how the crawling phase works in detail, check out why crawling is the hardest part of AI-powered pen testing.

Based on live workspace data from the Strobes AI exposure management platform. Engagement led by Prakash Ashok and the Agentic Security Engineering team.