
Agentic Pentesting with Strobes AI
Ask any pentester what kills an engagement, and it's rarely the technical difficulty. It's the clock. You scope the target, fire up Burp, start crawling, find something interesting on endpoint #3, spend forty minutes confirming it, write it up, then realize you've burned half a day and haven't touched authentication testing yet.
We've all been there. The coverage gaps aren't from lack of skill; they're from the simple fact that one person can only type one command at a time.
That's the problem our Research & Engineering team set out to solve with Strobes AI. Not by replacing pentesters, but by giving them something they've never had: persistent, multi-agent workspaces that can run an entire OWASP WSTG assessment autonomously while you focus on the parts that actually need a human brain.
How Does "Strobes Agentic AI" Make Pentests Better?
The term gets thrown around a lot, so let's be specific. When we say agentic pentesting, we mean AI agents that don't just answer your questions about CVEs or suggest payloads. They actually execute. They run the scan, read the response, decide what to try next, confirm the vulnerability with a working proof-of-concept, and file the finding. No hand-holding required.
The key design decision — and this was deliberate from day one — was to engineer the AI the way an expert pentester thinks. Strobes AI loads default skills and methodologies upfront, breaks the assessment into sub-agent tasks, crawls the target, and operates on what we internally call a "PoC approach": confirm it's exploitable, document the evidence, move on. No wasting cycles on theoretical findings that can't be demonstrated.
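That execute-observe-decide loop can be sketched in a few lines. Everything below is illustrative: `run_probe` is a stand-in for a real HTTP request, and the candidate list, function names, and error heuristic are hypothetical, not the Strobes agent internals.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    endpoint: str
    payload: str
    evidence: str

def run_probe(endpoint: str, payload: str) -> str:
    # Stand-in for sending an HTTP request; returns a fake response body.
    return "SQL syntax error" if "'" in payload else "200 OK"

def agent_loop(candidates: list) -> list:
    findings = []
    for endpoint, payload in candidates:          # execute the test
        response = run_probe(endpoint, payload)   # read the response
        if "error" in response.lower():           # decide: worth confirming?
            # confirm with evidence, file the finding, move on
            findings.append(Finding(endpoint, payload, response))
    return findings

print(agent_loop([("/artists.php", "1'"), ("/login.php", "admin")]))
```

The point of the sketch is the shape of the loop, not the probe logic: the agent acts, reads the result, and only files findings it can back with evidence.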
Here's how Strobes AI is set up to solve the pentest frictions:
- Agents — Specialized AI personas, each built for a specific job. Web pentesting, network pentesting, API testing, code review, login & auth handling. They don't share a single monolithic prompt — each one has targeted capabilities and tooling.
- Workspaces — Think of these as your engagement folder with a set of pentest projects, but persistent and queryable. Every asset, credential, file, shared table, finding, and task lives here. Day 3 of an engagement has full context from day 1.
- Skills — Modular instruction sets following the open SKILL.md standard. They teach agents how to use specific tools or follow specific methodologies. You can write your own or use the built-in library.
- Learnings — Knowledge extracted from prior workspace activity. If the agent mapped your auth flow on Monday, it remembers that on Wednesday without being told again.
- Human in the Loop (HITL) — A governed approval layer. The agents are autonomous, not unsupervised. More on this later.
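To make the Skills concept concrete, here is what a minimal skill file might look like. This is a hypothetical example following the SKILL.md convention of YAML frontmatter plus markdown instructions; it is not one of the built-in skills.

```markdown
---
name: sqli-union-probe
description: Probe query parameters for UNION-based SQL injection and capture evidence.
---

# UNION SQLi Probe

1. Enumerate numeric and string parameters from the crawl output.
2. Determine the column count with incremental `ORDER BY` probes.
3. Inject a `UNION SELECT` marker and confirm it is reflected in the response.
4. Record the working payload and the request/response pair as evidence.
```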


The Real Test: A Full Web App Pentest
To demonstrate what this looks like in practice, our Research & Engineering team pointed Strobes AI at a real-world hosted app — a common benchmark target that makes results comparable across tools and testers.
Here's the important part: the team gave minimal instructions and no hand-crafted prompts. No step-by-step playbooks fed into the chat. They selected the Web App Pentest workflow template and let the agents figure out the rest. The whole point was to test whether the platform could operate the way an expert pentester would — load the right skills, break the assessment into phases, design test cases, execute them, and report findings — without someone babysitting every step.
It ran 32 independent agent tasks, all autonomous, across 21 structured web app pentesting phases pre-loaded from industry-standard methodologies and the security skills built into Strobes.
Workspace overview — all 21 phases completed, 42 findings, 41 files. 6.8 AI credits consumed for the entire engagement.

The Web App Pentest workflow template — this is what gets loaded when you select the workflow. The agents take it from here.
How the Phases Played Out
The workflow maps directly to OWASP WSTG v4.2 categories & the Strobes security knowledge base. Each phase feeds into the next — the output of reconnaissance becomes the input for test case design, which becomes the task list for execution.
| Phase | Description | Time |
|---|---|---|
| Phase 0 — Scope & Auth | Define scope, authenticate & understand target | 12m 48s |
| Phase 1 — Info Gathering | Tech stack analysis & fingerprinting | 5m 4s |
| Phase 2 — Dynamic Crawling | Endpoint discovery & attack surface mapping | 5m 26s |
| Phase 3a — Attack Surface | Endpoint categorization & credential sweep | 5m 4s |
| Phase 3b — WSTG Design | 11 WSTG categories designed in parallel | ~15m |
| Phase 3c — Test Plan Merge | Merge test cases & create workspace tasks | 2s |
| Phases 6–17 — Full WSTG | Config, auth, session, injection, client-side, crypto, API, business logic | Parallel |
| Phases 18–20 — Wrap-up | Finding validation, submission, and the pentest report | 14s total |
Phase 3b is where it gets interesting. The platform spun up 11 concurrent sub-agents, one for each WSTG test category — CONF, IDNT, ATHN, SESS, ATHZ, INPV, CLNT, CRYP, ERRH, BUSL, APIT — all designing test cases in parallel. A human pentester would work through these one at a time. That's not a minor speedup; it's a different operating model.
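The fan-out pattern Phase 3b uses can be sketched with a thread pool: one worker per WSTG category, results merged into a single plan. Here `design_cases` is a toy stand-in for a sub-agent; the real platform orchestrates far richer tasks.

```python
from concurrent.futures import ThreadPoolExecutor

# The 11 WSTG categories from the engagement.
CATEGORIES = ["CONF", "IDNT", "ATHN", "SESS", "ATHZ", "INPV",
              "CLNT", "CRYP", "ERRH", "BUSL", "APIT"]

def design_cases(category: str):
    # A real sub-agent would reason over the attack surface map here;
    # this stub just emits placeholder test-case IDs.
    return category, [f"{category}-{i:02d}" for i in range(1, 4)]

def design_all(categories: list) -> dict:
    # One worker per category: all 11 run concurrently, then merge
    # into a single test plan (the Phase 3c merge step).
    with ThreadPoolExecutor(max_workers=len(categories)) as pool:
        return dict(pool.map(design_cases, categories))

plan = design_all(CATEGORIES)
print(len(plan), plan["INPV"])
```

Sequential execution of the same stubs would take 11x the wall-clock time of the slowest one; the pool collapses that to roughly one.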

Workflow execution — phase progression with individual timings. Every phase supports restart if you need to re-run.

WSTG test execution running across multiple categories simultaneously.

Configuration, authentication, and session management testing phases.

Input validation, client-side testing, and cryptography checks.

Business logic, API testing, and the final reporting phase — 14 seconds to validate, submit, and generate the report.

Inside a task — the agent's actual tool invocations and decision chain during execution.

Structured findings output with WSTG test IDs and evidence.

Vulnerability detail view — request/response pairs, working payloads, severity classification.
42 Vulnerabilities. With Working Payloads.
Not theoretical findings. Not "possible" vulnerabilities flagged by a scanner with a confidence score. Every one of these 42 findings came with a working payload, request/response evidence, an OWASP WSTG test ID, and a severity classification. The "PoC" approach in action.
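One way to picture the PoC approach is a finding record that enforces it at the data level: the record refuses to exist without a working payload and request/response evidence. Field names below are assumptions for illustration, not the Strobes schema.

```python
from dataclasses import dataclass

SEVERITIES = {"Critical", "High", "Medium", "Low"}

@dataclass(frozen=True)
class ConfirmedFinding:
    wstg_id: str      # e.g. "WSTG-INPV-05"
    endpoint: str
    payload: str      # the payload that actually worked
    request: str      # raw request evidence
    response: str     # raw response evidence
    severity: str

    def __post_init__(self):
        # A finding without a payload or evidence is theoretical; reject it.
        if not (self.payload and self.request and self.response):
            raise ValueError("finding lacks working-payload evidence")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
```

Scanner-style "possible" findings simply cannot be represented in this shape, which is the whole point.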
| Severity | Count | Examples |
|---|---|---|
| Critical | 22 | UNION SQLi on /artists.php, /listproducts.php, /categories.php, /cart.php, /userinfo.php, /guestbook.php, /search.php; Error-based SQLi on /AJAX/; Auth Bypass on /login.php; Plaintext Cookie Forgery |
| High | 8 | IDOR on /userinfo.php (Horizontal Privilege Escalation); Stored XSS on /guestbook.php; Path Traversal/LFI on /showimage.php; Admin directory listing exposing /admin/create.sql |
| Medium | 12 | Reflected XSS on /search.php, /artists.php, /listproducts.php; CSRF on guestbook & login forms; Missing HttpOnly/Secure/SameSite cookie flags |
22 criticals is a big number, but keep in mind this is a deliberately vulnerable app. The real takeaway isn't the count — it's that the agents found UNION-based SQLi across 7 different endpoints, each confirmed with extracted data. That's the kind of thoroughness that usually requires a pentester to manually test each parameter individually.
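For intuition, here is the shape of a UNION SQLi confirmation: find the column count with incremental `ORDER BY` probes, inject a marker, and check the response for its reflection. `fake_app` simulates a vulnerable endpoint so the sketch is self-contained and hypothetical; a real run would issue HTTP requests and extract actual data.

```python
from typing import Optional

def fake_app(url: str) -> str:
    # Simulated 3-column vulnerable endpoint that reflects query output.
    if "UNION SELECT" in url and "0x53747232" in url:
        return "<td>Str2</td>"          # decoded marker appears in output
    if "ORDER BY 4" in url:
        return "Unknown column '4'"     # 3 columns, so ORDER BY 4 errors
    return "<td>normal row</td>"

def confirm_union_sqli(endpoint: str) -> Optional[str]:
    # Step 1: find the column count via incremental ORDER BY probes.
    cols = next(n for n in range(1, 10)
                if "Unknown column" in fake_app(f"{endpoint}?id=1 ORDER BY {n}")) - 1
    # Step 2: inject a hex marker ("Str2") into one column and look for it.
    nulls = ",".join(["NULL"] * (cols - 1))
    probe = f"{endpoint}?id=-1 UNION SELECT 0x53747232,{nulls}"
    if "Str2" in fake_app(probe):
        return probe    # confirmed: working payload, reflected data
    return None

print(confirm_union_sqli("/artists.php"))
```

Repeating this per parameter across 7 endpoints is exactly the mechanical grind the agents absorbed.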

The findings list — 42 documented vulnerabilities with severity, WSTG mapping, and status tracking.

Individual finding detail — payload, evidence, remediation guidance, all filed directly into the CTEM pipeline.
What Gets Left Behind (In a Good Way)
A pentest that finds vulnerabilities but doesn't produce clean evidence is only half done. Here's what the workspace contained when the agents finished:
- 41 files organized across /access, /auth, /discovery, /docs, /phase2-crawl, /scope, and /test_cases — with markdown summaries for each phase
- 4 shared tables: auth_tokens, Pentest Findings, Auth Testing, Attack Surface — Endpoints — all queryable by any agent in the workspace
- 2 Learnings the platform extracted automatically: "Authentication Flow" (form-based login, no MFA, no CSRF protection) and "Attack Surface Map" (19 endpoints across 11 WSTG categories, with top SQLi candidates flagged)
- A full pentest report generated in the Dashboard INSIGHT widget — executive summary, metrics, remediation guidance. The kind of deliverable that usually takes a day to write after the engagement ends.
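The shared tables behave like ordinary queryable relations that every agent can read and write. A minimal sqlite sketch of the idea; the column names here are assumptions for illustration, not the actual auth_tokens schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE auth_tokens (
    session_name TEXT, cookie TEXT, obtained_in_phase TEXT)""")
db.execute("INSERT INTO auth_tokens VALUES (?, ?, ?)",
           ("test_user", "login=test%2Ftest", "Phase 0"))
db.commit()

# Any agent picking up a later task can query the same table instead of
# re-authenticating or being re-briefed on day 3.
row = db.execute(
    "SELECT cookie FROM auth_tokens WHERE session_name = ?",
    ("test_user",)).fetchone()
print(row[0])
```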

Workspace file tree — 41 evidence files organized into structured folders.

Shared tables — structured data that persists across the engagement and is accessible to all agents.

The auto-generated pentest report in the Dashboard INSIGHT widget.
Why This Matters (Beyond the Demo)
It's easy to be impressed by a demo against a deliberately vulnerable app. We get that. But the architecture underneath is what matters for real-world use.
Minimal instructions, maximum coverage. The Strobes team didn't write a 500-line prompt. They selected a workflow template and the platform did what an expert pentester would do: loaded the right skills, broke the work into phases, designed test cases at the sub-agent level, and executed. The "PoC" philosophy means the agents don't waste cycles on theoretical findings; they confirm exploitability and move on.
Parallel execution changes the math. When 11 WSTG categories get tested simultaneously instead of sequentially, you're not saving 10% of the time. You're compressing what would be a multi-day manual effort into minutes. And unlike a human context-switching between test categories, each sub-agent has full focus on its specific domain.
Persistent memory across the engagement. Every crawled endpoint, every tested parameter, and every credential attempt is stored in shared tables and learnings. When a sub-agent picks up a new task on day 3, it has the full context from day 1 without anyone re-briefing it.
The evidence package writes itself. No more spending the day after an engagement assembling screenshots and writing the report. The workspace contains organized files, structured findings, and a generated report by the time the agents mark the engagement complete.
Knowledge-based & skill-based testing. Out of the box, Strobes AI runs on defaults tuned to meet industry pentest standards. At the same time, it is customisable: you can add agent skills, knowledge-base entries, and learnings over the course of an engagement to get the most out of each one.
Where This Is Going
What we showed here is a sample web app pentest. The same architecture powers network pentesting, API testing, cloud security reviews, and code analysis — all within the same workspace model, all with the same PoC-first, evidence-driven approach built on inherited pentester skills.
For pentesters: this doesn't replace what you do. It removes the ceiling on how much of your expertise you can apply at once. You're still the one who decides what's in scope, reviews edge cases, and makes the judgment calls that require experience. The agents handle the breadth. You handle the depth.
For security teams: continuous offensive validation stops being something you budget for quarterly and starts being something that runs alongside your CI/CD pipeline. Same methodology, same rigor, without the scheduling bottleneck.
Want to go deeper on the architecture powering this? Read how we built the AI harness for offensive security — the orchestration, tooling, and validation layers that make agentic pentesting reliable at production scale. And if you're curious how the crawling phase works in detail, check out why crawling is the hardest part of AI-powered pen testing.
Based on live workspace data from the Strobes AI exposure management platform. Engagement led by Prakash Ashok and the Agentic Security Engineering team.