
In Verizon's 2024 Data Breach Investigations Report, the median time for an organization to detect a breach is still measured in days to weeks, while the attacker needs only hours to reach their objective. That gap, between when an intruder lands and when anyone notices, is the single thing a red team assessment is built to measure. It is not a hunt for vulnerabilities. It is a test of whether you would see a capable adversary moving through your network in time to stop them.
This guide walks through what a red team assessment actually involves: a real attack narrative from phishing to objective, the detection gaps it surfaces, the metrics that score it, and where it differs from a penetration test. If you have a SOC, EDR, and an incident-response process you have never tested under realistic pressure, this is the assessment built for that.
A red team assessment is an objective-driven exercise where testers emulate the tactics, techniques, and procedures (TTPs) of a real threat actor to reach a specific goal without being detected. Instead of enumerating flaws across a scope, the team picks a target outcome agreed in scoping and works toward it across whatever vectors are in play: external infrastructure, phishing, physical access, or a supplied foothold.
The objective is written down before anything starts, as a flag the white team can verify. Defining it precisely is what keeps the engagement honest and safe:
OBJECTIVE      Demonstrate ability to initiate a wire transfer
               from the treasury application (TREASURY-WEB01).
FLAG           Screenshot of payment-initiation screen
               + contents of \\fin-fs01\treasury\flag.txt
OUT OF SCOPE   Real funds movement, DoS, destruction of
               production data, any action on TREASURY-PROD.
WIN CONDITION  Flag captured, OR red team detected and
               ejected before capture (a win for blue).

That last line matters: getting caught is a result, not a failure. The blue team usually does not know the exercise is happening, or knows only that one may occur in a window, because a red team measures real detection. TTPs are mapped to MITRE ATT&CK so a foothold via phishing becomes T1566, reuse of stolen credentials becomes T1078 (Valid Accounts), host-to-host pivoting becomes T1021 (Remote Services), and credential theft from memory becomes T1003, each a behavior a defender can build a detection around.
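The ATT&CK mapping described above can be sketched as a small lookup that tags each step of an engagement timeline. This is a minimal illustration, not a prescribed tool: the technique IDs are the ones named in the text, while the behavior keys, the `tag_timeline` helper, and the sample steps are hypothetical names introduced here.

```python
# Technique IDs are the ones named in the text; the behavior keys and the
# helper below are hypothetical labels used only for illustration.
ATTACK_MAP = {
    "phishing_foothold": ("T1566", "Phishing"),
    "stolen_credential_reuse": ("T1078", "Valid Accounts"),
    "host_to_host_pivot": ("T1021", "Remote Services"),
    "memory_credential_theft": ("T1003", "OS Credential Dumping"),
}


def tag_timeline(events):
    """Attach an ATT&CK technique to each (step, behavior) pair.

    Behaviors with no entry in ATTACK_MAP are marked UNMAPPED so they
    stand out in the debrief instead of being silently dropped.
    """
    tagged = []
    for step, behavior in events:
        technique_id, technique_name = ATTACK_MAP.get(
            behavior, ("UNMAPPED", "Unknown")
        )
        tagged.append(
            {
                "step": step,
                "behavior": behavior,
                "technique": technique_id,
                "name": technique_name,
            }
        )
    return tagged


if __name__ == "__main__":
    timeline = [(1, "phishing_foothold"), (2, "stolen_credential_reuse")]
    for row in tag_timeline(timeline):
        print(row["step"], row["technique"], row["name"])
```

Keeping the mapping in one place means every row of the final report carries a technique ID a defender can write a detection against.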
The clearest way to understand a red team assessment is to watch one unfold against the treasury objective above. Here is a condensed narrative of how the campaign actually runs.
The team spends the first week on passive OSINT: harvesting employee names from LinkedIn, mapping the external attack surface, and identifying who handles finance. They send a spearphishing lure (T1566.001, Spearphishing Attachment) to three of those staff. One opens it, and a beacon checks in to a Sliver command-and-control server hosted behind a redirector, the channel shaped to look like routine HTTPS traffic. From that foothold the operators run BloodHound to collect Active Directory data, which renders the path to the objective as a graph:
# BloodHound shortest-path query result (abridged)
MATCH p=shortestPath((u:User {name:'JDOE@CORP'})
  -[*1..]->(g:Group {name:'DOMAIN ADMINS@CORP'}))

JDOE       --MemberOf-->   IT-SUPPORT
IT-SUPPORT --GenericAll--> SVC-BACKUP    <- over-privileged service acct
SVC-BACKUP --AdminTo-->    FIN-JUMP01    <- finance jump host
FIN-JUMP01 --HasSession--> treasury operator session

That GenericAll edge on a service account is the whole game. The operators harvest the SVC-BACKUP credential (T1078, Valid Accounts), use it to move laterally (T1021) to FIN-JUMP01, dump credentials from memory there (T1003), and ride an existing treasury-operator session to the payment screen. They never trigger ransomware-style impact (T1486) because the rules of engagement forbid it; they capture the flag file and screenshot the screen.
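To make the graph idea concrete, here is a minimal sketch of the same shortest-path search as a plain breadth-first search over the four edges listed above. It is not how BloodHound itself is implemented (BloodHound runs Cypher against a graph database); the `shortest_path` helper and the `TREASURY-OPERATOR` node label are hypothetical stand-ins for illustration.

```python
from collections import deque

# Edges copied from the abridged path above. "TREASURY-OPERATOR" is a
# hypothetical node label standing in for the treasury operator session.
EDGES = [
    ("JDOE", "MemberOf", "IT-SUPPORT"),
    ("IT-SUPPORT", "GenericAll", "SVC-BACKUP"),
    ("SVC-BACKUP", "AdminTo", "FIN-JUMP01"),
    ("FIN-JUMP01", "HasSession", "TREASURY-OPERATOR"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the list of edges walked, or None."""
    graph = {}
    for src, rel, dst in edges:
        graph.setdefault(src, []).append((rel, dst))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, dst in graph.get(node, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(node, rel, dst)]))
    return None


if __name__ == "__main__":
    for src, rel, dst in shortest_path(EDGES, "JDOE", "TREASURY-OPERATOR"):
        print(f"{src} --{rel}--> {dst}")

    # Removing the GenericAll edge breaks the only route to the objective.
    pruned = [edge for edge in EDGES if edge[1] != "GenericAll"]
    print(shortest_path(pruned, "JDOE", "TREASURY-OPERATOR"))
```

The second call illustrates why that single edge matters: with the over-privileged relationship removed, this toy graph has no path left from the phished user to the objective.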
The result that matters is not the capture. It is the silence. Several steps should have fired an alert and did not.
The deliverable that earns a red team its fee is the gap analysis: a line-by-line account of what the attacker did, the ATT&CK technique behind it, and the detection that should have fired. For the treasury campaign above, the core of that table looked like this (rendered as a visual below). Each missed row is a concrete piece of detection-engineering work, not a vague recommendation.
The pattern is almost always the same. Perimeter and email controls catch the loud, well-known stuff. The interior, the lateral movement, the credential reuse, the service-account abuse, is where coverage collapses, and that is exactly where a real breach turns into a headline. A clean win for the defenders is not zero compromise; it is detecting the campaign at credential dumping (T1003) and containing it before the operators reach the finance zone. Mapping each gap to a technique ID lets you show coverage moving from red to green over successive engagements rather than guessing whether you improved.
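One way to track coverage moving from red to green is to keep a per-technique detection map for each engagement and diff them. The sketch below assumes that simple data shape; the `coverage_change` helper is hypothetical, and the sample values are placeholders chosen for illustration, not findings from the treasury campaign.

```python
def coverage_change(previous, current):
    """Diff two {technique_id: detected} maps from successive engagements.

    A technique absent from the previous map is treated as previously
    undetected, which is an assumption of this sketch.
    """
    newly_detected = sorted(
        t for t, hit in current.items() if hit and not previous.get(t, False)
    )
    regressed = sorted(
        t for t, hit in current.items() if not hit and previous.get(t, False)
    )
    still_missed = sorted(
        t for t, hit in current.items() if not hit and not previous.get(t, False)
    )
    return {
        "newly_detected": newly_detected,
        "regressed": regressed,
        "still_missed": still_missed,
    }


if __name__ == "__main__":
    # Placeholder data: True means the blue team detected the technique.
    first_engagement = {"T1566": True, "T1078": False, "T1021": False, "T1003": False}
    second_engagement = {"T1566": True, "T1078": True, "T1021": False, "T1003": True}
    print(coverage_change(first_engagement, second_engagement))
```

Reporting regressions separately is a deliberate choice: a detection that worked last time and failed this time is usually a more urgent finding than one that never existed.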
A penetration test is coverage-based and asks 'what vulnerabilities exist in this scope?', while a red team assessment is goal-based and asks 'can a real adversary reach this objective without us noticing?'. A pentest wants breadth across a defined target; a red team wants depth toward one outcome, and treats getting caught as a finding in itself.
That difference cascades through everything else. Scope: a pentest has a tight agreed list (these IPs, this app); a red team has a broad scope and narrow objective spanning network, social engineering, and physical vectors. Stealth: a pentester works loudly and efficiently and the blue team usually knows; a red team prioritizes evasion because detection is what they are testing. Duration: pentests run days to a couple of weeks, red team engagements run weeks to months to mirror a patient attacker. Output: a pentest delivers a ranked vulnerability list; a red team delivers an attack narrative, a detection-and-response gap analysis, and a timeline of what fired and what did not.
If you are still deciding which fits, our guide to the types of penetration testing covers where each sits, and the penetration testing overview sets the baseline a red team builds on. A common sequencing mistake: buying a red team before you have any detection to test. If the SOC cannot see anything, the team simply walks to the objective and the report tells you what you already knew.
You score a red team by detection and response, not by whether the flag was captured. Three numbers carry the verdict, and each has a concrete formula:
A team that captures the objective in eight hours but is detected at hour two has given you a better outcome than one caught only in the final debrief, because the metrics, not the flag, are the product. The number that compounds is the conversion rate afterward: how many missed techniques became durable detections within 30 days. Track all four across engagements and you get a trend line for real-world resilience instead of a one-off war story.
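As a rough sketch of how such a scorecard could be computed, the example below assumes the three headline numbers are time to detect, time to contain, and detection coverage, which are common choices but an assumption here rather than the article's own formulas. The 30-day conversion rate follows the description above. All function names, field names, and sample values are hypothetical.

```python
from datetime import datetime


def score_engagement(foothold_at, first_detection_at, contained_at,
                     techniques_used, techniques_detected, detections_shipped_30d):
    """Return a small scorecard for one engagement.

    Assumed definitions (not taken from the article):
      time_to_detect  = first detection - initial foothold
      time_to_contain = containment - first detection
      coverage        = techniques detected / techniques used
      conversion_rate = detections shipped within 30 days / techniques missed
    """
    detected = first_detection_at is not None
    time_to_detect = first_detection_at - foothold_at if detected else None
    time_to_contain = (
        contained_at - first_detection_at
        if detected and contained_at is not None
        else None
    )
    missed = techniques_used - techniques_detected
    return {
        "time_to_detect": time_to_detect,
        "time_to_contain": time_to_contain,
        "coverage": techniques_detected / techniques_used if techniques_used else None,
        "conversion_rate": detections_shipped_30d / missed if missed else None,
    }


if __name__ == "__main__":
    # Placeholder values: detection two hours after the foothold.
    card = score_engagement(
        foothold_at=datetime(2024, 1, 8, 9, 0),
        first_detection_at=datetime(2024, 1, 8, 11, 0),
        contained_at=datetime(2024, 1, 8, 15, 0),
        techniques_used=4,
        techniques_detected=1,
        detections_shipped_30d=2,
    )
    for name, value in card.items():
        print(name, value)
```

Returning `None` rather than zero when nothing was detected keeps an undetected campaign from being read as an instant detection.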
Threat-led penetration testing (TLPT) is a regulated form of red teaming that uses real cyber threat intelligence to shape the scenario, so the simulated attack mirrors the actors most likely to target your organization. Instead of a generic adversary, a threat-intelligence provider profiles relevant groups, and the red team emulates their specific TTPs against live production systems.
The best-known frameworks are TIBER-EU (the European Central Bank's model, now reinforced by DORA for EU financial entities), the Bank of England's CBEST, and similar programs elsewhere. They share a structure: a threat-intelligence phase, a red team phase against production, and a tightly controlled white team coordinating both sides. These engagements are heavily governed precisely because they hit real systems, and they are usually reserved for systemically important institutions. In our experience the threat-intel phase is also where many programs first learn their actual perimeter is larger than their asset inventory said.
A red team assessment matters because it tests the one thing a vulnerability list cannot: whether your defenders would actually catch and stop an intrusion in progress. You can patch every CVE a pentest finds and still lose to an attacker who phishes a credential, lands a foothold, and moves laterally for weeks because nobody was watching the right telemetry.
The lasting value is the debrief and the blue-team uplift that follows. A good red team hands defenders a timeline of every action mapped to MITRE ATT&CK, showing which techniques were detected, which were missed, and which alerts fired but were ignored. That feeds detection engineering directly: a Sigma-style rule for the BloodHound LDAP storm, an alert on service-account logons from unusual hosts, an LSASS-access detection for credential dumping, tighter segmentation around the finance zone. This is the foundation of purple teaming, where red and blue close each gap on the spot rather than waiting weeks for a report. Run continuously rather than once a year, that loop is where agentic pentesting changes the economics, keeping detections honest between set-piece exercises.
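As one example of that detection-engineering follow-up, the alert on service-account logons from unusual hosts can be sketched as a baseline comparison. This is a simplified illustration under stated assumptions: the event fields (`account`, `source_host`), the `svc-` naming prefix, the baseline contents, and the `unusual_service_logons` helper are all hypothetical, and a production rule would live in the SIEM rather than in a script.

```python
def unusual_service_logons(events, baseline, prefix="svc-"):
    """Flag service-account logons from hosts outside the known baseline.

    events:   iterable of dicts with 'account' and 'source_host' keys
              (hypothetical field names).
    baseline: {account_name_lowercase: set of expected hosts, uppercase}.
    """
    alerts = []
    for event in events:
        account = event["account"].lower()
        host = event["source_host"].upper()
        if not account.startswith(prefix):
            continue  # only service accounts are in scope for this rule
        if host not in baseline.get(account, set()):
            alerts.append(event)
    return alerts


if __name__ == "__main__":
    # Placeholder baseline: the backup account normally logs on from one host.
    baseline = {"svc-backup": {"BACKUP01"}}
    events = [
        {"account": "SVC-BACKUP", "source_host": "backup01"},
        {"account": "svc-backup", "source_host": "FIN-JUMP01"},
        {"account": "jdoe", "source_host": "FIN-JUMP01"},
    ]
    for alert in unusual_service_logons(events, baseline):
        print("ALERT", alert["account"], "from", alert["source_host"])
```

A service account with no baseline entry alerts on every logon in this sketch, which errs toward noise over silence until the baseline is populated.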