
Veracode's 2024 State of Software Security report found that around 80% of applications carry at least one flaw on first scan, and that flaws left unfixed have a half-life measured in months, not days. That backlog does not come from exotic zero-days. It comes from the same handful of sinks getting concatenated, exec'd, and deserialized in pull request after pull request, faster than anyone reads them. Secure code review is how you catch those before they merge.
This is a working guide, not a definition. You will trace a single tainted parameter from an HTTP source all the way to a SQL sink, see three vulnerable-versus-fixed pairs, read a real Semgrep rule and the CodeQL source/sink/sanitizer model, and wire the whole thing into a PR gate developers will actually trust. The OWASP Code Review Guide is the reference standard underneath all of it.
Strip away the tooling and secure code review is one repeatable move: find a dangerous operation (a sink), walk backward through the code to where the data came from (a source), and decide whether anything trustworthy sanitized it on the way. If an attacker-controlled source reaches a sink with no sanitizer between them, you have a finding. That backward walk is faster than reading top to bottom because there are far fewer sinks than inputs.
Sources are anything an attacker influences: query parameters, headers, cookies, JSON bodies, uploaded files, queue messages, and even rows read back from a database that was poisoned earlier. Sinks are where that data does damage: a SQL statement, an OS command, an HTML response, a file path, a deserializer. Sanitizers are the trustworthy things in between: parameterized queries, output encoding, allow-list validation, safe library APIs. Here is the walk made concrete on one request, annotated line by line:
# 1. SOURCE attacker controls the raw value
order_id = request.args.get("order") # ?order=5 OR 1=1
# 2. PROPAGATION value flows untouched into a string
q = "SELECT total FROM orders WHERE id = " + order_id
# 3. (no sanitizer: nothing validates or binds order_id)
# 4. SINK the string is parsed as SQL by the driver
cursor.execute(q) # <- 'OR 1=1' makes the WHERE always trueReading it backward: the sink is line 4, the source is line 1, and the gap where a parameterized query should live is empty. That is the entire judgment call, repeated for every sink in the diff. This is the same mental model that threat modeling with STRIDE applies one layer up at the design, where a missing trust boundary predicts exactly this kind of missing sanitizer.
Most exploitable bugs are not subtle. They are the same three sinks fed raw input, and learning their before/after shapes lets you spot them in seconds. First, SQL built by string concatenation versus a bound parameter. The driver only stops parsing input as code when you bind it.
# VULNERABLE: userId is spliced into the SQL text
userId = request.args.get("id")
query = "SELECT * FROM accounts WHERE id = '" + userId + "'"
cursor.execute(query) # injection: id = ' OR '1'='1
# FIXED: value travels as data, never parsed as SQL
cursor.execute("SELECT * FROM accounts WHERE id = %s", (userId,))Second, command injection. os.system hands a single string to /bin/sh, so any shell metacharacter in the input runs. The fix is to skip the shell entirely and pass an argument list, which removes word-splitting and metacharacter parsing.
# VULNERABLE: "x.png; rm -rf /" runs as two shell commands
filename = request.form["file"]
os.system("convert " + filename + " out.png")
# FIXED: no shell, args passed as a list (no word-splitting)
subprocess.run(["convert", filename, "out.png"], shell=False, check=True)Third, unsafe deserialization. pickle.loads will execute code embedded in a crafted payload, so any pickle crossing a trust boundary is remote code execution waiting to happen. Switch to a data-only format that has no code-execution semantics.
# VULNERABLE: a malicious pickle runs code on load
data = pickle.loads(request.data) # RCE sink
# FIXED: json parses data only, never instantiates objects
data = json.loads(request.data)The path-traversal and request-forgery sinks follow the same pattern of raw input reaching a dangerous API, which is why a finding here usually predicts a finding when you later run SSRF testing against the running system.
Semgrep matches patterns that read like the code itself, and CodeQL runs interprocedural data-flow queries against a database it builds from your code. Use Semgrep for cheap, fast CI gates on known-bad shapes, and CodeQL when you need to prove a tainted path actually connects across several functions before you believe the finding.
A real Semgrep rule for the pickle sink is small enough to read in one breath. It declares the language, severity, the message a reviewer sees, and the pattern to match:
rules:
- id: python-unsafe-pickle-load
languages: [python]
severity: ERROR
message: >
pickle.loads on untrusted data allows remote code execution.
Use json.loads or a schema-validated format instead.
metadata:
cwe: "CWE-502: Deserialization of Untrusted Data"
owasp: "A08:2021 Software and Data Integrity Failures"
patterns:
- pattern-either:
- pattern: pickle.loads(...)
- pattern: pickle.load(...)That rule is syntactic: it fires on the call no matter where the data came from, which is why it can produce false positives on a pickle of trusted local data. CodeQL solves that by asking the harder question, does attacker-controlled data actually reach this sink? You model three things and let its data-flow library connect them:
// CodeQL taint-tracking model (conceptual)
// SOURCE where untrusted data enters
// e.g. a Flask request.args / request.form access
// SINK the dangerous operation
// e.g. an argument to cursor.execute() or os.system()
// SANITIZER a node that makes data safe, stopping the flow
// e.g. an int() cast, an allow-list check, parameter binding
//
// CodeQL reports a result only when a path exists from a
// SOURCE to a SINK with NO SANITIZER on it, tracing across
// function and file boundaries automatically.The practical difference: Semgrep tells you the dangerous call exists; CodeQL tells you it is reachable by an attacker. Reach for the first to block obvious patterns at scale, the second to confirm a high-severity finding before you page someone. Both reflect the broader split between fast automation and careful judgment covered in automated versus manual penetration testing.
Findings only matter if they reach a developer with severity, evidence, and a fix they can paste. Below is the shape of what lands at the end of a review: each row pairs a CVSS-scored issue with the exact line of evidence and the config-level remediation, not advice to "validate input." The compare visual after this section shows a representative excerpt from one such report.
The remediation column is the part teams skip and the part that actually closes the bug. "Use a parameterized query" becomes cursor.execute(sql, params); "avoid the shell" becomes shell=False with an argument list; "do not deserialize untrusted data" becomes json.loads plus a schema check. A finding without that concrete snippet gets reopened next sprint.
SAST is blind to anything that depends on intent rather than syntax: broken authorization, business-logic abuse, and whether a flagged path is reachable at all. A green scan is not proof of safety; it is proof that no pattern matched. That gap is exactly where a human reviewer earns the time.
On a recent review of a fintech payments service, every Semgrep and CodeQL rule passed clean. Reading the transfer endpoint by hand, we found it trusted a fromAccount ID supplied in the request body and never checked it against the authenticated session, so any user could move money out of any account by changing one number. No tool flags that, because the code is syntactically perfect; only a reviewer who understands the trust model sees the missing ownership check. That class of broken object-level authorization overlaps directly with what surfaces during business logic and payment tampering testing on the running app.
The recurring human wins: authorization (the missing ownership check on an object), multi-step logic (a coupon applied twice, a state machine skipped), trust assumptions (a client-set role field believed server-side), and reachability (deciding whether a tool's flagged path can actually be hit). Budget your manual time on anything touching auth, crypto, deserialization, file handling, and money. Let the scanners own the rote sinks.
Run the security review on the diff, in the pull request, before merge, and block only on rules you trust completely. Reviewing the changed lines plus their immediate context is cheaper and more accurate than a quarterly full-repo audit, and the feedback lands while the author still has the change in their head.
A setup that holds up in practice: Semgrep runs on every PR and posts findings as inline comments, with merge blocked only on a curated high-confidence ruleset (the three sinks above, hardcoded secrets, disabled TLS verification). CodeQL runs on a schedule or on the default branch for the heavier interprocedural analysis, where a few minutes of runtime is acceptable. A human reviewer is auto-requested on any diff touching auth, crypto, deserialization, or query building, tool-flagged or not. Keep PRs small so the security-relevant line is not buried in a thousand-line refactor.
The failure mode to avoid is a noisy gate. The moment a scanner blocks merges on findings developers do not believe, they blanket-mute it and you are worse off than with no scanner. Tune the blocking set down, encode your framework's safe wrappers as custom rules so they stop firing, and feed every confirmed bug back as a regression rule. This per-change, continuous posture is the same one behind agentic pentesting, which keeps assessment running against every build instead of once a quarter.