Regex in Practice: Anchors, Quantifiers, and Capture Groups
Summary (TL;DR)
On 2 July 2019, a single regex inside Cloudflare’s WAF, one containing the fragment .*(?:.*=.*), cut global traffic by as much as 82%, by Cloudflare’s own account, during an outage of about 27 minutes. PCRE’s backtracking pinned CPUs at 100% across the network, and every request stalled behind it. The most honest lesson regex teaches is that it hides its own cost. Most days it is small, fast, and seductive; one bad input can stop a system. That is why this post is mildly skeptical in tone.
Most everyday regex work rests on a handful of pieces: anchors that tie a pattern to the start, the end, or a word boundary; character classes that describe “any of these characters”; quantifiers that say “how many”; and groups that let you capture, reference, or alternate. Get those right and the typical problems (validating a rough email, picking fields out of a log line, normalizing a phone number) become short and readable. Get them wrong and you end up with patterns that match too much, too little, or bring the engine to a halt as in the Cloudflare case.
Do not parse HTML, JSON, or XML with regex: regular languages cannot describe balanced nesting. Engines also differ: ECMAScript regex, PCRE (Perl-compatible, used by PHP), Python’s re, Oniguruma (Rust’s onig crate; Ruby uses its fork, Onigmo), and Go’s RE2-style regexp package each have subtle differences. The Cloudflare incident depended on a backtracking engine; an RE2-style engine would have matched the same input in linear time.
Background
A regular expression is a compact grammar for matching strings. The pieces you reach for most often are worth naming explicitly.
Anchors do not consume characters — they assert a position. ^ is the start of the input (or start of a line in multiline mode), $ is the end, and \b is a word boundary (the transition between \w and non-\w). Without anchors, abc matches anywhere inside a longer string. With them, ^abc$ matches only the literal whole-string abc, and \babc\b matches abc as a standalone word.
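As a small illustration of the three anchors, here is a sketch using Python’s re module (the sample string is made up for the example):

```python
import re

text = "abcdef abc"

# Unanchored: search finds "abc" anywhere, here at the very start.
unanchored = re.search(r"abc", text)

# Whole-string anchors: only the exact string "abc" passes.
whole_ok = re.search(r"^abc$", "abc")
whole_fail = re.search(r"^abc$", text)

# Word boundaries: the "abc" inside "abcdef" is skipped,
# the standalone "abc" at the end is found.
word = re.search(r"\babc\b", text)

print(unanchored.start())    # 0
print(whole_ok is not None)  # True
print(whole_fail)            # None
print(word.start())          # 7
```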
Character classes describe a single character from a set. [abc] is a, b, or c; [a-z] is any lowercase letter; [^0-9] is “anything but a digit.” Shorthand classes cover the common cases: \d (digits), \w (word characters — letters, digits, underscore), \s (whitespace). The negations \D, \W, \S are the inverses. . matches any character except a newline by default; the s (dotall) flag changes that.
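The classes above can be tried out quickly; the following sketch assumes Python’s re, where the dotall flag is spelled re.DOTALL, and uses an invented sample string:

```python
import re

sample = "id_7 = 42;\nnext"

print(re.findall(r"[a-z]+", sample))   # ['id', 'next']
print(re.findall(r"\d+", sample))      # ['7', '42']
print(re.findall(r"\w+", sample))      # ['id_7', '42', 'next']

# A negated class keeps only what is NOT listed: here, strip all non-digits.
print(re.sub(r"[^0-9]", "", sample))   # 742

# "." stops at a newline unless dotall is switched on.
no_dotall = re.search(r"42;.next", sample)
with_dotall = re.search(r"42;.next", sample, re.DOTALL)
print(no_dotall is None, with_dotall is not None)  # True True
```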
Quantifiers attach to the previous token and say how many times it may repeat. * is zero or more, + is one or more, ? is zero or one, and {n,m} is “between n and m” (in most engines the upper bound may be omitted, and some also allow omitting the lower one). By default quantifiers are greedy: they match as much as possible and give back only if the rest of the pattern fails. Append ? for the lazy form (*?, +?, ??), which matches as little as possible and only extends when forced. Some engines add possessive quantifiers (*+, ++) that match greedily but refuse to give back on failure, which can prevent catastrophic backtracking on pathological input. PCRE has them; the Cloudflare pattern used plain greedy quantifiers.
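A minimal sketch of greedy versus lazy matching and of a bounded range, again in Python’s re; the HTML-like snippet is only a toy input, not a recommendation to parse HTML this way:

```python
import re

snippet = "<b>one</b> and <b>two</b>"

# Greedy .* runs to the LAST </b>, swallowing everything in between.
greedy = re.findall(r"<b>(.*)</b>", snippet)

# Lazy .*? stops at the FIRST </b> it can reach.
lazy = re.findall(r"<b>(.*?)</b>", snippet)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']

# {n,m} bounds the repetition: two to four digits, nothing more.
print(re.fullmatch(r"\d{2,4}", "123") is not None)    # True
print(re.fullmatch(r"\d{2,4}", "12345") is not None)  # False
```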
Groups wrap subpatterns. (pattern) is a capturing group — numbered from 1 in order of opening parenthesis — whose match can be referenced with \1 inside the pattern or with indexed accessors in the host language. (?<name>pattern) adds a named capture. (?:pattern) is a non-capturing group, used purely for alternation ((?:cat|dog)) or quantifier grouping without recording the match. | inside a group is alternation: match one of the alternatives.
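The group forms can be sketched in Python as follows. One dialect caveat: Python’s re spells a named group (?P<name>pattern), not (?<name>pattern), so the example uses the Python spelling. The sample strings are invented.

```python
import re

# Numbered capture plus back-reference: \1 must repeat what group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))  # is

# Named captures are addressable by name on the match object.
pair = re.search(r"(?P<key>\w+)=(?P<value>\w+)", "mode=fast")
print(pair.group("key"), pair.group("value"))  # mode fast

# With a capturing group, findall returns the group, not the whole match.
print(re.findall(r"(cat|dog)s?", "cats and dog"))    # ['cat', 'dog']

# A non-capturing group alternates without recording anything,
# so findall returns the whole match instead.
print(re.findall(r"(?:cat|dog)s?", "cats and dog"))  # ['cats', 'dog']
```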
Finally, flags change engine behavior: i for case-insensitive, m for multiline (^ and $ match line boundaries), s for dotall, u for Unicode. In JavaScript, the u flag also enables Unicode property escapes like \p{L} for “any letter.”
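For comparison, Python’s re exposes the same i, m, and s behaviors as re.IGNORECASE, re.MULTILINE, and re.DOTALL. A short sketch on a made-up three-line string:

```python
import re

text = "Error: disk full\nerror: retrying\nok"

# i: case-insensitive matching.
print(re.findall(r"error", text))                 # ['error']
print(re.findall(r"error", text, re.IGNORECASE))  # ['Error', 'error']

# m: ^ matches at the start of every line, not just the start of the input.
print(re.findall(r"^\w+", text))                  # ['Error']
print(re.findall(r"^\w+", text, re.MULTILINE))    # ['Error', 'error', 'ok']

# s: "." is allowed to cross the newline between the first two lines.
print(re.search(r"full.error", text) is None)                 # True
print(re.search(r"full.error", text, re.DOTALL) is not None)  # True
```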
Data / Comparison
| Quantifier (greedy / lazy / possessive) | Greedy (default) | Lazy (? suffix) | Possessive (+ suffix, where supported) |
|---|---|---|---|
| * / *? / *+ | Match as many as possible, backtrack on failure | Match as few as possible, extend on failure | Match as many as possible, no backtrack |
| + / +? / ++ | One or more, greedy | One or more, lazy | One or more, no backtrack |
| ? / ?? / ?+ | Zero or one, prefer one | Zero or one, prefer zero | Zero or one, no backtrack |
| {n,m} / {n,m}? / {n,m}+ | Range, greedy | Range, lazy | Range, no backtrack |
Greedy is the right default often enough that it is the default. Laziness matters when “the rest of the pattern” is itself permissive: for example, pulling <b>...</b> out of an HTML excerpt with <b>(.*?)</b> instead of a greedy .* that might swallow multiple tags (and even then, do not parse real HTML with regex). Possessive quantifiers and atomic groups (?>...) help when a pattern would otherwise re-explore exponentially many backtrack paths on near-matches. PCRE, Java, and Oniguruma support them; ECMAScript still does not, and Python’s re gained both atomic groups and possessive quantifiers only in Python 3.11.
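To make “exponentially many backtrack paths” concrete, here is a rough sketch in Python. The nested pattern (a+)+$ is a textbook example chosen for illustration, not the Cloudflare rule, and time_match is a hypothetical helper. The input sizes are kept small so the script finishes quickly; the absolute timings will vary by machine, but the nested pattern’s time should grow much faster than the flat one’s as n increases.

```python
import re
import time

def time_match(pattern: str, text: str) -> tuple[bool, float]:
    """Return (did it match, seconds taken) for re.match on text."""
    start = time.perf_counter()
    matched = re.match(pattern, text) is not None
    return matched, time.perf_counter() - start

NESTED = r"(a+)+$"  # nested quantifiers: many ways to split a run of a's
FLAT = r"a+$"       # accepts the same strings with a single quantifier

for n in (10, 14, 18):
    near_miss = "a" * n + "b"  # almost matches, then fails at the last char
    nested_hit, nested_secs = time_match(NESTED, near_miss)
    flat_hit, flat_secs = time_match(FLAT, near_miss)
    print(n, nested_hit, f"{nested_secs:.5f}s", flat_hit, f"{flat_secs:.5f}s")
```

Rewriting the pattern so that no quantifier is nested inside another is often the cheapest fix; atomic groups or possessive quantifiers are the alternative where the engine supports them.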
Real-world Scenarios
Scenario 1 — A pragmatic email match. The full email grammar from RFC 5322 admits comments, quoted local parts, and nested IP literals; a regex that covers all of it is notoriously monstrous (the most famous attempt is 6,425 characters long) and still not a real parser. The pattern I have actually shipped in production is almost always ^[^\s@]+@[^\s@]+\.[^\s@]+$ — “non-empty, non-whitespace, no stray @, with at least one dot in the domain” — which rejects the obvious typos without pretending to fully validate. The only way to truly verify an address is to send it a message. Use regex for the shape, email for the existence.
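A minimal sketch of that shape check in Python. The function name looks_like_email is made up for the example, and fullmatch is used because Python’s $ would otherwise tolerate a trailing newline.

```python
import re

# Shape only: something@something.something, no whitespace, no extra "@".
EMAIL_SHAPE = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")

def looks_like_email(value: str) -> bool:
    """Cheap plausibility check; says nothing about whether the mailbox exists."""
    return EMAIL_SHAPE.fullmatch(value) is not None

print(looks_like_email("ada@example.com"))    # True
print(looks_like_email("ada@example"))        # False (no dot in the domain)
print(looks_like_email("ada@@example.com"))   # False (stray @)
print(looks_like_email("ada @example.com"))   # False (whitespace)
print(looks_like_email("x@y..z"))             # True (shape only, still wrong)
```

The last line is the point of the scenario: the pattern happily accepts nonsense that has the right outline, which is why the verification email does the real work.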
Scenario 2 — Phone numbers with international formats. +82 2-1234-5678, (02) 1234-5678, and 82-2-1234-5678 all describe the same Seoul number. A regex like ^\+?\(?\d{1,3}[-\s().]*\d{1,4}[-\s().]*\d{3,4}[-\s().]*\d{3,4}$ accepts the common punctuation, including the opening parenthesis of the (02) form, and a normalization step then strips the punctuation to a digits-only form. For anything serious (routing, storage, dialing) use Google’s libphonenumber rather than rolling your own. The fastest meeting outcome I have seen on this topic was “we are not doing this with regex,” which saved the team a week of edge-case bugs. Regex is for the “does this look vaguely like a phone number” surface.
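The check-then-normalize split might look like the sketch below. The shape pattern used here allows an optional opening parenthesis before the first digit group so that the (02) form passes, and normalize_phone is a hypothetical helper name.

```python
import re
from typing import Optional

# Loose shape: optional "+", optional "(", then four digit groups
# separated by any mix of dashes, spaces, dots, and parentheses.
PHONE_SHAPE = re.compile(
    r"^\+?\(?\d{1,3}[-\s().]*\d{1,4}[-\s().]*\d{3,4}[-\s().]*\d{3,4}$"
)

def normalize_phone(raw: str) -> Optional[str]:
    """Return the digits of a plausible phone number, or None if the shape is off."""
    candidate = raw.strip()
    if PHONE_SHAPE.fullmatch(candidate) is None:
        return None
    return re.sub(r"\D", "", candidate)  # drop everything that is not a digit

print(normalize_phone("+82 2-1234-5678"))  # 82212345678
print(normalize_phone("82-2-1234-5678"))   # 82212345678
print(normalize_phone("(02) 1234-5678"))   # 0212345678
print(normalize_phone("call me maybe"))    # None
```

Note that the national form comes out as 0212345678 rather than 82212345678. Stripping punctuation does not reconcile country codes and trunk prefixes, which is exactly the kind of work better left to libphonenumber.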
Scenario 3 — Extracting fields from a log line. A line like 2026-04-13T02:11:05Z 192.0.2.42 "GET /search?q=foo HTTP/1.1" 200 1534 can be split with a single pattern: ^(?<ts>\S+)\s+(?<ip>\S+)\s+"(?<method>\w+)\s+(?<path>\S+)\s+\S+"\s+(?<status>\d+)\s+(?<bytes>\d+)$. Named groups pay off here: the resulting match object is dictionary-like and each field is addressable by name. When the log format changes, the pattern is also the documentation of what you are parsing.
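Here is the same pattern as a Python sketch. The only change is the named-group spelling, since Python’s re writes (?P<name>...) where the pattern above writes (?<name>...).

```python
import re

LOG_LINE = re.compile(
    r'^(?P<ts>\S+)\s+(?P<ip>\S+)\s+'
    r'"(?P<method>\w+)\s+(?P<path>\S+)\s+\S+"'
    r'\s+(?P<status>\d+)\s+(?P<bytes>\d+)$'
)

line = '2026-04-13T02:11:05Z 192.0.2.42 "GET /search?q=foo HTTP/1.1" 200 1534'

match = LOG_LINE.match(line)
# groupdict() turns the named groups into a plain dict of strings.
fields = match.groupdict() if match else {}

print(fields["ts"])      # 2026-04-13T02:11:05Z
print(fields["method"])  # GET
print(fields["path"])    # /search?q=foo
print(fields["status"])  # 200
```

Every field comes back as a string; converting status and bytes to integers is a separate, explicit step after the match.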
Common Misconceptions
“A regex can fully validate an email address.” It can only validate the shape. RFC 5322 is too complex to encode sensibly in a regex — and even if you did, “shape is valid” does not mean “mailbox exists.” The industry-standard pattern is a simple regex plus a verification email.
“Greedy is always slower than lazy.” Not inherently. Greedy matches can be faster when the quantifier’s subpattern is very restrictive, because the engine finishes in one long forward pass. Lazy wins when “the rest of the pattern” anchors the match, as in <b>(.*?)</b>. Benchmark with realistic inputs rather than reaching for ? reflexively.
“All regex engines are the same.” They are not. ECMAScript regex lacks possessive quantifiers and atomic groups (the v flag in modern engines closes a few gaps but not those); Python’s re has no \p{...} Unicode property escapes at all (the third-party regex module adds them); PCRE supports back-references and recursive patterns; Oniguruma, the engine behind Rust’s onig crate and, through its Onigmo fork, Ruby, is yet another dialect. Go’s regexp package follows RE2: it drops back-references and lookarounds in exchange for execution time guaranteed linear in the input. RE2 is one of the engines Cloudflare evaluated migrating to after its incident. A pattern you copy from a Perl tutorial may not work in JavaScript, and vice versa.
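A quick way to see a dialect difference first-hand is to ask the engine whether it will even compile a pattern. The sketch below does that for Python’s standard re module; compiles is a hypothetical helper, and the results in the comments reflect CPython releases I am aware of (up to 3.13), so treat them as an assumption if you run a newer version.

```python
import re

def compiles(pattern: str) -> bool:
    """True if Python's re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Python's own named-group spelling is accepted.
print(compiles(r"(?P<word>\w+)"))  # True

# The ECMAScript/PCRE spelling of a named group is rejected by re.
print(compiles(r"(?<word>\w+)"))   # False

# So is a Unicode property escape such as \p{L}.
print(compiles(r"\p{L}+"))         # False
```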
“Regex can parse HTML (or JSON, or XML).” No, because regular languages cannot describe balanced nesting. Regex can extract a specific, well-formed subpattern from structured text — a single attribute value, for example — but it cannot correctly parse the whole tree. For nested formats, use a dedicated parser (DOMParser, JSON.parse, an XML library, a CSV reader). The Stack Overflow “regex vs HTML” saga (the 2009 answer) is a cautionary tale, not a debate.
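The nesting problem shows up even on a tiny JSON document. In this sketch (the payload is invented), a lazy brace match cuts the outer object short at the first closing brace, while json.loads handles the nesting correctly.

```python
import json
import re

payload = '{"user": {"name": "ada", "tags": ["x", "y"]}, "ok": true}'

# Lazy match from the first "{" to the first "}": the inner object's
# closing brace ends the match, so the result is unbalanced.
naive = re.search(r"\{.*?\}", payload).group(0)
print(naive)  # {"user": {"name": "ada", "tags": ["x", "y"]}

# A real parser tracks the nesting and returns the whole structure.
parsed = json.loads(payload)
print(parsed["user"]["name"])  # ada
print(parsed["ok"])            # True
```

A greedy \{.*\} happens to work on this one input but fails as soon as two objects sit side by side, so neither variant is a parser.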
Checklist
- What is the input, and what are the counter-examples? Write both before writing the pattern.
- Is the data nested or recursive? If yes, use a parser. Regex is the wrong tool.
- Which engine are you targeting? JavaScript, Python, Go, and PCRE each differ on lookaround, back-references, and Unicode support.
- Do you need the match itself or just a yes/no? Prefer non-capturing (?:...) for groups that only exist for alternation or quantification.
- Is the pattern user-supplied, or applied to untrusted input? Guard against catastrophic backtracking with a time limit, atomic groups, or a linear-time engine like RE2. Otherwise you become Cloudflare.
- Are you performing text normalization afterwards? Do not encode everything into one enormous pattern; pair a simple shape check with a small post-processing step.
- Is the regex documented? A multi-line layout under the x (extended) flag, or a comment above the pattern, is cheap insurance for the next person to read it.
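As an illustration of the last checklist item, here is a sketch of a pattern written in extended mode, which Python spells re.VERBOSE. The date pattern is an invented example; in this mode whitespace inside the pattern is ignored and # starts a comment.

```python
import re

ISO_DATE = re.compile(
    r"""
    ^
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month (range is not checked here)
    (?P<day>\d{2})       # two-digit day (range is not checked here)
    $
    """,
    re.VERBOSE,
)

hit = ISO_DATE.fullmatch("2026-04-13")
print(hit.groupdict())  # {'year': '2026', 'month': '04', 'day': '13'}

miss = ISO_DATE.fullmatch("2026-4-13")
print(miss)             # None
```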
Related Tool
The Patrache Studio regex tester runs patterns against sample input in the browser and shows group captures inline, which is faster than shuttling between tabs. When the strings you are matching are themselves structured — log lines with JSON payloads, API responses — combine the regex work with JSON Formatting, Validation, and Schema in Practice so that the extracted piece is validated by a proper parser rather than a second regex. A common regex target is a UUID embedded in a URL; UUID v1 vs v4 vs v7: Picking a DB Primary Key covers why the same 36 characters can mean different things depending on the version bits.
References
- MDN, “Regular Expressions” Guide — https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions
- IETF RFC 5322, “Internet Message Format” (email grammar) — https://datatracker.ietf.org/doc/html/rfc5322
- regex101, interactive tester with flavor selection — https://regex101.com/
- Google, RE2 — linear-time regex engine — https://github.com/google/re2/wiki/Syntax