Most email security tools treat every message as if it arrived in a vacuum. Sender, subject line, body content — analyzed in isolation, scored in isolation, acted on in isolation. That model has a fundamental blind spot. Impersonation attacks don't just manipulate content. They manipulate context. And context is exactly what a relationship graph encodes.
Here's the thing: the 90-day send-receive history for any given mailbox is not just a log. It is a behavioral fingerprint. The frequency of communication with specific domains, the directionality of replies, the typical subject-line patterns between two parties — all of this encodes what normal looks like for that person. When something deviates from that normal, the graph detects it immediately, regardless of what the message body says.
What the Graph Actually Tracks
We're not storing message bodies. Full stop. The relationship graph ingests metadata only: sender address, recipient address, a hash of the subject line, reply-chain depth, and timing patterns. No content. No attachments. Just the structural facts of who talks to whom, how often, and in what direction.
Over 90 days, that metadata accumulates into something meaningful. An employee in accounts payable who exchanges 14 messages per month with a vendor's billing contact has a measurable relationship signal. The vendor's domain, the specific sending address, the reply cadence — all of it becomes baseline. We've found that 90 days is the right window: long enough to capture seasonal variation, short enough that job changes and vendor switches don't corrupt the model with stale data.
What we track per relationship edge:
- Message volume, including sent-vs-received asymmetry
- Average reply-chain depth (a deep chain means established dialogue)
- Subject-line hash clustering, since recurring threads have consistent hash neighborhoods
- Inter-message timing distribution, particularly business-hours regularity vs. off-pattern sends
- Domain age of the external party relative to the first contact date in our records
That last one matters more than people expect. A lookalike domain registered 11 days before an attack carries zero prior communication history. No relationship edge exists. The graph treats it as a stranger, regardless of how convincing the display name looks or how carefully the message body was crafted.
Why Content Scoring Misses This Class of Attack
LLM-based content scoring is good. We use it, and it catches a meaningful percentage of phishing attempts. But it has a known failure mode: a well-crafted impersonation email can score low-risk on content alone. The body is grammatically correct, the tone matches the supposed sender's style, there are no suspicious URLs, no malware attachments — nothing the content model can anchor on as a signal.
Display-name spoofing is the clearest example. An attacker crafts a message that shows "Michael Chen" in the From field, matching the actual CFO's name, but the sending address sits on a lookalike domain registered 3 weeks ago. The body asks for a wire transfer approval, uses the right internal jargon, and references a real deal the company is working on. Believable. Targeted.
Content scoring sees: plausible prose, correct terminology, no malicious payload. Score: medium-low risk.
The graph sees: no prior relationship edge between this recipient and that sending domain. Zero messages in 90 days. Relationship confidence: absent. Score: high anomaly. Alert fired.
Independent layers. That's the design principle. In our experience, the attacks that beat one layer almost never beat both. A zero-relationship-history sender writing a contextually plausible message is exactly the pattern that graph anomaly detection was built to surface.
Operationalizing the Graph as a Detection Primitive
The relationship graph is an engineering choice about what counts as a security primitive. A primitive, in this context, means a signal you can compose with other signals — a building block you can rely on being stable and independently meaningful regardless of downstream model changes.
In practice, this is how the graph feeds our detection pipeline:
- Relationship confidence score — a 0-to-1 value representing how established the sending address and domain relationship is for this specific recipient. Computed from volume, recency, and reply depth.
- Domain novelty flag — binary signal: has this sending domain appeared in this mailbox's receive history in the last 90 days? New domains from established vendors get flagged for analyst review, not auto-blocked.
- Display-name mismatch detection — cross-reference the display name against known relationship edges. If "Michael Chen" has never sent from this domain in 90 days, that is a zero-confidence relationship regardless of how the name renders in the client.
- Reply-chain injection detection — attackers sometimes insert themselves mid-thread. The graph tracks expected reply participants for a given thread hash cluster; a new sender joining an established thread with no prior relationship gets flagged for that thread context.
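The first of those signals, the relationship confidence score, can be sketched as a blend of saturating components. The weights, decay constants, and saturation points below are placeholder assumptions to show the shape of the computation, not the tuned production values.

```python
import math

def relationship_confidence(msg_count: int, days_since_last: float,
                            max_reply_depth: int) -> float:
    """0-to-1 confidence that a sender/recipient relationship is established.

    Illustrative formula: volume and reply depth saturate toward 1,
    recency decays with silence; fixed weights combine the three.
    """
    volume = 1 - math.exp(-msg_count / 10)     # saturates around ~30 messages
    recency = math.exp(-days_since_last / 30)  # decays as the edge goes quiet
    depth = min(max_reply_depth, 5) / 5        # capped reply-chain depth
    return 0.5 * volume + 0.3 * recency + 0.2 * depth
```

A stranger with no history lands near zero on every component; the accounts-payable vendor from earlier, with steady volume and deep reply chains, lands near one.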
These four signals feed into the combined scoring model alongside the LLM content analysis. Neither layer has veto power alone. The final disposition is a weighted function of both. But in our data, the graph layer is responsible for flagging approximately 34% of confirmed BEC attempts that the content layer scored below our alert threshold. Not marginal. Decisive.
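The combination step itself can be sketched as a simple weighted blend. The weights and the alert threshold here are placeholders; the point is the structure: neither layer alone can veto or force an alert.

```python
def combined_disposition(content_risk: float, graph_anomaly: float,
                         w_content: float = 0.55, w_graph: float = 0.45,
                         alert_threshold: float = 0.6) -> tuple[float, bool]:
    """Blend two independent risk layers (each in [0, 1]) into one disposition.

    Returns the combined score and whether it crosses the alert threshold.
    """
    score = w_content * content_risk + w_graph * graph_anomaly
    return score, score >= alert_threshold
```

Because the blend is linear, a zero-history sender (graph anomaly near 1.0) can push a content-plausible message (content risk near 0.3) over the threshold, which is exactly the well-crafted-impersonation pattern described above.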
Edge Cases and Known Limitations
Honest note: the graph is not magic. It has failure modes worth understanding before you treat it as a reliable primitive.
New employees are the hardest case. Someone who joined 3 weeks ago has almost no relationship history. Every external sender looks like a stranger from the graph's perspective. We handle this with a role-based warmup period: new accounts have graph scoring weighted lower for the first 60 days, with heavier reliance on content scoring and domain reputation signals until the relationship baseline accumulates enough volume to be meaningful.
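One way to implement that warmup is a simple ramp on the graph layer's weight. The linear ramp and the specific weight value are assumptions for illustration; the 60-day window is from the policy described above.

```python
def graph_layer_weight(account_age_days: int,
                       warmup_days: int = 60,
                       full_weight: float = 0.45) -> float:
    """Down-weight the graph layer for young mailboxes with thin history.

    Ramps linearly from 0 at day zero to full weight at the end of the
    warmup window (linear ramp is an illustrative choice).
    """
    if account_age_days >= warmup_days:
        return full_weight
    return full_weight * max(account_age_days, 0) / warmup_days
```

During the ramp, the weight removed from the graph layer shifts onto content scoring and domain reputation, which do not depend on per-mailbox history.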
Vendors with high staff turnover present a related problem. If a vendor cycles through multiple sending addresses, the relationship graph treats each new address as zero-confidence even when the domain is well established. Domain-level aggregation softens this: when the sending domain has a strong history, the domain confidence score compensates for the unknown address, but only partially. We tell customers explicitly that domain-level and address-level signals are tracked separately, and that both matter.
Attackers who do their homework sometimes try to solve the graph problem by identifying a dormant relationship — an address in the target's receive history that has been silent for 6 months — and spoofing that exact address. The timing distribution signal catches most of these. A sudden re-emergence after 180 days of silence triggers a dormancy anomaly flag regardless of how familiar the address looks to the recipient.
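The dormancy check itself is straightforward; the sketch below flags any edge whose silence exceeds the 180-day threshold mentioned above (the function name is hypothetical).

```python
from datetime import datetime, timedelta

def dormancy_anomaly(last_message: datetime, now: datetime,
                     threshold_days: int = 180) -> bool:
    """True when a previously known sender re-emerges after a long silence."""
    return (now - last_message) >= timedelta(days=threshold_days)
```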
Practical note: Relationship graph effectiveness degrades sharply if the ingestion window is shorter than 60 days. Below that, you cannot reliably distinguish novelty from infrequent-but-legitimate contact. The 90-day window is not arbitrary — it reflects the minimum needed to baseline the vendor relationship patterns most relevant to BEC-category attacks.
Where the Graph Goes Next
The most interesting extension we're actively building is cross-mailbox graph correlation. Right now, the graph is computed per mailbox in isolation. But impersonation campaigns often target multiple employees simultaneously — a coordinated wave, not a single message. If five accounts payable employees all receive first-contact messages from the same unknown domain within a 48-hour window, that coordinated novelty signal is far stronger than any single mailbox's anomaly score considered alone.
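A sketch of that coordinated-novelty check: given first-contact events as (domain, recipient, timestamp) tuples, flag any domain whose first contacts reach several distinct mailboxes inside a sliding window. The thresholds mirror the example above; the function name and event shape are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def coordinated_novelty(first_contacts, window=timedelta(hours=48),
                        min_recipients=5):
    """Return domains whose first-contact messages reached at least
    min_recipients distinct mailboxes within one sliding window."""
    by_domain = defaultdict(list)  # domain -> [(timestamp, recipient)]
    for domain, recipient, ts in first_contacts:
        by_domain[domain].append((ts, recipient))

    flagged = set()
    for domain, events in by_domain.items():
        events.sort()  # chronological order
        for i, (start, _) in enumerate(events):
            # Distinct mailboxes hit within `window` of this starting event.
            recipients = {r for ts, r in events[i:] if ts - start <= window}
            if len(recipients) >= min_recipients:
                flagged.add(domain)
                break
    return flagged
```

A domain that contacts one mailbox is ambiguous; the same unknown domain hitting five accounts-payable mailboxes inside 48 hours is a campaign.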
Early results from our pilot group show a 2.8x improvement in early-stage BEC campaign detection when cross-mailbox correlation is enabled. The graph stops being just a per-user fingerprint and starts behaving more like a network-wide anomaly detector. That is the direction we're building toward.
The graph as a security primitive is a more durable idea than any specific model checkpoint. Models get retrained. Attacker tactics evolve. But the fundamental structure — that established communication relationships encode behavioral expectation, and that deviation from that expectation is a reliable signal — holds regardless of how the attack surface changes. That is what makes it worth building correctly, and worth treating as first-class infrastructure rather than a secondary feature layered on top of content analysis.