Your email gateway blocked 99.3% of inbound threats last month. That remaining 0.7% is what we need to talk about. In our tracking across mid-market deployments, roughly 8 to 14% of targeted spear-phishing messages walk straight through Proofpoint and Mimecast without triggering a single rule. Not because those products are broken. Because the emails genuinely look fine, according to every indicator the platforms are built to check.
Why Rule-Based Engines Have a Structural Ceiling
Rule-based gateways are signature engines. They look for known-bad domains, attachment types, IP reputation scores, header anomalies, and DMARC alignment. These are the right things to check. They catch the bulk of commodity phishing and malware campaigns efficiently.
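As a caricature, that indicator-driven pass can be sketched in a few lines. Everything here is illustrative: the domain list, attachment types, and reputation threshold are hypothetical stand-ins, not any vendor's actual rule set.

```python
# Illustrative sketch of a signature-style gateway pass.
# All indicator lists and thresholds below are hypothetical.

KNOWN_BAD_DOMAINS = {"evil-invoice.example", "payr0ll-update.example"}
BLOCKED_ATTACHMENT_TYPES = (".exe", ".js", ".iso", ".scr")
IP_REPUTATION_THRESHOLD = 40  # block below this score (0-100)

def rule_based_verdict(msg: dict) -> str:
    """Return 'block' or 'deliver' from surface indicators alone."""
    sender_domain = msg["from"].split("@")[-1].lower()
    if sender_domain in KNOWN_BAD_DOMAINS:
        return "block"
    if any(att.lower().endswith(BLOCKED_ATTACHMENT_TYPES)
           for att in msg.get("attachments", [])):
        return "block"
    if msg.get("ip_reputation", 100) < IP_REPUTATION_THRESHOLD:
        return "block"
    if not (msg.get("spf_pass") and msg.get("dkim_pass")
            and msg.get("dmarc_aligned")):
        return "block"
    return "deliver"

# A well-crafted spear-phish clears every single check:
spear_phish = {
    "from": "cfo@legit-aged-domain.example",
    "attachments": [],
    "ip_reputation": 92,
    "spf_pass": True, "dkim_pass": True, "dmarc_aligned": True,
}
print(rule_based_verdict(spear_phish))  # deliver
```

Note that nothing in this pass ever reads the message body, which is exactly the scope limitation the next paragraph describes.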
But here's the thing: a well-crafted spear-phishing message has none of those indicators. The sender domain is freshly registered or legitimately aged. The body references your CFO's name, mentions a recent acquisition your company announced, and asks for a wire confirmation that is procedurally plausible given your actual vendor relationships. No attachment. No suspicious link. Just text that contextually makes sense, from a surface that passes all authentication checks.
Rule engines have no model of what "contextually makes sense" means. That is not a bug in the design. It is a fundamental scope limitation. They were built to match patterns, not to reason about intent.
The gap is measurable. Not theoretical.
What Contextual Coherence Scoring Actually Does
When Phishaver inspects an inbound message, the LLM is not running a keyword scan. It is reasoning about the relationship between what is being requested, who is plausibly sending it, what the established communication history looks like, and whether the urgency framing fits any documented pattern of social engineering manipulation.
That last part matters more than most people expect. In our experience, the single most reliable indicator of a spear-phishing attempt is not content alone. It is the mismatch between content and established sender behavior. We build a 90-day relationship graph per mailbox. When a message arrives from your "CFO" asking for an urgent wire transfer, the system scores not just the message body, but whether this sender has historically initiated financial requests, what the typical communication pattern looks like between these two parties, and whether the tone and urgency signature match prior authentic exchanges.
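A toy version of that behavioral check might look like the following. The baseline fields, weights, and risk contributions are hypothetical illustrations of the idea, not Phishaver's actual model, which reasons over far richer history than three scalar features.

```python
# Hypothetical sketch of scoring a message against a 90-day
# sender relationship baseline. Features and weights are
# illustrative, not the production model.
from dataclasses import dataclass

@dataclass
class RelationshipBaseline:
    """Summary of 90 days of traffic between two parties."""
    messages_exchanged: int = 0
    prior_financial_requests: int = 0
    typical_urgency: float = 0.1   # 0.0 calm .. 1.0 urgent

def coherence_risk(baseline: RelationshipBaseline,
                   is_financial_request: bool,
                   urgency: float) -> float:
    """Higher = more incoherent with established behavior (0..1)."""
    risk = 0.0
    if baseline.messages_exchanged == 0:
        risk += 0.4   # no history at all with this sender
    if is_financial_request and baseline.prior_financial_requests == 0:
        risk += 0.35  # first-ever financial ask from this party
    # urgency far above this pair's normal register is suspicious
    risk += max(0.0, urgency - baseline.typical_urgency) * 0.25
    return min(risk, 1.0)

# "CFO" who has never initiated a wire request, suddenly urgent:
cfo = RelationshipBaseline(messages_exchanged=120,
                           prior_financial_requests=0,
                           typical_urgency=0.2)
print(coherence_risk(cfo, is_financial_request=True, urgency=0.9))
```

The point of the sketch: the same message body scores very differently depending on who it claims to come from and what that relationship has actually looked like.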
Most targeted campaigns are constructed from LinkedIn data and public sources. The attacker knows your org chart. They know your vendors. They may know your procurement cycle. What they cannot fake is the fine-grained texture of 90 days of actual communication between two real people in your organization.
Intent scoring returns in 800ms. That is below the latency threshold where email delivery UX degrades. The detection happens before the message reaches the inbox, not as an after-the-fact alert.
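The pre-delivery constraint can be sketched as a gate with a hard latency budget. The `score_intent` stub, the 0.8-second budget, and the fail-open policy below are all assumptions for illustration, not the shipping architecture.

```python
# Sketch of a pre-delivery gate with a hard latency budget.
# score_intent() stands in for the model call; the budget and
# fail-open policy are illustrative assumptions.
import concurrent.futures
import time

LATENCY_BUDGET_S = 0.8  # keep email delivery UX intact

def score_intent(message: str) -> float:
    """Placeholder for the real intent-scoring call (0..1 risk)."""
    time.sleep(0.05)  # simulate model latency
    return 0.9 if "urgent wire" in message.lower() else 0.1

def gate(message: str, threshold: float = 0.7) -> str:
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(score_intent, message)
        try:
            risk = future.result(timeout=LATENCY_BUDGET_S)
        except concurrent.futures.TimeoutError:
            return "deliver"  # fail open: never hold mail on a slow model
    return "quarantine" if risk >= threshold else "deliver"

print(gate("Urgent wire confirmation needed before 3pm"))  # quarantine
print(gate("Minutes from Tuesday's standup attached"))     # deliver
```

The design choice worth noting is the timeout branch: an inline scoring layer has to decide in advance what happens when the model is slow, because delaying legitimate mail is its own kind of incident.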
The 22-Minute Problem
Here is what detection latency actually costs when you miss one. In our data, the median dwell time from a successful credential-harvesting spear-phish to active credential use is 22 minutes. That is the window between "the user clicked the link" and "an attacker is logged into your environment." Twenty-two minutes.
Most SOC alert workflows do not operate at that speed. By the time an analyst triages a suspicious email report, investigates, confirms, and initiates a credential reset, that window has often already closed. The attacker is in.
This is not an argument against having analysts. It is an argument for not relying on post-delivery detection as your primary defense layer for targeted attacks. The email that bypasses your gateway generates an alert: after delivery, after it lands in the inbox, after the user has potentially already interacted with it. Pre-delivery intent scoring eliminates that exposure window entirely for the messages it catches.
According to FBI IC3 data, the average cost of a single successful BEC incident for mid-market companies exceeds $125,000. The detection math here is not complicated.
Where This Fits in a Mid-Market Security Stack
We are not suggesting organizations replace their existing gateway. Proofpoint and Mimecast do essential work on high-volume, commodity threats. Running Phishaver alongside your existing MX stack means you are not re-solving problems that are already solved. You are adding coverage for the narrow but high-consequence attack category those tools were never designed to handle.
For teams in the 200 to 2,500 employee range, the reality is that you do not have a 20-person threat intel function reviewing every suspicious email. Your SOC, if you have a dedicated one, is handling alerts across your entire stack. Adding another high-noise detection layer creates fatigue and real operational cost.
The value of LLM-based intent scoring is specificity. Because it reasons about contextual coherence rather than matching indicators, its false positive rate on targeted attack detection is substantially lower than that of behavioral anomaly rules, which tend to fire on anything slightly unusual. In our experience, the difference in analyst time between a high-precision alert and a medium-precision alert is not marginal. It is the difference between an alert that gets acted on and one that gets deferred until morning.
Real talk: deferred until morning is how breaches happen.
What Missing a Spear-Phish Actually Looks Like
Honestly, the campaigns that concern us most are not the ones that look suspicious. Those get caught. The ones we built Phishaver to stop are indistinguishable from legitimate email on every surface-level check.
A real example from 2024: a campaign targeting mid-market finance teams used LinkedIn data about specific recipients, including their tenure, recent promotions, and reported relationships with executives. Emails were sent from aged domains with valid SPF, DKIM, and DMARC. The message body referenced a real vendor relationship by name, asked for a routine invoice approval, and included a link to a credential-harvesting page styled to match the target's actual bank portal. No attachment. No suspicious sender. No obvious urgency language.
Every rule-based check passed. The messages were contextually incoherent in one specific way: the urgency framing and the specific invoice amount did not match any prior communication pattern with that vendor. An LLM reasoning about 90 days of communication history catches that. A signature engine does not.
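One narrow slice of that incoherence check, the invoice-amount mismatch, can be approximated statistically. The function, z-score threshold, and sample history below are hypothetical; the full system reasons over far more than amounts, but this shows why a baseline matters.

```python
# Hypothetical check of an invoice request against prior vendor
# traffic. The statistics and threshold are illustrative.
import statistics

def invoice_amount_is_anomalous(prior_amounts: list,
                                requested: float,
                                z_threshold: float = 3.0) -> bool:
    """Flag requests far outside the vendor's historical range."""
    if len(prior_amounts) < 3:
        return True  # no usable baseline: treat as anomalous
    mean = statistics.fmean(prior_amounts)
    stdev = statistics.stdev(prior_amounts)
    if stdev == 0:
        return requested != mean
    return abs(requested - mean) / stdev > z_threshold

# Vendor whose monthly invoices cluster tightly around $4,300:
history = [4200.0, 4350.0, 4100.0, 4500.0, 4275.0]
print(invoice_amount_is_anomalous(history, 48750.0))  # True
print(invoice_amount_is_anomalous(history, 4400.0))   # False
```

An attacker can learn the vendor's name from public sources; the distribution of 90 days of actual invoice amounts is exactly the kind of texture they cannot see.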
We have seen this pattern consistently across sectors. The attackers who target companies in the 500 to 2,000 employee range are not sending spray-and-pray malspam. They are doing real reconnaissance. They are writing emails that a human receiving them would find unremarkable. The only reliable detection path is a system that also reasons about context at that level of sophistication.
That is the gap. And it is not closing on its own.