Entity-Swap Pages — When Only the Noun Changes Between URLs

spam/entity-swap masks the variable noun on every page (by default, US state names and 5-digit ZIP codes), computes a 64-bit SimHash of what is left, and fires at critical severity when two pages score 95% similarity or higher. That masked-similarity score is the convergence signal Google's SpamBrain has used against entity-swap doorways since the March 5, 2024 scaled-content-abuse update.

What it detects

spam/entity-swap is the rule that catches the single cleanest fingerprint of programmatic generation: a page whose only real difference from its siblings is the entity you swapped in. The rule masks every page's main content with your entity patterns — the defaults cover all 50 US state names and 5-digit ZIP codes, and you add your own dimensions (cities, SKUs, job titles) in pseolint.config.ts — and then computes a 64-bit SimHash over the masked text.
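
As a rough illustration of what declaring your own dimensions could look like, here is a minimal, standalone sketch of regex-based masking. The source does not show the exact shape of the pseolint config, so `customEntityPatterns`, `maskWith`, and the sample regexes are hypothetical and only demonstrate the idea of masking the axes a template swaps.

```typescript
// Hypothetical custom dimensions. The names and regexes are illustrative only
// and are not taken from pseolint's actual configuration schema.
interface EntityPattern {
  name: string;
  pattern: RegExp;
}

const customEntityPatterns: EntityPattern[] = [
  { name: "city", pattern: /\b(?:Austin|Miami|Columbus|Reno)\b/g },
  { name: "sku", pattern: /\b[A-Z]{2}-\d{4}\b/g },
  { name: "job", pattern: /\b(?:Plumber|Electrician|Roofer)s?\b/gi },
];

// Replace every match of every pattern with a placeholder such as <CITY>.
function maskWith(text: string, patterns: EntityPattern[]): string {
  let masked = text;
  for (let i = 0; i < patterns.length; i++) {
    const placeholder = "<" + patterns[i].name.toUpperCase() + ">";
    masked = masked.replace(patterns[i].pattern, placeholder);
  }
  return masked;
}

const pageA = maskWith("Roofers in Austin stock part AB-1234", customEntityPatterns);
const pageB = maskWith("Plumbers in Reno stock part ZX-9876", customEntityPatterns);
// Both pages reduce to the same sentence frame once the swapped nouns are masked.
console.log(pageA === pageB);
```

If two pages collapse to the same string like this, everything that distinguished them was on an axis the template swaps, which is the condition the rule is designed to surface.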

Masking is what separates this rule from spam/near-duplicate. Near-duplicate hashes the raw text and fires at 85%, so two location pages with genuinely different city paragraphs can slip under its bar. Entity-swap removes the entity tokens first, so if the remaining sentence frames are identical the masked similarity rockets toward 100%. The pairwise O(n²) sweep flags any pair scoring 95% or above at critical severity, and records the pair as a PairMatch that spam/doorway-pattern later consumes as one of the three signals it needs to converge.
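
The mask, hash, and pairwise sweep described above can be sketched as follows. This is not pseolint's implementation: the tokenizer (whitespace split), the token hash (two 32-bit FNV-1a passes with an arbitrary second seed), the similarity formula (1 minus Hamming distance over 64), and the shortened state list are all assumptions chosen to keep the example small.

```typescript
// Stand-in entity patterns: a shortened state list plus 5-digit ZIP codes.
const ENTITY_PATTERNS: RegExp[] = [/\b(?:Texas|Florida|Ohio|Nevada)\b/gi, /\b\d{5}\b/g];

function maskEntities(text: string): string {
  return ENTITY_PATTERNS.reduce((acc, re) => acc.replace(re, "<ENTITY>"), text);
}

// 32-bit FNV-1a. The multiply by the FNV prime 16777619 is written as shifts
// so intermediate values stay exactly representable.
function fnv1a32(token: string, seed: number): number {
  let h = seed >>> 0;
  for (let i = 0; i < token.length; i++) {
    h ^= token.charCodeAt(i);
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return h;
}

// 64-bit SimHash returned as an array of 64 bits (0 or 1).
function simhash64(text: string): number[] {
  const weights: number[] = [];
  for (let bit = 0; bit < 64; bit++) weights.push(0);
  const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  for (let t = 0; t < tokens.length; t++) {
    const lo = fnv1a32(tokens[t], 0x811c9dc5); // standard FNV offset basis
    const hi = fnv1a32(tokens[t], 0x01234567); // arbitrary second seed (assumption)
    for (let bit = 0; bit < 64; bit++) {
      const word = bit < 32 ? lo : hi;
      weights[bit] += ((word >>> (bit % 32)) & 1) === 1 ? 1 : -1;
    }
  }
  return weights.map((w) => (w > 0 ? 1 : 0));
}

// Assumed similarity: share of the 64 bits on which two fingerprints agree.
function similarity(a: number[], b: number[]): number {
  let distance = 0;
  for (let bit = 0; bit < 64; bit++) if (a[bit] !== b[bit]) distance++;
  return 1 - distance / 64;
}

interface Page { url: string; text: string }
interface PairMatch { a: string; b: string; similarity: number }

// Pairwise O(n^2) sweep over masked fingerprints.
function findEntitySwapPairs(pages: Page[], threshold: number = 0.95): PairMatch[] {
  const hashes = pages.map((p) => simhash64(maskEntities(p.text)));
  const matches: PairMatch[] = [];
  for (let i = 0; i < pages.length; i++) {
    for (let j = i + 1; j < pages.length; j++) {
      const s = similarity(hashes[i], hashes[j]);
      if (s >= threshold) matches.push({ a: pages[i].url, b: pages[j].url, similarity: s });
    }
  }
  return matches;
}
```

Under the assumed similarity formula, a 95% floor on a 64-bit fingerprint tolerates at most 3 differing bits: 61/64 is about 95.3%, while 60/64 is 93.75%.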

Why it matters

An entity-swap pair is the hardest pattern to defend because it admits what it is. When /plumbers/ohio and /plumbers/nevada say the same thing in the same order with two words changed, there is no argument that the second page serves a need the first does not. Google's classifiers treat the masked-similarity signal as near-conclusive precisely because the false-positive rate is so low — real local pages diverge once you remove the place name, and generated ones do not.

The 95% floor is deliberately conservative so the rule rarely cries wolf, which means a finding is worth acting on the day it appears. Field reports after the March 5, 2024 rollout showed entity-swap clusters losing the bulk of their long-tail impressions inside a 6-week window, and because the pairs feed spam/doorway-pattern, an unaddressed entity-swap problem tends to escalate from a quiet near-duplicate warning into the critical doorway stack that draws manual review.

A page that fails

/grants/small-business-grants-texas and /grants/small-business-grants-florida. Strip 'Texas' and 'Florida' and the two pages are byte-for-byte identical: same 'How to qualify' intro, same three eligibility bullets, same 'Apply before the deadline' close. Masked SimHash similarity 99%. The rule fires at critical and hands the pair to spam/doorway-pattern, where the identical structure and shared meta description complete the three-signal stack.

A page that passes

/grants/small-business-grants-texas and /grants/small-business-grants-florida, rebuilt from a state grants dataset. The Texas page leads with the Texas Enterprise Fund and a franchise-tax exemption; the Florida page leads with the absence of a state income tax and county-level economic-development grants. Different agencies, different dollar amounts, different deadlines. Masked similarity drops to 38% because the sentence frames themselves now differ, not just the state name — and the entity-swap pair never forms.

How to fix it

  1. Bind real per-entity data, not synonyms. Swapping 'top' for 'best' or lightly rewording a sentence barely moves the masked SimHash; the rule already ignores the entity token, so only genuinely different facts move the score.
  2. Lead each page with the one thing that entity has and its siblings lack (a local statute, a region-specific fee, a SKU's actual spec) so the opening sentence frame diverges, not just the noun.
  3. Audit your data source for thin records. An entity-swap cluster usually traces back to rows that carry no distinguishing fields; if the data cannot differentiate the page, the page probably should not exist as a separate URL.
  4. Consolidate entities you cannot differentiate. Five states with identical programs are better served by one page that names all five than five pages that pretend to be different.
  5. Re-run after each fix. Because the rule is pairwise, breaking one page out of a cluster can drop several findings at once as the remaining pairs fall below 95%.
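
The arithmetic behind the last step can be made concrete with a small sketch. It assumes a cluster in which every pair of pages is above the threshold and that each pair produces one finding; the function names are hypothetical and are not part of pseolint.

```typescript
// Number of distinct pairs among n mutually similar pages: n * (n - 1) / 2.
function pairCount(n: number): number {
  return (n * (n - 1)) / 2;
}

// Findings removed when one page is differentiated enough to leave the cluster:
// every pair that included that page, which is n - 1 pairs.
function findingsCleared(n: number): number {
  return pairCount(n) - pairCount(n - 1);
}

// A five-page cluster holds 10 pairs; fixing one page leaves 6 and clears 4.
console.log(pairCount(5), pairCount(4), findingsCleared(5));
```

This is why the finding count tends to fall faster than the number of pages you rewrite, at least while the cluster is still densely connected.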

SpamBrain context

Entity masking mirrors how Google's deduplication has worked since it adopted SimHash-style fingerprinting for crawl: the index does not care which proper noun you inserted, it cares whether the document adds anything the rest of the web lacks. The March 5, 2024 scaled-content-abuse policy named 'creating many pages where little changes between them' as a violation in its own right, independent of whether a human or a model produced the text.

spam/entity-swap (shipped in @pseolint/core, MIT-licensed at github.com/ouranos-labs/pseolint) operationalises that clause with the strictest threshold in the spam family — 95% on masked text versus 85% on raw text for spam/near-duplicate — so it surfaces the pattern the policy targets without flagging legitimately templated pages that vary their content. It is one of the three independent signals spam/doorway-pattern requires, which is why clearing entity-swap findings early is the cheapest way to keep a programmatic template out of the critical doorway tier.

Frequently asked questions

How is entity-swap different from near-duplicate?
Near-duplicate hashes your raw text and fires at 85% similarity; entity-swap masks the variable noun first — by default US state names and ZIP codes — then hashes what remains and fires at the stricter 95%. The masking is the whole point: two pages can have city paragraphs different enough to pass near-duplicate while being identical sentence frames once the city name is removed. Entity-swap exists to catch exactly that case.
What does pseolint mask by default, and can I add my own entities?
The defaults cover all 50 US state names and 5-digit ZIP codes. You add your own dimensions — cities, product SKUs, job titles, company names — as regex patterns in the entityPatterns option or pseolint.config.ts. The patterns you declare are exactly the variables your template swaps, so masking them is how you tell the rule which axis to ignore while it judges whether anything else changes.
Why does the rule fire at critical instead of warning?
Because the false-positive rate is very low. A pair that is 95% similar after the entity is removed is, by construction, two pages that say the same thing about different nouns — the textbook doorway shape Google's policy describes. Genuine per-entity pages diverge the moment you mask the entity, so they never reach the threshold. That high confidence is why entity-swap is one of the signals that can push a template into the critical doorway stack.
I have real local pages that still trip this. What now?
Look at what your pages actually say once the place name is gone. If the answer is 'the same thing', the locality is cosmetic and the rule is correct — add genuinely local facts (regulations, pricing, named providers) or consolidate. If your pages truly differ and still trip, your entity pattern is probably too narrow, leaving other shared nouns unmasked; widen the patterns so the rule judges the right axis.
Does fixing entity-swap also clear my doorway findings?
Often, yes. spam/doorway-pattern only fires when three signals converge, and entity-swap is one of them. Breaking the masked similarity below 95% removes that signal from the stack, which is frequently enough to drop the pair below the three-signal threshold even if it still trips near-duplicate. Fixing entity-swap is usually the cheapest way to dismantle a doorway cluster.
We run a real multi-location veterinary group — will this rule punish us?
Only if your clinic pages are interchangeable. A genuine veterinary group differentiates each location on its on-site surgical suite, its emergency feline-and-canine triage hours, its boarding-kennel capacity, and the named vets who practise there. Mask the town and those pages still diverge, so the entity-swap pair never assembles. If masking leaves identical vaccination-schedule boilerplate behind, the rule is correctly telling you the locations exist only on paper. A mobile farrier who lists every locality where he shoes horses, repeating one hoof-trimming blurb per page, is the equine version of the same trap.

Want to know whether this rule actually fires on your site?

Run pseolint against your sitemap. The audit is free, takes about a minute, and returns a per-URL list of every rule that fired — including this one — with the exact metric values so you can prioritise the fix queue.