SpamBrain rules — what pseolint detects.
pseolint runs 47 rules across 8 categories — spam-pattern detection (8 spam/*), AEO/answer-engine readiness (8 aeo/*), graph integrity (6 links/* including host-section-divergence, the May 2024 site-reputation-abuse detector), technical SEO (9 tech/*), content quality (7 content/*), structured data (3 schema/*), data-binding consistency (2 data/*), and cannibalization (1 cannibal/*). Every rule fires per sampled page and aggregates into a per-template verdict — not a per-URL list.
These rules cover programmatic-SEO patterns + AI Overview readiness. They don't replace a general SEO audit — for Core Web Vitals use PageSpeed Insights, and for broken-link scanning use Sitebulb ($35/mo) or Screaming Frog ($259/yr).
Featured deep-dive explainers below; full taxonomy, per-template aggregation model, and SpamBrain mapping further down.
- spam/thin-content
Thin Content Detection
Google's helpful content system launched on August 25, 2022, and Google reported that its March 5, 2024 core and spam updates, which introduced the scaled-content-abuse policy, reduced low-quality, unoriginal content in search results by 45%. The spam/thin-content rule targets the same pattern by flagging every URL under 300 words of substantive body text (default), after stripping nav and footer chrome via SpamBrain-style readability heuristics.
- spam/doorway-pattern
Doorway Pages
Google has banned doorway pages since the March 16, 2015 Search Central post — pseolint's spam/doorway-pattern rule mirrors SpamBrain's convergence logic by requiring 3 independent signals to stack (SimHash near-duplicate above 0.85, entity-swap, and structural confirmation) before firing at error severity (weight 25), the highest-confidence spam pattern reported by @pseolint/core v0.4.3.
- spam/near-duplicate
Near-Duplicate Pages
85% SimHash similarity is the pseolint default threshold — every page pair at or above that mirrors the near-duplicate canonicalisation ceiling Google's web indexing team has used since adopting Charikar's 2002 SimHash paper in 2007, and which the March 5, 2024 scaled-content-abuse update reaffirmed as policy via SpamBrain's 60-second triage queue.
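The comparison behind this rule can be sketched as follows. This is a minimal illustration, assuming word-level features and a seeded FNV-1a hash. The function names (`fnv1a32`, `simhash64`, `simhashSimilarity`, `isNearDuplicate`) and the hashing choices are hypothetical and not taken from pseolint's source; only the 0.85 default comes from the text.

```typescript
// 32-bit FNV-1a hash; the seed lets two calls produce two independent 32-bit halves.
function fnv1a32(text: string, seed: number): number {
  let hash = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, then wrap to 32 bits.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash;
}

// Returns a 64-entry array of 0/1 bits (an array avoids BigInt for portability).
function simhash64(text: string): number[] {
  const weights: number[] = [];
  for (let b = 0; b < 64; b++) weights.push(0);
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
  tokens.forEach((token) => {
    const low = fnv1a32(token, 0);
    const high = fnv1a32(token, 0x9e3779b9);
    for (let b = 0; b < 64; b++) {
      const bit = b < 32 ? (low >>> b) & 1 : (high >>> (b - 32)) & 1;
      weights[b] += bit === 1 ? 1 : -1; // each token votes on every bit position
    }
  });
  return weights.map((w) => (w > 0 ? 1 : 0));
}

// Similarity = share of the 64 bit positions on which two fingerprints agree.
function simhashSimilarity(a: number[], b: number[]): number {
  let differing = 0;
  for (let i = 0; i < 64; i++) if (a[i] !== b[i]) differing++;
  return 1 - differing / 64;
}

const NEAR_DUPLICATE_THRESHOLD = 0.85; // default quoted in the text

function isNearDuplicate(pageA: string, pageB: string): boolean {
  return simhashSimilarity(simhash64(pageA), simhash64(pageB)) >= NEAR_DUPLICATE_THRESHOLD;
}
```

At 64 bits, a 0.85 threshold means two fingerprints may differ in at most 9 bit positions.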
- spam/boilerplate-ratio
Boilerplate Ratio
60% is the default boilerplateMaxRatio: pseolint identifies sentence-level blocks appearing on 80%+ of pages, then flags any URL whose word count is dominated by those repeated blocks (warning severity, weight 12).
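A rough sketch of the two-step calculation described for this rule, assuming sentences are split on terminal punctuation and compared after lower-casing. The helper names and the splitting rule are illustrative assumptions; the 80% page share and the 60% ratio are the defaults quoted above.

```typescript
function splitSentences(body: string): string[] {
  return body
    .split(/[.!?]+\s*/)
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s.length > 0);
}

function countWords(sentence: string): number {
  return sentence.split(/\s+/).filter((w) => w.length > 0).length;
}

// For each page: share of its words that sit in sentences repeated on >= minPageShare of pages.
function boilerplateRatios(pages: string[], minPageShare: number = 0.8): number[] {
  const pagesWithSentence: Record<string, number> = Object.create(null);
  const perPage = pages.map(splitSentences);
  perPage.forEach((sentences) => {
    const seen: Record<string, boolean> = Object.create(null);
    sentences.forEach((s) => {
      if (!seen[s]) {
        seen[s] = true; // count each sentence once per page
        pagesWithSentence[s] = (pagesWithSentence[s] || 0) + 1;
      }
    });
  });
  return perPage.map((sentences) => {
    let total = 0;
    let repeated = 0;
    sentences.forEach((s) => {
      const words = countWords(s);
      total += words;
      if (pagesWithSentence[s] / pages.length >= minPageShare) repeated += words;
    });
    return total === 0 ? 0 : repeated / total;
  });
}

const BOILERPLATE_MAX_RATIO = 0.6; // default boilerplateMaxRatio quoted in the text

function flagBoilerplatePages(pages: string[]): boolean[] {
  return boilerplateRatios(pages).map((ratio) => ratio > BOILERPLATE_MAX_RATIO);
}
```

Whether the real rule flags at exactly 60% or only above it is not stated in the text; the sketch uses a strict comparison.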
- spam/template-diversity
Template Diversity
30% is the default minUniqueRatio threshold — pseolint warns when fewer than 30% of pages carry a structurally distinct HTML skeleton, the floor at which SpamBrain (rebuilt August 25, 2022) starts reading a domain as one template rather than N designed pages.
- links/host-section-divergence
Site Reputation Abuse
Google's site-reputation-abuse policy, which took effect on May 5, 2024, demotes subfolders that borrow a host's reputation without earning it. links/host-section-divergence flags a URL section (e.g. /coupons/, /deals/) only when it diverges from the rest of the host on at least 2 of 4 independent structural signals, and it deliberately fires on the minority section, never on a balanced multi-topic split.
- spam/entity-swap
Entity-Swap Pages
spam/entity-swap masks the variable noun on every page — by default US state names and 5-digit ZIP codes — then computes a 64-bit SimHash of what is left and fires at critical severity when two pages score 95% similarity or higher, the convergence signal Google's SpamBrain has used against entity-swap doorways since the March 5, 2024 scaled-content-abuse update.
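The masking step can be illustrated with a short sketch. The state list below is a four-name sample rather than the full default list, and the placeholder tokens (`{state}`, `{zip}`) and the function name are assumptions made for the example. The fingerprinting step that follows masking is omitted here.

```typescript
// Illustrative subset only; the text says the default covers US state names.
const SAMPLE_STATE_NAMES = ["new york", "california", "texas", "ohio"];

// Replace the variable noun (state name, 5-digit ZIP) with fixed placeholders.
function maskEntities(text: string): string {
  let masked = text.toLowerCase().replace(/\b\d{5}\b/g, "{zip}");
  SAMPLE_STATE_NAMES.forEach((state) => {
    masked = masked.replace(new RegExp("\\b" + state + "\\b", "g"), "{state}");
  });
  return masked.replace(/\s+/g, " ").trim();
}

const maskedTexas = maskEntities("Emergency plumbers in Texas 73301, open 24/7");
const maskedOhio = maskEntities("Emergency plumbers in Ohio 43004, open 24/7");
// Both strings collapse to the same masked text, which is what the
// similarity check would then see.
```

Once the entity is masked, two pages that differ only by location produce identical input for the similarity comparison, which is why the rule can use a threshold as strict as 95%.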
- spam/publication-velocity
Publication Velocity
spam/publication-velocity groups your pages by publish date and warns when any single day exceeds the greater of 100 pages or 10% of your whole corpus — the date-stacking signal Google's March 27, 2026 core update tightened against programmatically generated sites.
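The threshold arithmetic is simple enough to show directly. This sketch assumes publish dates arrive as ISO 8601 strings and that "day" means the first ten characters (YYYY-MM-DD); the function name is hypothetical.

```typescript
// Returns the days whose page count exceeds max(100, 10% of the corpus).
function daysOverVelocityLimit(publishDates: string[]): string[] {
  const pagesPerDay: Record<string, number> = Object.create(null);
  publishDates.forEach((date) => {
    const day = date.slice(0, 10); // assumes "YYYY-MM-DD..." input
    pagesPerDay[day] = (pagesPerDay[day] || 0) + 1;
  });
  const limit = Math.max(100, publishDates.length * 0.1);
  return Object.keys(pagesPerDay)
    .filter((day) => pagesPerDay[day] > limit)
    .sort();
}
```

For a 200-page corpus the limit is 100 pages per day (10% would be only 20), while for a 50,000-page directory it rises to 5,000, so the same burst size can be flagged on a small site and pass on a large one.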
- spam/template-coverage
Template Coverage
spam/template-coverage groups URLs in the same directory, masks the entity tokens in each filename, and reports how many of the possible dimension combinations a template actually fills — surfacing, at info severity, the sparse high-dimension matrices Google's March 27, 2026 core update down-weighted on programmatic sites.
- content/unique-value
Unique Value
content/unique-value scores how original each page is as a rarity density — every distinct word weighted by how rare it is across the audit, then averaged — and fires when that density falls below the floor, the page-specific-vocabulary test Google's scaled-content-abuse policy has applied since March 5, 2024 when it asks whether a URL adds anything genuinely new.
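One way to read "rarity density" is sketched below. The text does not give the exact weighting, so this example assumes a word's rarity is the share of audited pages that do not contain it; the real rule may use a different weight (for example a logarithmic one), and the floor is left out because no default is quoted.

```typescript
function distinctWords(body: string): string[] {
  const seen: Record<string, boolean> = Object.create(null);
  body
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .forEach((w) => {
      if (w.length > 0) seen[w] = true;
    });
  return Object.keys(seen);
}

// Per page: average rarity of its distinct words, where
// rarity(word) = 1 - (pages containing the word / total pages). This weighting is an assumption.
function rarityDensities(pages: string[]): number[] {
  const perPage = pages.map(distinctWords);
  const pagesWithWord: Record<string, number> = Object.create(null);
  perPage.forEach((words) =>
    words.forEach((w) => {
      pagesWithWord[w] = (pagesWithWord[w] || 0) + 1;
    })
  );
  return perPage.map((words) => {
    if (words.length === 0) return 0;
    let total = 0;
    words.forEach((w) => {
      total += 1 - pagesWithWord[w] / pages.length;
    });
    return total / words.length;
  });
}
```

Under this weighting a word that appears on every page contributes nothing, so a page built entirely from template vocabulary scores 0.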
- content/meta-uniqueness
Meta Description Uniqueness
content/meta-uniqueness masks the entity tokens in every page's meta description, lower-cases and trims what remains, and fires an error the moment two or more pages collapse to the same string — the templated-snippet pattern Google has treated as scaled content since the March 5, 2024 spam update.
- content/missing-author
Missing Author
Google added the second E for Experience to its E-A-T trust framework on December 15, 2022, and content/missing-author mirrors that shift by flagging at warning severity, medium confidence, every page that exposes none of four author signals — a meta author tag, a schema author field, a byline element, or a rel=author link.
- content/eeat-signals
E-E-A-T Signals
content/eeat-signals checks four trust categories on every page — an about-page link, an author byline, a published date, and a sources or references marker — then fires at info severity for any URL carrying fewer than 2 of the 4, the anonymity pattern Google's E-E-A-T framework has weighed against pages since its December 2022 Quality Rater Guidelines update.
- content/title-uniqueness
Title Uniqueness
content/title-uniqueness rolls three checks into one rule: a missing or empty title, a title outside the 10-to-70-character band, and two or more pages sharing the exact same raw title. pseolint raised this gap to a tier-1 fix after the 2026-05-03 blind-spot audit because Google ranks the title above every other on-page element.
- content/heading-structure
Heading Structure
content/heading-structure runs three checks on every page Google crawls — a missing H1 fires an error because it is almost always a CMS or template bug, two or more H1 elements raise a warning that the HTML5 outline and accessibility checkers both dislike, and any page past 600 words with no H2 sub-structure emits an info note about Featured Snippet eligibility.
- content/image-alt-text
Image Alt Text
content/image-alt-text scans every <img> tag on a page, skips images you have explicitly marked decorative, and reports each URL where a content-bearing image carries no alt attribute at all — the accessibility gap WCAG 2.1 has required closing under success criterion 1.1.1 since June 5, 2018 and the one that keeps a page out of Google Images.
- links/orphan-pages
Orphan Pages
links/orphan-pages scans every URL in the crawl, counts the inbound internal links pointing at each one, and fires at error severity on any page with exactly 0 of them — the dead-zone shape that leaves Googlebot unable to reach a URL through your own navigation, a structural gap the March 27, 2026 core update treats as a discoverability failure rather than a content one.
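The inbound count can be sketched in a few lines. The `CrawlLink` shape and the decision to ignore self-links are assumptions made for the example, not details confirmed by the text.

```typescript
interface CrawlLink {
  from: string;
  to: string;
}

// Pages with zero inbound internal links from any other crawled page.
function findOrphanPages(pages: string[], links: CrawlLink[]): string[] {
  const inbound: Record<string, number> = Object.create(null);
  pages.forEach((page) => {
    inbound[page] = 0;
  });
  links.forEach((link) => {
    // Self-links are ignored (assumption), as are links to URLs outside the crawl.
    if (link.from !== link.to && link.to in inbound) inbound[link.to] += 1;
  });
  return pages.filter((page) => inbound[page] === 0);
}
```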
- links/dead-ends
Dead Ends
links/dead-ends flags every crawled page (the homepage aside) whose outbound links include zero URLs that point to another page in the same crawl, the forward-flow gap that strands Googlebot and traps link equity. A model-railway shop's 1,400 product listings hit this warning when each turnout and locomotive page links only out to a vendor, never deeper into the store.
- links/link-depth
Link Depth
links/link-depth runs a breadth-first search from your root URL and measures the shortest click-distance to every page, flagging anything past the default ceiling of 3 clicks as info and anything Googlebot cannot reach from the root at all as a warning, because a page Google crawls last is a page Google ranks last.
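A minimal breadth-first search showing how the two outcomes (too deep versus unreachable) fall out of one traversal. The adjacency-map input and the function names are assumptions for the sketch; the ceiling of 3 clicks is the default quoted above.

```typescript
// Shortest click-distance from the root to every reachable page.
function clickDepths(root: string, outlinks: Record<string, string[]>): Record<string, number> {
  const depth: Record<string, number> = Object.create(null);
  depth[root] = 0;
  const queue: string[] = [root];
  for (let head = 0; head < queue.length; head++) {
    const page = queue[head];
    (outlinks[page] || []).forEach((next) => {
      if (!(next in depth)) {
        depth[next] = depth[page] + 1; // first visit in BFS is the shortest path
        queue.push(next);
      }
    });
  }
  return depth;
}

function classifyByDepth(pages: string[], depth: Record<string, number>, maxDepth: number = 3) {
  return {
    tooDeep: pages.filter((p) => p in depth && depth[p] > maxDepth), // reported as info
    unreachable: pages.filter((p) => !(p in depth)), // reported as warning
  };
}
```

Because BFS visits pages in order of distance, the first time a page is seen is already its shortest click path, so no later relaxation step is needed.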
- links/cluster-connectivity
Cluster Connectivity
links/cluster-connectivity groups every crawled URL by its parent directory, and for each cluster of 2 or more pages it checks whether a single internal crawl link enters from another cluster or leaves toward one — firing a warning when neither exists, because Google cannot diffuse authority into a directory that no other section of your site references or is referenced by.
- cannibal/url-pattern
URL Pattern Cannibalization
cannibal/url-pattern splits each URL's last slug on hyphens, sorts the tokens, and flags at info severity any two pages in the same directory whose sorted token sets match exactly — the reordered-slug keyword cannibalization Google has resolved by collapsing competing URLs to one canonical result since well before its March 2026 core update.
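The slug comparison can be sketched as a grouping key. The function names are hypothetical, and the sketch assumes paths rather than full URLs and case-insensitive tokens.

```typescript
// Key = directory + the last slug's hyphen-separated tokens in sorted order.
function reorderedSlugKey(path: string): string {
  const parts = path.replace(/\/+$/, "").split("/");
  const slug = parts.pop() || "";
  const directory = parts.join("/");
  const tokens = slug
    .toLowerCase()
    .split("-")
    .filter((t) => t.length > 0)
    .sort();
  return directory + "|" + tokens.join("-");
}

// Groups of two or more paths in the same directory whose sorted tokens match.
function reorderedSlugGroups(paths: string[]): string[][] {
  const byKey: Record<string, string[]> = Object.create(null);
  paths.forEach((path) => {
    const key = reorderedSlugKey(path);
    if (!byKey[key]) byKey[key] = [];
    byKey[key].push(path);
  });
  return Object.keys(byKey)
    .map((key) => byKey[key])
    .filter((group) => group.length >= 2);
}
```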
- aeo/freshness-signals
Freshness Signals
aeo/freshness-signals checks every page for a real modification signal — a JSON-LD dateModified, an article:modified_time meta tag, or a visible 'Last updated' line — warns at medium confidence when none exists, then drops to an info note when the best date it can parse is older than the staleness default of 180 days Google has long associated with how AI Overviews weigh recency.
- aeo/llms-txt
llms.txt
llms.txt is a draft, low-adoption convention proposed in September 2024 and championed by Jeremy Howard at Answer.AI, so pseolint runs this as a low-confidence, informational site-level check that fetches /llms.txt once at your origin and verifies 3 shape rules, treating a missing file as a missed opportunity worth roughly 1 hour of work, never a defect.
- aeo/crawler-access
Crawler Access
aeo/crawler-access parses your robots.txt user-agent by user-agent and checks 8 named AI crawlers — GPTBot from OpenAI, ClaudeBot from Anthropic, PerplexityBot, Google-Extended, and four more — warning once per fully blocked bot and escalating to an error only when every one is disallowed, so blocking them stays a deliberate choice you make, not a verdict the rule hands down.
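A simplified sketch of the per-agent check. It covers only the four crawlers named above (the other four are not listed in the text), treats "fully blocked" as a `Disallow: /` with no `Allow` line in the applicable group, and falls back to the `*` group when a bot has no group of its own. All function names and the "ok" result are assumptions for illustration.

```typescript
const NAMED_AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];

interface RobotsGroup {
  agents: string[];
  disallow: string[];
  allow: string[];
}

// Consecutive User-agent lines share the rules that follow them.
function parseRobotsGroups(robotsTxt: string): RobotsGroup[] {
  const groups: RobotsGroup[] = [];
  let lastWasAgent = false;
  const lines = robotsTxt.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].replace(/#.*$/, "").trim();
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      if (groups.length === 0 || !lastWasAgent) groups.push({ agents: [], disallow: [], allow: [] });
      groups[groups.length - 1].agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      if (groups.length > 0 && value !== "") {
        const current = groups[groups.length - 1];
        if (field === "disallow") current.disallow.push(value);
        if (field === "allow") current.allow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

function isFullyBlocked(groups: RobotsGroup[], bot: string): boolean {
  const agent = bot.toLowerCase();
  const specific = groups.filter((g) => g.agents.indexOf(agent) >= 0);
  const wildcard = groups.filter((g) => g.agents.indexOf("*") >= 0);
  const applicable = specific.length > 0 ? specific : wildcard;
  return (
    applicable.some((g) => g.disallow.indexOf("/") >= 0) &&
    applicable.every((g) => g.allow.length === 0)
  );
}

// One warning per blocked bot; error only when every checked bot is blocked.
function auditCrawlerAccess(robotsTxt: string): { severity: "ok" | "warning" | "error"; blocked: string[] } {
  const groups = parseRobotsGroups(robotsTxt);
  const blocked = NAMED_AI_BOTS.filter((bot) => isFullyBlocked(groups, bot));
  const severity =
    blocked.length === 0 ? "ok" : blocked.length === NAMED_AI_BOTS.length ? "error" : "warning";
  return { severity, blocked };
}
```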
- aeo/faq-coverage
FAQ Coverage
aeo/faq-coverage flags any page that reads like an FAQ — at least 2 question-phrased H2 headings starting with how, what, or why, or a /faq, /how-to, or /what-is URL path — yet ships no FAQPage or HowTo JSON-LD, the structured-data gap that matters far more for AI extraction since Google narrowed FAQ rich results to government and health sites in August 2023.
- aeo/summary-bait
Summary Bait
aeo/summary-bait fires when 70% or more of a page's citable facts are crammed into its first 150 words and nothing fresh waits below, a low-confidence warning that the page is shaped for an AI Overviews snippet Google can lift whole rather than for a reader who scrolls past the opener.
- content/translation-no-op
Translation No-Op
content/translation-no-op groups URLs that differ only by a leading locale segment like /en/ or /fr/, computes a 64-bit SimHash of each extracted body, and fires an error the moment any pair scores at or above 95% similarity — the fake-i18n pattern Google has told site owners to fix with real hreflang pairs, not duplicated English.
- content/regurgitated-content
Regurgitated Content
content/regurgitated-content is a low-confidence v1 heuristic that fires a warning when a page shows at least 2 of 5 Google-Places-regurgitation tells — Powered by Google attribution, googleusercontent images over 60%, a Static Maps embed, Places API JavaScript, or an aggregator footprint of 5 or more unsigned star-rating blocks.
- content/common-phrase-reuse
Common Phrase Reuse
content/common-phrase-reuse scans each page against a bundled list of roughly 42 pSEO marketing clichés grouped into 5 categories — location filler, generic-marketing superlatives, aggregator phrasing, fake-authority claims, and filler hedges — and raises one low-confidence warning the moment 3 or more distinct phrases from that list appear, a speculative density signal Google's helpful-content guidance has weighted since 2024.
- content/wikipedia-paraphrase
Wikipedia Paraphrase
content/wikipedia-paraphrase fires a low-confidence warning the moment a page shares 40% or more of its three-word phrases with a bundled Wikipedia reference corpus, the trigram-overlap point at which Google's helpful-content framing reads a URL as reworded encyclopedia rather than the original analysis a March 2024 audit rewards.
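The overlap measure can be sketched as a share of distinct word trigrams. The reference text here is a single sample sentence standing in for the bundled corpus, and the tokenisation and function names are assumptions; only the 40% threshold comes from the text.

```typescript
function wordTrigrams(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  const grams: string[] = [];
  for (let i = 0; i + 2 < words.length; i++) {
    grams.push(words[i] + " " + words[i + 1] + " " + words[i + 2]);
  }
  return grams;
}

// Share of the page's distinct trigrams that also occur in the reference text.
function trigramOverlap(page: string, reference: string): number {
  const inReference: Record<string, boolean> = Object.create(null);
  wordTrigrams(reference).forEach((g) => {
    inReference[g] = true;
  });
  const pageGrams: Record<string, boolean> = Object.create(null);
  wordTrigrams(page).forEach((g) => {
    pageGrams[g] = true;
  });
  const distinct = Object.keys(pageGrams);
  if (distinct.length === 0) return 0;
  return distinct.filter((g) => inReference[g] === true).length / distinct.length;
}

const PARAPHRASE_THRESHOLD = 0.4; // threshold quoted in the text

function looksLikeParaphrase(page: string, reference: string): boolean {
  return trigramOverlap(page, reference) >= PARAPHRASE_THRESHOLD;
}
```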
- content/value-add
Value-Add Score
content/value-add is a second-pass composite that reads seven other rules' findings — originality, freshness, citable facts, the four-category E-E-A-T count, translation, cliche reuse, and Wikipedia paraphrase — weights each at one-seventh, averages them into a single 0-to-1 score, and fires an error below 50% or a critical below 30%, the synthesis SpamBrain has rewarded since the March 5, 2024 update.
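The composite arithmetic can be shown with a short sketch. How each of the seven findings is normalised to a 0-to-1 sub-score is not described in the text, so the `ValueAddInputs` fields are assumed to arrive already normalised, with 1 meaning best; the field and function names are hypothetical.

```typescript
interface ValueAddInputs {
  originality: number;
  freshness: number;
  citableFacts: number;
  eeat: number;
  translation: number;
  clicheReuse: number;
  wikipediaParaphrase: number;
}

// Equal one-seventh weights are the same as a plain average of the seven sub-scores.
function valueAddScore(inputs: ValueAddInputs): number {
  const sum =
    inputs.originality +
    inputs.freshness +
    inputs.citableFacts +
    inputs.eeat +
    inputs.translation +
    inputs.clicheReuse +
    inputs.wikipediaParaphrase;
  return sum / 7;
}

// Critical below 30%, error below 50%, otherwise no finding.
function valueAddSeverity(score: number): "critical" | "error" | null {
  if (score < 0.3) return "critical";
  if (score < 0.5) return "error";
  return null;
}
```

For example, a page scoring 1 on two inputs, 0.4 on two, and 0 on the remaining three averages 0.4 and lands at error severity.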
Per-template aggregation — how rules feed verdicts
The engine audits by template rather than by URL. Phase 1 detects URL templates (filter ≥1% of URLs, ≥5 URLs, ≥2 survivors after deduplication). Phase 2 samples pages stratified across templates and runs all 47 rules. Each rule's output per template is summarised as a uniformity score (0–1) and a top driver: the single rule responsible for the most findings on that template. The site verdict is determined by siteVerdictFromTemplates: the worst template that covers ≥5% of the site's URLs (spec §15.1). Three aggregation patterns apply:
- Per-page → template uniformity score (most spam/* and content/* rules). The rule fires on each sampled page; the fire rate becomes the template's score for that rule.
- Corpus-wide (spam/near-duplicate). Computed across all sampled pages regardless of template, so it surfaces cross-template duplication, not just duplication within a single template.
- Per-page → template-level signal (aeo/* rules). A rule that fires on 8/10 pages of a template reports one template-level finding, not 8 individual URL findings.
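The two calculations described in this section, the per-template fire rate and the choice of the site verdict, can be sketched as below. This is not the actual siteVerdictFromTemplates implementation: the `TemplateSummary` shape, the numeric `severityRank` (higher is worse), and the function names are assumptions for illustration. Only the 5% coverage cut-off comes from the text.

```typescript
interface TemplateSummary {
  pattern: string; // e.g. "/city/:slug"
  urlCount: number; // URLs on the site that match this template
  severityRank: number; // hypothetical ordering: higher means a worse verdict
}

// Uniformity score for one rule on one template: share of sampled pages where it fired.
function fireRate(firedOnPage: boolean[]): number {
  if (firedOnPage.length === 0) return 0;
  return firedOnPage.filter((fired) => fired).length / firedOnPage.length;
}

// Worst template among those covering at least minCoverage of the site's URLs.
function worstCoveredTemplate(
  templates: TemplateSummary[],
  totalUrls: number,
  minCoverage: number = 0.05
): TemplateSummary | null {
  let worst: TemplateSummary | null = null;
  for (let i = 0; i < templates.length; i++) {
    const candidate = templates[i];
    if (candidate.urlCount / totalUrls < minCoverage) continue; // too small to set the site verdict
    if (worst === null || candidate.severityRank > worst.severityRank) worst = candidate;
  }
  return worst;
}
```

The coverage filter is what keeps a small, badly scoring section from setting the verdict for the whole site: a template with 3% of URLs is still reported, but a template with 7% can decide the site-level result.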
How the rules map to SpamBrain
The rule set clusters around the major axes Google's SpamBrain classifier scores against. Spam/* (8 rules) covers the patterns the March 27, 2026 core update demotes most aggressively — the most recent classifier shift to hit pSEO, tightening scaled-content signals on date-stacked corpora — building on the March 5, 2024 scaled-content-abuse update that first targeted thin content under 300 words, doorway clusters with shared boilerplate, near-duplicate templates with >85% lexical overlap, templates that don't vary their structural skeleton, and corpus-aware publication-velocity (the threshold scales with corpus size, so a 50,000-page directory and a 50-page blog get appropriate cutoffs). Content/* (4 rules) checks unique value, meta-description uniqueness after entity masking, author signals, and E-E-A-T markers. Aeo/* (8 rules, shipped April 21, 2026) audits answer-engine readiness — citable facts, atomic Q&A blocks, freshness signals, AI-crawler access, and the things Perplexity, ChatGPT, and Google's AI Overviews actually extract.
The remaining categories are links/* (6 rules — orphan pages, dead ends, cluster connectivity, link depth, unreachable-from-root, and host-section-divergence — the last one detects sub-sections that ride a host's reputation without integrating into it, which is the May 2024 site-reputation-abuse policy target), tech/* (4 rules — canonical consistency, sitemap completeness, soft-404, and redirect chains), schema/* (3 rules — JSON-LD validity, required-fields by type, and cross-page consistency), data/* (2 rules — missing-binding and identical-across-pages, fired when --data-source is set), and cannibal/* (1 rule — url-pattern; title-overlap and keyword-collision were dropped in v0.4 due to high false-positive rates).
What makes a rule "AEO-aligned"
The 2022 SpamBrain rebuild changed what enforcement looks like — instead of waiting for a manual reviewer to hit a domain with a policy action, the classifier silently suppresses pages it scores as spam-like at query time. That means the old "wait for the manual action notice" playbook is dead; you have to anticipate the scoring. An AEO-aligned rule is one whose detection logic also predicts whether AI Overviews and answer engines will cite the page — because the same signals (entity grounding, citable facts, atomic structure, schema integrity) drive both classical ranking and extraction by LLM-powered SERPs.
The full rule registry is open source at github.com/ouranos-labs/pseolint, and Google's underlying spam policies are documented at developers.google.com/search/docs/essentials/spam-policies. Every rule in pseolint links back to the specific policy paragraph it implements, so you can see exactly which Google guideline a finding maps to.
Run the full rule set on your site
The rules above are the ones most likely to fire on a templated site. The fastest way to see which ones actually fire on yours — and which template they're dragging down — is to run a free audit. No account required, results in under sixty seconds, per-template verdict included.
Each rule ships as an independent ESM module with deterministic fingerprinting, configurable thresholds via pseolint.config.ts, and a documented severity ladder (info → warning → error → critical) that maps to fixed integer penalty weights consumed by the composite-score reducer in packages/core/auditor.ts.
Provenance footnote: ruleId namespaces are a stable contract from v0.4 forward; reintroduced rules retain their identifier or get a version-suffixed sibling. Suppression by classification is opt-out via --strict.