SpamBrain rules — what pseolint detects.
pseolint v0.6 runs 32 rules across 8 categories — spam-pattern detection (8 spam/*), AEO/answer-engine readiness (8 aeo/*), graph integrity (6 links/* including host-section-divergence, the May 2024 site-reputation-abuse detector), technical SEO (4 tech/*), content quality (4 content/*), structured data (3 schema/*), data-binding consistency (2 data/*), and cannibalization (1 cannibal/*). In v0.6, every rule fires per sampled page and aggregates into a per-template verdict — not a per-URL list.
These rules cover programmatic-SEO patterns and AI Overview readiness. They don't replace a general SEO audit: for Core Web Vitals use PageSpeed Insights, and for broken-link scanning use Sitebulb ($35/mo) or Screaming Frog ($259/yr).
Featured deep-dive explainers below; full taxonomy, per-template aggregation model, and SpamBrain mapping further down.
- spam/thin-content
Thin Content Detection
Google's Helpful Content System (rebuilt August 25, 2022) demoted an estimated 45% of low-effort pages in the March 5, 2024 scaled-content-abuse update. The spam/thin-content rule targets the same pattern: by default it flags every URL with fewer than 300 words of substantive body text, counted after nav and footer chrome is stripped via SpamBrain-style readability heuristics.
- spam/doorway-pattern
Doorway Pages
Google has banned doorway pages since the March 16, 2015 Search Central post. The spam/doorway-pattern rule in pseolint mirrors SpamBrain's convergence logic: three independent signals must stack (SimHash near-duplicate similarity above 0.85, entity-swap, and structural confirmation) before it fires at error severity (weight 25). It is the highest-confidence spam pattern reported by @pseolint/core v0.4.3.
- spam/near-duplicate
Near-Duplicate Pages
85% SimHash similarity is the pseolint default threshold, and every page pair at or above it is flagged. The threshold mirrors the near-duplicate canonicalisation ceiling that Google's web indexing team has used since adopting Charikar's 2002 SimHash paper in 2007, and which the March 5, 2024 scaled-content-abuse update reaffirmed as policy via SpamBrain's 60-second triage queue.
- spam/boilerplate-ratio
Boilerplate Ratio
60% is the default boilerplateMaxRatio: pseolint identifies sentence-level blocks appearing on 80%+ of pages, then flags any URL where those repeated blocks make up more than that share of the page's word count (warning severity, weight 12).
- spam/template-diversity
Template Diversity
30% is the default minUniqueRatio threshold — pseolint warns when fewer than 30% of pages carry a structurally distinct HTML skeleton, the floor at which SpamBrain (rebuilt August 25, 2022) starts reading a domain as one template rather than N designed pages.
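The boilerplate-ratio check described above can be sketched roughly as follows. This is a minimal illustration, not pseolint's implementation: the `Page` type, the function names, and the naive sentence splitter are all assumptions, and only the 80% and 60% defaults come from the description above.

```typescript
// Illustrative sketch only: type and function names are assumptions, not pseolint's API.
type Page = { url: string; text: string };

// Naive sentence splitter (assumption): break on ., ! or ? and normalise case.
function splitSentences(text: string): string[] {
  return text
    .split(/[.!?]+/)
    .map((s) => s.trim().toLowerCase())
    .filter((s) => s.length > 0);
}

function countWords(sentence: string): number {
  return sentence.split(/\s+/).filter((w) => w.length > 0).length;
}

function flagBoilerplate(
  pages: Page[],
  sharedOnRatio = 0.8, // a sentence counts as boilerplate when it appears on 80%+ of pages
  boilerplateMaxRatio = 0.6, // default boilerplateMaxRatio quoted above
): string[] {
  // For each distinct sentence, count how many pages contain it at least once.
  const pagesContaining = new Map<string, number>();
  for (const page of pages) {
    new Set(splitSentences(page.text)).forEach((sentence) => {
      pagesContaining.set(sentence, (pagesContaining.get(sentence) ?? 0) + 1);
    });
  }
  const isBoilerplate = (sentence: string): boolean =>
    (pagesContaining.get(sentence) ?? 0) / pages.length >= sharedOnRatio;

  // Flag pages whose word count is mostly made up of boilerplate sentences.
  const flagged: string[] = [];
  for (const page of pages) {
    const sentences = splitSentences(page.text);
    const total = sentences.reduce((n, s) => n + countWords(s), 0);
    const repeated = sentences
      .filter(isBoilerplate)
      .reduce((n, s) => n + countWords(s), 0);
    if (total > 0 && repeated / total > boilerplateMaxRatio) {
      flagged.push(page.url);
    }
  }
  return flagged;
}
```

Splitting on punctuation is crude (it breaks on decimals such as 0.85), so a real check would likely work on extracted DOM blocks instead of raw text.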
Per-template aggregation — how rules feed verdicts
In v0.6, the engine audits by template rather than by URL. Phase 1 detects URL templates (filter ≥1% of URLs, ≥5 URLs, ≥2 survivors after deduplication). Phase 2 samples K=10 URLs per template and runs all 32 rules. Each rule's output per template is summarised as a uniformity score (0–1), and each template also gets a top driver: the single rule responsible for the most findings on that template. The site verdict is determined by siteVerdictFromTemplates: the worst template that covers ≥5% of the site's URLs (spec §15.1). Three aggregation patterns apply:
- Per-page → template uniformity score (most spam/* and content/* rules). The rule fires on each sampled page; the fire rate becomes the template's score for that rule.
- Corpus-wide (spam/near-duplicate). Computed across all sampled pages regardless of template, so it surfaces cross-template duplication, not just duplication within a single template.
- Per-page → template-level signal (aeo/* rules). A rule that fires on 8/10 pages of a template reports one template-level finding, not 8 individual URL findings.
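The aggregation described above can be sketched as follows. This is a hedged illustration only: the `Verdict` ladder, the type shapes, and the function names `fireRate`, `topDriver`, and `siteVerdictSketch` are hypothetical, and the real siteVerdictFromTemplates may behave differently. Only the fire-rate idea, the top-driver definition, and the 5% coverage floor come from the text.

```typescript
// Hypothetical shapes; pseolint's real types may differ.
type Verdict = "pass" | "warn" | "fail"; // assumed three-level ladder
type RuleCount = { ruleId: string; findings: number };
type TemplateResult = {
  template: string;
  urlCount: number; // number of site URLs matching this template
  verdict: Verdict;
};

// Per-page -> template score: share of sampled pages on which the rule fired.
function fireRate(firedPerPage: boolean[]): number {
  if (firedPerPage.length === 0) return 0;
  return firedPerPage.filter((fired) => fired).length / firedPerPage.length;
}

// Top driver: the rule with the most findings on a template (undefined if none fired).
function topDriver(counts: RuleCount[]): string | undefined {
  let best: RuleCount | undefined;
  for (const c of counts) {
    if (c.findings > 0 && (best === undefined || c.findings > best.findings)) {
      best = c;
    }
  }
  return best?.ruleId;
}

const verdictRank: Record<Verdict, number> = { pass: 0, warn: 1, fail: 2 };

// Site verdict: the worst verdict among templates covering at least 5% of site URLs.
function siteVerdictSketch(
  templates: TemplateResult[],
  totalUrls: number,
  minCoverage = 0.05,
): Verdict {
  let worst: Verdict = "pass";
  for (const t of templates) {
    const covers = totalUrls > 0 && t.urlCount / totalUrls >= minCoverage;
    if (covers && verdictRank[t.verdict] > verdictRank[worst]) {
      worst = t.verdict;
    }
  }
  return worst;
}
```

Under these assumptions, a failing template that covers only 2% of URLs does not drag down the site verdict, which is the point of the coverage floor.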
How the rules map to SpamBrain
The rule set clusters around the major axes Google's SpamBrain classifier scores against. The spam/* rules (8) cover the patterns the March 27, 2026 core update demotes most aggressively. That update is the most recent classifier shift to hit pSEO, tightening scaled-content signals on date-stacked corpora. It builds on the March 5, 2024 scaled-content-abuse update, which first targeted thin content under 300 words, doorway clusters with shared boilerplate, near-duplicate templates with >85% lexical overlap, templates that don't vary their structural skeleton, and corpus-aware publication velocity (the threshold scales with corpus size, so a 50,000-page directory and a 50-page blog get appropriate cutoffs). The content/* rules (4) check unique value, meta-description uniqueness after entity masking, author signals, and E-E-A-T markers. The aeo/* rules (8, shipped April 21, 2026) audit answer-engine readiness: citable facts, atomic Q&A blocks, freshness signals, AI-crawler access, and the things Perplexity, ChatGPT, and Google's AI Overviews actually extract.
The remaining categories are:
- links/* (6 rules): orphan pages, dead ends, cluster connectivity, link depth, unreachable-from-root, and host-section-divergence. The last one detects sub-sections that ride a host's reputation without integrating into it, which is the May 2024 site-reputation-abuse policy target.
- tech/* (4 rules): canonical consistency, sitemap completeness, soft-404, and redirect chains.
- schema/* (3 rules): JSON-LD validity, required fields by type, and cross-page consistency.
- data/* (2 rules): missing-binding and identical-across-pages, fired when --data-source is set.
- cannibal/* (1 rule): url-pattern. The title-overlap and keyword-collision rules were dropped in v0.4 due to high false-positive rates.
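Several of the links/* checks named above (link depth, unreachable-from-root, dead ends) are graph-traversal questions. A minimal sketch, assuming the internal link graph is available as an adjacency map, might look like the following. The `LinkGraph` type and all function names are hypothetical and not taken from pseolint.

```typescript
// Sketch only: the graph shape and function names are assumptions for illustration.
type LinkGraph = Map<string, string[]>; // page URL -> internal URLs it links to

// Breadth-first search from the root gives each reachable page its click depth.
function linkDepths(graph: LinkGraph, root: string): Map<string, number> {
  const depth = new Map<string, number>();
  depth.set(root, 0);
  const queue: string[] = [root];
  let current = queue.shift();
  while (current !== undefined) {
    const d = depth.get(current) ?? 0;
    for (const next of graph.get(current) ?? []) {
      if (!depth.has(next)) {
        depth.set(next, d + 1);
        queue.push(next);
      }
    }
    current = queue.shift();
  }
  return depth;
}

// Pages present in the graph that the traversal never reached.
function unreachableFromRoot(graph: LinkGraph, root: string): string[] {
  const depth = linkDepths(graph, root);
  const missing: string[] = [];
  graph.forEach((_links, page) => {
    if (!depth.has(page)) missing.push(page);
  });
  return missing;
}

// Pages with no outgoing internal links.
function deadEnds(graph: LinkGraph): string[] {
  const ends: string[] = [];
  graph.forEach((links, page) => {
    if (links.length === 0) ends.push(page);
  });
  return ends;
}
```

Breadth-first search is used because the first time it reaches a page is also the shortest click path to it, so the recorded depth is the minimum link depth.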
What makes a rule "AEO-aligned"
The 2022 SpamBrain rebuild changed what enforcement looks like — instead of waiting for a manual reviewer to hit a domain with a policy action, the classifier silently suppresses pages it scores as spam-like at query time. That means the old "wait for the manual action notice" playbook is dead; you have to anticipate the scoring. An AEO-aligned rule is one whose detection logic also predicts whether AI Overviews and answer engines will cite the page — because the same signals (entity grounding, citable facts, atomic structure, schema integrity) drive both classical ranking and extraction by LLM-powered SERPs.
The full rule registry is open source at github.com/ouranos-labs/pseolint, and Google's underlying spam policies are documented at developers.google.com/search/docs/essentials/spam-policies. Every rule in pseolint links back to the specific policy paragraph it implements, so you can see exactly which Google guideline a finding maps to.
Run the full rule set on your site
The rules above are the ones most likely to fire on a templated site. The fastest way to see which ones actually fire on yours — and which template they're dragging down — is to run a free audit. No account required, results in under sixty seconds, per-template verdict included.
Each rule ships as an independent ESM module with deterministic fingerprinting, configurable thresholds via pseolint.config.ts, and a documented severity ladder (info → warning → error → critical) that maps to fixed integer penalty weights consumed by the composite-score reducer in packages/core/auditor.ts.
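As a rough illustration of how a severity ladder could feed a composite score, consider the sketch below. It is not the reducer in packages/core/auditor.ts: the `Finding` shape, the 100-point baseline, the clamping, and the info and critical weights are all assumptions. Only the warning (12) and error (25) weights match figures quoted earlier on this page.

```typescript
// Hypothetical sketch; the real reducer and its types are not shown in this document.
type Severity = "info" | "warning" | "error" | "critical";
type Finding = { ruleId: string; severity: Severity };

// warning = 12 and error = 25 match the weights quoted for the boilerplate and
// doorway rules; the info and critical values are placeholders chosen for illustration.
const PENALTY_BY_SEVERITY: Record<Severity, number> = {
  info: 1,
  warning: 12,
  error: 25,
  critical: 40,
};

// Assumed reducer: start from 100, subtract each finding's penalty, clamp at 0.
function compositeScore(findings: Finding[]): number {
  const totalPenalty = findings.reduce(
    (sum, f) => sum + PENALTY_BY_SEVERITY[f.severity],
    0,
  );
  return Math.max(0, 100 - totalPenalty);
}
```

With these placeholder numbers, one error plus one warning would leave a score of 63. Keeping the weights as fixed integers makes the result independent of the order in which findings are reduced.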
Provenance footnote: ruleId namespaces are a stable contract from v0.4 forward; reintroduced rules retain their identifier or get a version-suffixed sibling. Suppression by classification is opt-out via --strict.