How does pseolint calibrate its verdicts?

pseolint adds predictive validity on top of face validity: it audits a curated corpus of in-production programmatic-SEO sites that demonstrably win in search, and treats any deviation between its verdict and a site's real-world ranking success as a bug in the engine, not the site. The corpus, runner, and regression tests are open-source and reproducible at a fixed sample seed.

The engine shifts the unit of analysis from URL to template. Phase 1 clusters sitemap URLs into templates (≥1% coverage, ≥5 URLs, ≥2 surviving templates). Phase 2 stratified-samples pages across templates — up to 200 (free / monitoring) or 500 (manual re-audit) — and runs every rule per template. The site verdict is the worst template verdict among templates covering ≥5% of URLs — so a tiny page can't tank the site and a dominant broken template can't hide behind a clean one.

Does pseolint measure domain authority?

No. pseolint is a static-content and link-graph analyzer — it reads what pages say, how they link, and how they nest. It does not check backlinks, brand mentions, domain age, or any external trust signal, and has no Moz/Ahrefs/Semrush dependency, by design, so it can run offline against a build directory. Pass --authority-score (0-100) to shift the verdict ladder for your authority tier.

What does a passing verdict mean?

A pass means pseolint's rules don't false-positive on shapes that high-authority sites successfully ship. A fail often means the engine correctly identified a real issue (duplicate titles, redirect chains, missing OG tags) that the site can absorb because of its authority — not that the engine is wrong.

What stays true regardless of which sites pseolint audits?

Doorway-pattern findings collapse to one cluster line instead of per-pair noise; sample-seed determinism makes verdicts reproducible (mulberry32 PRNG); info-severity findings are capped per bucket so they can't tank a verdict on their own; severity demotions are auditable via summary.appliedSeverityDemotions; and the spec, corpus, runner, and regression tests are open-source so anyone can re-run the calibration.

Methodology · v0.7.4 · for skeptical engineers

How pseolint's verdicts are calibrated.

Most SEO audit tools rely on face validity — "these rules look like they map to documented Google policy." pseolint adds predictive validity: we audit a curated corpus of in-production pSEO sites that demonstrably win in search, and treat any deviation between our verdict and the site's real-world ranking success as a bug in our engine, not in the site.

// pipeline

How audits work — two-phase template pipeline

pseolint pivots the unit of analysis from URL to template. A 100k-URL directory no longer averages findings across a flat 200-URL sample — instead, each template cluster is audited independently and produces its own verdict.

Phase 1 — Template detection

Cluster sitemap URLs by signature (e.g. /listing/:slug). Filter to clusters with ≥1% coverage of total discovered URLs and ≥5 URLs. Require ≥2 surviving templates to activate the template path — single-template sites fall through to the legacy per-URL view (spec §15.3). Cost: ~T HTTP fetches (T = template count, typically 5–20). Cheap.
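The filter described above can be sketched in a few lines. This is a minimal illustration, not pseolint's clusterUrlTemplates(): the names signatureOf and detectTemplates and the "last path segment becomes :slug" heuristic are assumptions made for the example. Only the thresholds (≥1% coverage, ≥5 URLs, ≥2 survivors) come from the text.

```typescript
// Hypothetical sketch: names and the signature heuristic are assumptions.
interface Template {
  signature: string;
  totalUrls: number;
}

// Assumed heuristic: collapse the last path segment into ":slug"
// whenever the path has two or more segments.
function signatureOf(url: string): string {
  const segments = new URL(url).pathname.split("/").filter((s) => s.length > 0);
  if (segments.length < 2) return "/" + segments.join("/");
  return "/" + segments.slice(0, -1).join("/") + "/:slug";
}

// Returns the surviving templates, or null when fewer than two survive
// (the caller would then fall back to the per-URL view).
function detectTemplates(urls: string[]): Template[] | null {
  const counts = new Map<string, number>();
  for (const url of urls) {
    const sig = signatureOf(url);
    counts.set(sig, (counts.get(sig) ?? 0) + 1);
  }
  const survivors: Template[] = [];
  counts.forEach((totalUrls, signature) => {
    const ratio = totalUrls / urls.length;
    // Thresholds from the text: at least 1% coverage and at least 5 URLs.
    if (ratio >= 0.01 && totalUrls >= 5) survivors.push({ signature, totalUrls });
  });
  return survivors.length >= 2 ? survivors : null;
}
```

A real signature function would likely need to handle numeric IDs, locale prefixes, and query strings; the point here is only the shape of the coverage filter.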

Phase 2 — Per-template deep audit

Stratified-sample pages across templates, up to 200 (free / scheduled monitoring) or 500 (manual re-audit). Run all 47 rules on each sample set. Compute per-template risk, verdict, and variance metric. The page budget is spread across templates so a dominant cluster can't crowd out coverage of the smaller ones.
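One way to spread a page budget so that a dominant cluster cannot crowd out smaller ones is an equal split with leftovers redistributed. The sketch below assumes that scheme; the page does not specify the actual allocation rule, and allocateBudget and TemplateSize are hypothetical names.

```typescript
// Hypothetical allocation: equal shares, capped by template size,
// with unused budget redistributed to templates that still have pages.
interface TemplateSize {
  signature: string;
  totalUrls: number;
}

function allocateBudget(templates: TemplateSize[], budget: number): Map<string, number> {
  const alloc = new Map<string, number>();
  templates.forEach((t) => alloc.set(t.signature, 0));
  let remaining = budget;
  let open = templates.slice();
  while (remaining > 0 && open.length > 0) {
    // Every open template is offered the same share in this round.
    const share = Math.max(1, Math.floor(remaining / open.length));
    const stillOpen: TemplateSize[] = [];
    for (const t of open) {
      if (remaining === 0) break;
      const current = alloc.get(t.signature) ?? 0;
      const grant = Math.min(share, t.totalUrls - current, remaining);
      alloc.set(t.signature, current + grant);
      remaining -= grant;
      // A template stays open only while it has unsampled pages left.
      if (current + grant < t.totalUrls) stillOpen.push(t);
    }
    open = stillOpen;
  }
  return alloc;
}
```

With a 200-page budget and templates of 97,000, 2,000, and 40 URLs, this scheme gives 80, 80, and 40 pages respectively: the 40-URL template is sampled in full and the rest is split evenly.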

Aggregation — site verdict

Site verdict = worst template verdict, filtered to templates with ≥5% URL coverage (spec §15.1). A tiny /about page at critical doesn't tank the site. A /listing/* template covering 97% of the site does. Templates below 5% still appear in the dashboard drill-down; they just don't drive the headline.
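The aggregation rule is small enough to sketch directly. The ladder order below uses the four verdicts named on this page (ready, caution, concerning, critical). The type names, the siteVerdict function, and the fallback to ready when no template reaches 5% coverage are assumptions for illustration.

```typescript
// Verdict ladder, best to worst, using the four verdicts named on this page.
type Verdict = "ready" | "caution" | "concerning" | "critical";
const LADDER: readonly Verdict[] = ["ready", "caution", "concerning", "critical"];

interface TemplateVerdict {
  signature: string;
  coverage: number; // share of discovered URLs, 0..1
  verdict: Verdict;
}

// Worst verdict among templates covering at least 5% of URLs.
// Assumption: if no template qualifies, the sketch returns "ready".
function siteVerdict(templates: TemplateVerdict[]): Verdict {
  let worst: Verdict = "ready";
  for (const t of templates) {
    if (t.coverage < 0.05) continue; // small templates never drive the headline
    if (LADDER.indexOf(t.verdict) > LADDER.indexOf(worst)) worst = t.verdict;
  }
  return worst;
}
```

For example, an /about template at 1% coverage with a critical verdict is ignored, while a /listing/:slug template at 97% coverage with caution sets the site verdict to caution.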

Variance metric — uniformity score

For each template: uniformity = 1 - mean(stdev(per-rule fire-rates)). High uniformity (≥0.7) = every sample has the same problems — template is broken uniformly, one structural fix helps all N pages. Low uniformity = problem is data-quality-dependent, not template-structural. Surfaces in the template card as a colour-coded bar (green / yellow / red at 0.7 / 0.4 thresholds).
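The formula above leaves some room for interpretation. The sketch below takes one reading: for each rule, record a per-page fire rate across the sampled pages, take the population standard deviation across pages, then average those deviations over rules. The function names and the input layout (one row per rule, one column per sampled page) are assumptions; the 0.7 and 0.4 colour thresholds come from the text.

```typescript
// Population standard deviation; returns 0 for an empty list.
function populationStdev(values: number[]): number {
  if (values.length === 0) return 0;
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, b) => a + (b - mean) * (b - mean), 0) / values.length;
  return Math.sqrt(variance);
}

// fireRates[rule][page]: assumed layout, e.g. 1 if the rule fired on that page, else 0.
function uniformityScore(fireRates: number[][]): number {
  if (fireRates.length === 0) return 1;
  const stdevs = fireRates.map(populationStdev);
  const meanStdev = stdevs.reduce((a, b) => a + b, 0) / stdevs.length;
  return 1 - meanStdev;
}

// Colour thresholds from the text: green at 0.7 and above, yellow at 0.4 and above.
function uniformityColour(score: number): "green" | "yellow" | "red" {
  if (score >= 0.7) return "green";
  if (score >= 0.4) return "yellow";
  return "red";
}
```

Under this reading, a template where every sampled page trips the same rules scores 1 (one structural fix helps every page), and a rule that fires on exactly half the pages contributes a deviation of 0.5.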

sitemap URLs (100k)
        │
        ▼ Phase 1 — Template detection (~T fetches)
  clusterUrlTemplates()
        │  filter: ratio ≥ 1%, count ≥ 5, ≥ 2 survivors
        ▼
  Template[] { signature, totalUrls }
        │
        ▼ Phase 2 — Per-template deep audit (stratified sample)
  for each template:
    sample (stratified)  →  fetch + parse  →  run 47 rules
    compute: risk, verdict, uniformityScore, topDriver
        │
        ▼ Aggregation
  siteVerdict = worst(templates where coverage ≥ 5%)
  AuditResult { templates[], findings[], verdict, risk }

// documentation note

Read this as engineering reference. Not a testimonial page.

These sites: Public pSEO sites we audited without permission. Not pseolint customers. Have not endorsed pseolint, and we don't imply they have. Picked because they demonstrably win in search — useful as a ground-truth calibration target.
These numbers: Point-in-time engine-validation data for skeptical engineers. Not customer-success metrics. When pseolint has actual customer recovery stories, they will live elsewhere with explicit consent and attribution.
A pass means: Our rules don't false-positive on shapes that high-authority sites successfully ship. A fail often means the engine correctly identified a real issue (duplicate titles, redirect chains, missing OG tags) that the site can absorb because of authority — not that the engine is wrong. See "How to read our verdict" below.

License: MIT. github.com/ouranos-labs/pseolint
Reproducibility: --sample-seed 1729 — same seed, same audit, same verdict
Verifiability: bun run scripts/calibration-reputable-pseo.ts — reruns against the same corpus
Limitations: Documented inline. Trade-offs and blind spots, with roadmap fixes, in the sections below.

Last calibrated: May 3, 2026 · Engine: v0.7.4 · Ruleset version 12 · Sample seed 1729

Next refresh target: August 3, 2026 — quarterly cadence. Numerical results below are point-in-time; they will drift as sites redesign or as our engine evolves. Methodology and trade-offs (the durable claims) stay stable across refreshes.

The reputable-pSEO corpus

Twelve programmatic-SEO sites curated to span verticals (integration directories, currency-pair converters, template galleries, category directories, city-level cost-of-living indices) and ranking strength. Every entry has a documented "ground-truth ceiling" — the verdict pseolint must produce *or better* for the engine to be considered correctly calibrated. The corpus, the runner, and the regression tests are open-source.

corpus json · runner · full spec (9-round iteration story)

Snapshot results

Captured 2026-05-03. Subject to drift; re-run any time.

Each entry: site · vertical · verdict · ground-truth ceiling · risk score.

zapier.com/apps/slack/integrations · integration directory · concerning · ceiling ≤ caution · risk 55
Real finding: tech/canonical-consistency × 8 (mixed error+info) on integration pages with tracking parameters.

g2.com/categories · software directory · ready · ceiling ≤ caution · risk 18

wise.com/us/currency-converter · currency pair · caution · ceiling ≤ caution · risk 26

webflow.com/templates · template gallery · caution · ceiling ≤ caution · risk 38

typeform.com/templates · template gallery · critical · ceiling ≤ caution · risk 61
Real finding: content/title-uniqueness × 6: multiple template gallery cards share the exact same title. Google ranks Typeform despite this because of authority; lower-DA operators with the same pattern would be demoted.

segment.com/integrations · integration directory · critical · ceiling ≤ caution · risk 61
Real findings: spam/boilerplate-ratio × 25 (warning) + tech/redirect-chain × 25 (warning). Cumulative impact on integrity bucket pushes critical.

jasper.ai/templates · template gallery · caution · ceiling ≤ caution · risk 37

ramp.com/spend-management · category directory · ready · ceiling ≤ caution · risk 19

Four more corpus sites (nerdwallet, numbeo, airbyte, stripe.com/atlas) are excluded from the snapshot because they consistently fail to fetch under our crawler; see "Excluded sites (and why)" below.

Three sites score worse than their ceiling on the snapshot date. In every case, the drivers are real findings, not engine noise:

Zapier (concerning) — tech/canonical-consistency fires on integration pages with tracking parameters that don't canonicalise back to the parameter-free URL. Real tech-debt.
Typeform (critical) — content/title-uniqueness fires on 6 template-gallery cards that share the exact same title (templates with different content but identical heading text). The new title-uniqueness rule (v0.5.2) caught this. A site without Typeform's authority would lose rankings on those duplicate-titled pages.
Segment (critical) — spam/boilerplate-ratio and tech/redirect-chain both fire on 25 of 41 sampled pages. Catalog-shape findings, but real ones.
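The Zapier finding above concerns pages reached with tracking parameters whose canonical does not point back to the parameter-free URL. A simplified version of that check can be sketched as follows. This is not the tech/canonical-consistency rule itself: the parameter list and the function names are assumptions chosen for illustration.

```typescript
// Assumed, illustrative list of tracking parameters.
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"];

// Remove tracking parameters and the fragment from a page URL.
function stripTracking(pageUrl: string): string {
  const url = new URL(pageUrl);
  for (const p of TRACKING_PARAMS) url.searchParams.delete(p);
  url.hash = "";
  return url.toString();
}

// True when the canonical href (possibly relative) resolves to the
// parameter-free form of the page URL.
function canonicalIsConsistent(pageUrl: string, canonicalHref: string): boolean {
  const expected = stripTracking(pageUrl);
  const actual = new URL(canonicalHref, pageUrl).toString();
  return actual === expected;
}
```

A production rule would also need to decide how to treat trailing slashes, host casing, and legitimate non-tracking parameters; the sketch compares serialized URLs exactly.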

We did NOT update the ceiling on these sites to force a pass. That would be calibration laundering. The fail status reflects the engine's honest reading; what these sites get away with is tied to authority, not content quality. Pass --authority-score 80 when auditing these sites and the verdict shifts one tier lenient (concerning → caution; critical → concerning). That is the bring-your-own-DA mechanism documented under Limitations below.
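The one-tier shift can be pictured as an index move on the verdict ladder. The sketch below is an assumption-heavy illustration: the page only states that a score of 80 shifts the verdict one tier lenient, so the cutoff parameter lenientAt and the function name applyAuthorityShift are hypothetical, and any stricter shift for low scores is left out because its thresholds are not documented here.

```typescript
type Verdict = "ready" | "caution" | "concerning" | "critical";
const LADDER: readonly Verdict[] = ["ready", "caution", "concerning", "critical"];

// Hypothetical: lenientAt is an assumed cutoff, not a documented value.
// The raw risk number is untouched; only the verdict tier moves.
function applyAuthorityShift(
  raw: Verdict,
  authorityScore?: number,
  lenientAt: number = 80,
): Verdict {
  if (authorityScore === undefined || authorityScore < lenientAt) return raw;
  const shifted = Math.max(0, LADDER.indexOf(raw) - 1); // one tier lenient, floor at "ready"
  return LADDER[shifted] ?? raw;
}
```

With these assumptions, applyAuthorityShift("critical", 80) yields concerning and applyAuthorityShift("concerning", 80) yields caution, matching the two shifts quoted above.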

Trade-offs we accepted

Calibration is two-axis. We measured both.

+ reputable-pSEO axis

False-positive rate on reputable pSEO went from 33% → 78% over 9 calibration rounds. Cluster collapse killed the 276-line doorway noise. Sample-seed made verdicts reproducible. Info findings can't tank a verdict on their own.

− borderline-quality axis

On the secondary weak-pSEO dogfood corpus, two sites previously predicted caution now score ready: wordpress.com (defensible — polished marketing site) and expatistan.com (3-page sample artifact). One regression the other direction: nextjs.org ready → caution on real cross-domain canonical findings.

The primary failure mode of v0.5.1 was false-positives on real working pSEO sites — which destroyed credibility with the audience that actually uses the tool. The new false negatives are at the verdict-ladder boundary; neither says "you're great" while genuinely being a spam farm.

Full discussion in the v0.5.2 CHANGELOG's Trade-offs (read this) section.

Limitations: domain authority is a blind spot

pseolint does not measure domain authority. The engine is a static-content + link-graph analyzer. It reads what your pages say, how they link, and how they nest. It does not check backlinks, brand mentions, age, or any external trust signal that Google's quality systems weight heavily. There is no DA/DR/Moz/Ahrefs integration. That is by design: pseolint is meant to be runnable offline against a build directory or a local dev server with no third-party API dependency.

This matters for how to read the verdict. The reputable-pSEO calibration corpus is biased toward high-authority domains — Zapier, G2, Wise, NerdWallet, Webflow are all well-established brands with strong backlink profiles. Their 200-word integration pages or template-gallery cards rank because they are those brands. The same content shape on a 6-month-old startup would be treated very differently by Google.

So:

If you operate at a comparable authority tier to the corpus (established brand, strong backlink profile, named editorial leadership), treat the verdict literally — a caution or worse is genuinely worth investigating.
If you are a newer or lower-authority operator, treat the verdict as a directional minimum, not a literal ceiling. Even a ready verdict on a structurally-fine page may not rank if your domain hasn't earned the authority to overcome thin or templated content. Conversely, fixing the issues pseolint flags is a necessary but not sufficient condition for ranking.

We intentionally do not fold third-party authority APIs into the engine because (a) it would make the tool dependent on paid SaaS, (b) the metrics differ across providers (Moz DA vs Ahrefs DR vs Semrush AS), and (c) Google uses signals none of those approximate well. Instead, bring-your-own-authority ships today: pass a normalized 0-100 authority score (the --authority-score flag, the MCP parameter, or a per-domain Pro setting) and the verdict ladder adjusts for your tier — the raw risk number is unchanged. Still future work: proxy-signal detection (domain age, internal-graph density, named editorial leadership) for callers without an authority figure of their own.

Tier-1 blind spots

What we don't detect, and where it's on the roadmap.

Domain authority

shipped v0.5.2

--authority-score lets callers shift verdict for high/low-DA tiers.

Core Web Vitals (LCP/INP/CLS)

roadmap planned

Needs render-time PerformanceObserver harness.

Open Graph metadata

shipped v0.5.2

tech/og-completeness ships with the credibility layer.

Title tag uniqueness

shipped v0.5.2

content/title-uniqueness — raw, not entity-masked.

H1 structure

shipped v0.5.2

content/heading-structure — presence, single-H1, hierarchy.

Image alt-text

shipped v0.5.2

content/image-alt-text — skips decorative images.

Tier-2 (workaround exists): Search Console integration, crawl-budget waste, schema-content drift, outbound-link health, search-intent alignment. Tier-3 narrow gaps and the full taxonomy are in the blind-spots spec.

What stays true regardless of which sites we audit

Cluster collapse on doorway-pattern findings. A 276-pair audit shows 1 cluster line, not 276.
Sample-seed determinism. AuditOptions.sampleSeed makes verdicts reproducible across runs (mulberry32 PRNG).
Per-bucket info-severity cap. A flood of info findings can no longer fill the 100-cap bucket and tank the verdict on its own (info findings are capped separately at 50).
Auditable severity demotions. summary.appliedSeverityDemotions lists every rule whose severity was remapped on this audit. Pass --strict to disable demotion entirely.
Sampling-aware rules. links/unreachable-from-root skips on partial samples (it can't tell isolation from sample shape).
Open-source spec, corpus, runner, regression test. Anyone can re-run our calibration and verify these claims.
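The sample-seed guarantee in the list above rests on a seeded PRNG. The sketch below uses the widely circulated public-domain mulberry32 generator and pairs it with a partial Fisher-Yates shuffle. How pseolint actually wires AuditOptions.sampleSeed into its sampler is not shown on this page, so seededSample is a hypothetical helper.

```typescript
// mulberry32: small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: draw k items without replacement using a
// partial Fisher-Yates shuffle driven by the seeded generator.
function seededSample<T>(items: T[], k: number, seed: number): T[] {
  const rand = mulberry32(seed);
  const pool = items.slice(); // never mutate the caller's array
  const n = Math.min(k, pool.length);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rand() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, n);
}
```

Because the generator's state depends only on the seed, two calls with the same URL list and the same seed (for example 1729) return the same sample in the same order, which is what makes a verdict reproducible across runs.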

Excluded sites (and why)

nerdwallet.com/best/credit-cards: blocks our user-agent (403/CAPTCHA)
numbeo.com/cost-of-living: p95 ≈ 30s under audit load; safe-mode origin gate aborts
airbyte.com/connectors: p95 > 4× baseline under audit load; safe-mode origin gate aborts
stripe.com/atlas/states: URL returns 404 in our crawl
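The "p95 > 4× baseline" condition in the exclusions above suggests a latency gate along the following lines. This is a guess at the shape of such a check, not the safe-mode origin gate itself: the function names, the nearest-rank percentile method, and the maxRatio parameter are assumptions, and any absolute latency cap is omitted because the page does not state one.

```typescript
// 95th percentile by the nearest-rank method; returns 0 for an empty list.
function p95(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) return 0;
  const sorted = latenciesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[rank - 1] ?? 0;
}

// Hypothetical gate: abort when p95 under audit load exceeds
// maxRatio times the baseline p95 measured before the audit.
function shouldAbort(
  baselineP95Ms: number,
  underLoadMs: number[],
  maxRatio: number = 4,
): boolean {
  return p95(underLoadMs) > maxRatio * baselineP95Ms;
}
```

The intent of a gate like this is to stop the audit before it degrades the origin for real visitors, which is why a slow site ends up excluded instead of audited at any cost.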

Future work

Shipped in v0.5.2: AuditOptions.authorityScore (CLI: --authority-score 0-100) — bring-your-own-DA verdict ladder shift. Pass-through to formatter callers via summary.appliedSeverityDemotions. What remains:

1. Proxy-signal detection for callers without an authority score: domain-age via WHOIS, internal-graph density, named editorial leadership presence. Lets the engine make a credible authority guess on its own when the caller doesn't pass --authority-score.
2. Classifier confidence on catalog directories. Only 1 of 9 audited reputable sites was classified as programmatic-directory at ≥70% confidence. Improving the classifier is the largest remaining lever inside the engine itself.
3. A weak-pSEO calibration corpus to balance the reputable one. Future runs would assert reputable-PASS and weak-FLAG simultaneously, preventing single-axis drift.
4. Core Web Vitals (LCP/INP/CLS) via render-mode page-load instrumentation. Currently in tier-1 of the blind-spot audit; needs a render-time PerformanceObserver harness.
5. Residential-IP rendering proxy to add Zillow, Yelp, TripAdvisor, Indeed (currently excluded for CAPTCHA/403 walls).
6. Verdict-explanation strings anchored to outcomes — "concerning means at this risk level, sites historically saw X% indexation loss after a Helpful Content rebuild" — once we have enough longitudinal data.

Verify any of this yourself

The corpus, the runner, and every per-round result are committed to the public repository. Clone it, run bun run scripts/calibration-reputable-pseo.ts, and confirm or refute the table above. We treat the runner's output as the source of truth — this page is just a dated mirror.


Sources

Google Search Central — Spam policies for Google web search — Google's spam policies are the primary reference for the eight SpamBrain-aligned rule categories that pseolint's calibration corpus tests against.
Google Search Central — Creating helpful, reliable, people-first content — Google's helpful-content guidance underpins the calibration ceiling logic: sites that demonstrably win in search are expected to meet the originality and usefulness criteria described here.
Google Search Central — Search Essentials — Google Search Essentials documents the technical requirements — canonical tags, crawlability, structured data — that several of pseolint's tech-bucket rules enforce and that the calibration data validates.