How pseolint's verdicts are calibrated.
Most SEO audit tools rely on face validity — "these rules look like they map to documented Google policy." pseolint adds predictive validity: we audit a curated corpus of in-production pSEO sites that demonstrably win in search, and treat any deviation between our verdict and the site's real-world ranking success as a bug in our engine, not in the site.
// v0.6 pipeline
How v0.6 audits work — two-phase template pipeline
v0.6 pivots the unit of analysis from URL to template. A 100k-URL directory no longer averages findings across a flat 200-URL sample — instead, each template cluster is audited independently and produces its own verdict.
Phase 1 — Template detection
Cluster sitemap URLs by signature (e.g. /listing/:slug). Filter to clusters with ≥1% coverage of total discovered URLs and ≥5 URLs. Require ≥2 surviving templates to activate the v0.6 path — single-template sites fall through to the legacy per-URL view (spec §15.3). Cost: ~T HTTP fetches (T = template count, typically 5–20). Cheap.
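The Phase 1 filter can be sketched in a few lines. This is a minimal illustration, not pseolint's `clusterUrlTemplates()`: the signature heuristic (keep the first path segment, replace every later segment with `:slug`) and the names `signatureOf` and `detectTemplates` are assumptions made for the example. Only the three thresholds come from the text above.

```typescript
// Hypothetical sketch of Phase 1 filtering; not the engine's real implementation.
interface Template {
  signature: string;
  totalUrls: number;
}

// Assumed heuristic: "/listing/blue-widget" -> "/listing/:slug".
function signatureOf(url: string): string {
  const segments = new URL(url).pathname.split("/").filter(Boolean);
  if (segments.length === 0) return "/";
  const rest = segments.slice(1).map(() => ":slug");
  return "/" + [segments[0], ...rest].join("/");
}

// Returns surviving templates, or null when the legacy per-URL view applies.
function detectTemplates(urls: string[]): Template[] | null {
  const counts = new Map<string, number>();
  for (const url of urls) {
    const sig = signatureOf(url);
    counts.set(sig, (counts.get(sig) ?? 0) + 1);
  }
  const survivors: Template[] = [];
  counts.forEach((totalUrls, signature) => {
    // Thresholds from the text: at least 5 URLs and at least 1% coverage.
    if (totalUrls >= 5 && totalUrls / urls.length >= 0.01) {
      survivors.push({ signature, totalUrls });
    }
  });
  // At least 2 survivors are required to activate the template path.
  return survivors.length >= 2 ? survivors : null;
}
```

A site whose sitemap yields only one surviving cluster returns `null` here, which corresponds to the single-template fall-through described above.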
Phase 2 — Per-template deep audit
Stratified-sample K=10 URLs per template (monitoring runs) or K=20 (manual re-audits). Run all 32 rules on each sample set. Compute per-template risk, verdict, and variance metric. Total fetches: T × K — typical T=8, K=10 = 80 fetches (vs 200 in v0.5 flat sampling) with full template coverage.
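A seeded stratified sample might look like the sketch below. The page names mulberry32 as the PRNG behind sample-seed determinism; the stratification scheme here (sort the template's URLs, cut them into K contiguous slices, pick one URL per slice) is an assumption for illustration, as is the function name `stratifiedSample`.

```typescript
// mulberry32: a small seeded 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed stratification: K contiguous slices of the sorted URL list,
// one seeded pick per slice. Same seed and same input give the same sample.
function stratifiedSample(urls: string[], k: number, seed: number): string[] {
  const sorted = [...urls].sort();
  if (sorted.length <= k) return sorted;
  const rand = mulberry32(seed);
  const picks: string[] = [];
  for (let i = 0; i < k; i++) {
    const start = Math.floor((i * sorted.length) / k);
    const end = Math.floor(((i + 1) * sorted.length) / k);
    picks.push(sorted[start + Math.floor(rand() * (end - start))]);
  }
  return picks;
}

// Fetch budget from the text: T templates times K samples each.
const fetchBudget = (templateCount: number, k: number): number => templateCount * k;
```

With the typical values quoted above, `fetchBudget(8, 10)` gives the 80 fetches the text mentions.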
Aggregation — site verdict
Site verdict = worst template verdict, filtered to templates with ≥5% URL coverage (spec §15.1). A tiny /about page at critical doesn't tank the site. A /listing/* template covering 97% of the site does. Templates below 5% still appear in the dashboard drill-down; they just don't drive the headline.
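The aggregation rule is small enough to sketch directly. The ladder order (ready, caution, concerning, critical) is inferred from how verdicts are used elsewhere on this page. The type names, the coverage denominator (sum of URLs across the listed templates), and the `ready` default when no template qualifies are assumptions for the example.

```typescript
// Hypothetical sketch of site-verdict aggregation; names are illustrative.
type Verdict = "ready" | "caution" | "concerning" | "critical";
const LADDER: Verdict[] = ["ready", "caution", "concerning", "critical"];

interface TemplateResult {
  signature: string;
  totalUrls: number;
  verdict: Verdict;
}

function siteVerdict(templates: TemplateResult[]): Verdict {
  const total = templates.reduce((sum, t) => sum + t.totalUrls, 0);
  let worst = 0; // index into LADDER; assumed default is "ready"
  for (const t of templates) {
    // Only templates with at least 5% URL coverage drive the headline.
    if (total > 0 && t.totalUrls / total >= 0.05) {
      worst = Math.max(worst, LADDER.indexOf(t.verdict));
    }
  }
  return LADDER[worst];
}
```

In this sketch a critical template at 1% coverage is ignored for the headline, while a caution template at 97% coverage sets it.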
Variance metric — uniformity score
For each template: uniformity = 1 - mean(stdev(per-rule fire-rates)). High uniformity (≥0.7) = every sample has the same problems — template is broken uniformly, one structural fix helps all N pages. Low uniformity = problem is data-quality-dependent, not template-structural. Surfaces in the template card as a colour-coded bar (green / yellow / red at 0.7 / 0.4 thresholds).
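The formula leaves open what the standard deviation is taken over. The sketch below assumes one row per rule, each row holding that rule's fire-rate on every sampled URL, with a population standard deviation per row. That reading, and the names `uniformityScore` and `uniformityBand`, are assumptions; the 0.7 and 0.4 thresholds come from the text.

```typescript
// Population standard deviation of a list of numbers.
function populationStdev(values: number[]): number {
  if (values.length === 0) return 0;
  const mean = values.reduce((s, v) => s + v, 0) / values.length;
  const variance =
    values.reduce((s, v) => s + (v - mean) * (v - mean), 0) / values.length;
  return Math.sqrt(variance);
}

// Assumed input shape: fireRatesByRule[rule][sample].
// uniformity = 1 - mean over rules of stdev(that rule's fire-rates).
function uniformityScore(fireRatesByRule: number[][]): number {
  if (fireRatesByRule.length === 0) return 1;
  const stdevs = fireRatesByRule.map(populationStdev);
  const meanStdev = stdevs.reduce((s, v) => s + v, 0) / stdevs.length;
  return 1 - meanStdev;
}

// Colour bands at the 0.7 / 0.4 thresholds from the text.
function uniformityBand(score: number): "green" | "yellow" | "red" {
  if (score >= 0.7) return "green";
  if (score >= 0.4) return "yellow";
  return "red";
}
```

For example, a rule that fires on every sample contributes a standard deviation of 0, while a rule that fires on exactly half the samples contributes 0.5 under this reading.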
sitemap URLs (100k)
│
▼ Phase 1 — Template detection (~T fetches)
clusterUrlTemplates()
│ filter: ratio ≥ 1%, count ≥ 5, ≥ 2 survivors
▼
Template[] { signature, totalUrls }
│
▼ Phase 2 — Per-template deep audit (T × K fetches)
for each template:
sample K=10 URLs → fetch + parse → run 32 rules
compute: risk, verdict, uniformityScore, topDriver
│
▼ Aggregation
siteVerdict = worst(templates where coverage ≥ 5%)
AuditResult { templates[], findings[], verdict, risk }
// documentation note
Read this as engineering reference. Not a testimonial page.
- These sites
- Public pSEO sites we audited without permission. Not pseolint customers. Have not endorsed pseolint, and we don't imply they have. Picked because they demonstrably win in search — useful as a ground-truth calibration target.
- These numbers
- Point-in-time engine-validation data for skeptical engineers. Not customer-success metrics. When pseolint has actual customer recovery stories, they will live elsewhere with explicit consent and attribution.
- A pass means
- Our rules don't false-positive on shapes that high-authority sites successfully ship. A fail often means the engine correctly identified a real issue (duplicate titles, redirect chains, missing OG tags) that the site can absorb because of authority — not that the engine is wrong. See "How to read our verdict" below.
- License
- MIT. github.com/ouranos-labs/pseolint
- Reproducibility
- --sample-seed 1729: same seed, same audit, same verdict
- Verifiability
- bun run scripts/calibration-reputable-pseo.ts: reruns against the same corpus
- Limitations
- Documented inline. Trade-offs and blind spots, with roadmap fixes, in the sections below.
Last calibrated: · Engine: v0.6.3 · Ruleset version 12 · Sample seed 1729
Next refresh target: — quarterly cadence. Numerical results below are point-in-time; they will drift as sites redesign or as our engine evolves. Methodology and trade-offs (the durable claims) stay stable across refreshes.
The reputable-pSEO corpus
Twelve programmatic-SEO sites curated to span verticals (integration directories, currency-pair converters, template galleries, category directories, city-level cost-of-living indices) and ranking strength. Every entry has a documented "ground-truth ceiling" — the verdict pseolint must produce *or better* for the engine to be considered correctly calibrated. The corpus, the runner, and the regression tests are open-source.
Snapshot results
Captured . Subject to drift; re-run any time.
zapier.com/apps/slack/integrations
integration directory
Real finding: tech/canonical-consistency × 8 (mixed error+info) on integration pages with tracking parameters.
g2.com/categories
software directory
wise.com/us/currency-converter
currency pair
webflow.com/templates
template gallery
typeform.com/templates
template gallery
Real finding: content/title-uniqueness × 6 — multiple template gallery cards share the exact same title. Google ranks Typeform despite this because of authority; lower-DA operators with the same pattern would be demoted.
segment.com/integrations
integration directory
Real findings: spam/boilerplate-ratio × 25 (warning) + tech/redirect-chain × 25 (warning). Cumulative impact on integrity bucket pushes critical.
jasper.ai/templates
template gallery
ramp.com/spend-management
category directory
Four more corpus sites (nerdwallet, numbeo, airbyte, stripe.com/atlas) are excluded from the snapshot because they consistently fail to fetch under our crawler; see "Excluded sites (and why)" below.
Three sites score worse than their ceiling on the snapshot date. In every case, the drivers are real findings, not engine noise:
- Zapier (concerning): tech/canonical-consistency fires on integration pages with tracking parameters that don't canonicalise back to the parameter-free URL. Real tech-debt.
- Typeform (critical): content/title-uniqueness fires on 6 template-gallery cards that share the exact same title (templates with different content but identical heading text). The new title-uniqueness rule (v0.5.2) caught this. A site without Typeform's authority would lose rankings on those duplicate-titled pages.
- Segment (critical): spam/boilerplate-ratio and tech/redirect-chain both fire on 25 of 41 sampled pages. Catalog-shape findings, but real ones.
We did NOT update the ceiling on these sites to force a pass. That would be calibration laundering. The fail status reflects the engine's honest reading; what these sites get away with is tied to authority, not content quality. Pass --authority-score 80 when auditing these sites and the verdict shifts one tier lenient (concerning → caution; critical → concerning). That is the bring-your-own-DA mechanism documented under Limitations below.
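The one-tier shift can be pictured as a clamped move along the verdict ladder. This sketch covers only the shift itself. The page does not state which authority scores trigger a shift, so no threshold is modelled, and the names `VERDICT_ORDER` and `shiftVerdict` are hypothetical.

```typescript
// Hypothetical sketch of a verdict-ladder shift; not the engine's code.
type VerdictName = "ready" | "caution" | "concerning" | "critical";
const VERDICT_ORDER: VerdictName[] = ["ready", "caution", "concerning", "critical"];

// Negative tiers shift lenient, positive tiers shift strict.
// The result is clamped, so "ready" cannot become more lenient (an assumption).
function shiftVerdict(verdict: VerdictName, tiers: number): VerdictName {
  const index = VERDICT_ORDER.indexOf(verdict) + tiers;
  const clamped = Math.min(VERDICT_ORDER.length - 1, Math.max(0, index));
  return VERDICT_ORDER[clamped];
}
```

Calling `shiftVerdict("critical", -1)` reproduces the critical → concerning move described above.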
Trade-offs we accepted
Calibration is two-axis. We measured both.
+ reputable-pSEO axis
Pass rate on reputable pSEO (sites scoring at or better than their ceiling) went from 33% → 78% over 9 calibration rounds. Cluster collapse killed the 276-line doorway noise. Sample-seed made verdicts reproducible. Info findings can't tank a verdict on their own.
− borderline-quality axis
On the secondary weak-pSEO dogfood corpus, two sites previously predicted caution now score ready: wordpress.com (defensible: polished marketing site) and expatistan.com (3-page sample artifact). One regression in the other direction: nextjs.org went ready → caution on real cross-domain canonical findings.
The primary failure mode of v0.5.1 was false-positives on real working pSEO sites — which destroyed credibility with the audience that actually uses the tool. The new false negatives are at the verdict-ladder boundary; neither says "you're great" while genuinely being a spam farm.
Full discussion in the v0.5.2 CHANGELOG's Trade-offs (read this) section.
Limitations: domain authority is a blind spot
pseolint does not measure domain authority. The engine is a static-content + link-graph analyzer. It reads what your pages say, how they link, and how they nest. It does not check backlinks, brand mentions, age, or any external trust signal that Google's quality systems weight heavily. There is no DA/DR/Moz/Ahrefs integration, by design, because pseolint is meant to be runnable offline against a build directory or a local dev server with no third-party API dependency.
This matters for how to read the verdict. The reputable-pSEO calibration corpus is biased toward high-authority domains — Zapier, G2, Wise, NerdWallet, Webflow are all well-established brands with strong backlink profiles. Their 200-word integration pages or template-gallery cards rank because they are those brands. The same content shape on a 6-month-old startup would be treated very differently by Google.
So:
- If you operate at a comparable authority tier to the corpus (established brand, strong backlink profile, named editorial leadership), treat the verdict literally: a caution or worse is genuinely worth investigating.
- If you are a newer or lower-authority operator, treat the verdict as a directional minimum, not a literal ceiling. Even a ready verdict on a structurally-fine page may not rank if your domain hasn't earned the authority to overcome thin or templated content. Conversely, fixing the issues pseolint flags is a necessary but not sufficient condition for ranking.
We intentionally do not fold third-party authority APIs into the engine because (a) it would make the tool dependent on paid SaaS, (b) the metrics differ across providers (Moz DA vs Ahrefs DR vs Semrush AS), and (c) Google uses signals none of those approximate well. A bring-your-own-authority option (pass a normalized 0-100 authority score and have the verdict ladder adjust accordingly) shipped in v0.5.2. Proxy-signal detection (domain age, internal-graph density, named editorial leadership) for callers without external data remains future work; see Future work below.
Tier-1 blind spots
What we don't detect, and where it's on the roadmap.
Domain authority
shipped v0.5.2 · --authority-score lets callers shift verdict for high/low-DA tiers.
Core Web Vitals (LCP/INP/CLS)
roadmap v0.6 · Needs render-time PerformanceObserver harness.
Open Graph metadata
shipped v0.5.2 · tech/og-completeness ships with the credibility layer.
Title tag uniqueness
shipped v0.5.2 · content/title-uniqueness: raw, not entity-masked.
H1 structure
shipped v0.5.2 · content/heading-structure: presence, single-H1, hierarchy.
Image alt-text
shipped v0.5.2 · content/image-alt-text: skips decorative images.
Tier-2 (workaround exists): Search Console integration, crawl-budget waste, schema-content drift, outbound-link health, search-intent alignment. Tier-3 narrow gaps and the full taxonomy are in the blind-spots spec.
What stays true regardless of which sites we audit
- Cluster collapse on doorway-pattern findings. A 276-pair audit shows 1 cluster line, not 276.
- Sample-seed determinism. AuditOptions.sampleSeed makes verdicts reproducible across runs (mulberry32 PRNG).
- Per-bucket info-severity cap. A flood of info findings can no longer fill the 100-cap bucket and tank the verdict on its own (cap separately at 50).
- Auditable severity demotions. summary.appliedSeverityDemotions lists every rule whose severity was remapped on this audit. Pass --strict to disable demotion entirely.
- Sampling-aware rules. links/unreachable-from-root skips on partial samples (it can't tell isolation from sample shape).
- Open-source spec, corpus, runner, regression test. Anyone can re-run our calibration and verify these claims.
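One way to picture cluster collapse is to treat each pairwise doorway finding as an edge between two pages and report connected components instead of edges. The union-find below is an assumed mechanism for illustration, not a description of pseolint's internals, and `collapsePairs` is a hypothetical name. As a worked case, 24 mutually similar pages produce 24 × 23 / 2 = 276 pairs, which collapse into a single cluster.

```typescript
// Hypothetical sketch: collapse pairwise findings into clusters (union-find).
function collapsePairs(pairs: Array<[string, string]>): string[][] {
  const parent = new Map<string, string>();

  // Find the cluster representative for a page, registering it if new.
  const find = (page: string): string => {
    if (!parent.has(page)) parent.set(page, page);
    let root = page;
    while (parent.get(root) !== root) root = parent.get(root) as string;
    // Path compression: point every visited node straight at the root.
    let current = page;
    while (current !== root) {
      const next = parent.get(current) as string;
      parent.set(current, root);
      current = next;
    }
    return root;
  };

  // Union the two pages of every pairwise finding.
  pairs.forEach(([a, b]) => {
    const rootA = find(a);
    const rootB = find(b);
    if (rootA !== rootB) parent.set(rootA, rootB);
  });

  // Group pages by representative; each group is one reported cluster line.
  const clusters = new Map<string, string[]>();
  Array.from(parent.keys()).forEach((page) => {
    const root = find(page);
    const members = clusters.get(root) ?? [];
    members.push(page);
    clusters.set(root, members);
  });
  return Array.from(clusters.values());
}
```

Reporting one line per component keeps the finding visible while removing the quadratic blow-up in output that pairwise reporting causes.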
Excluded sites (and why)
- nerdwallet.com/best/credit-cards: blocks our user-agent (403/CAPTCHA)
- numbeo.com/cost-of-living: p95 ≈ 30s under audit load; safe-mode origin gate aborts
- airbyte.com/connectors: p95 > 4× baseline under audit load; safe-mode origin gate aborts
- stripe.com/atlas/states: URL returns 404 in our crawl
Future work
Shipped in v0.5.2: AuditOptions.authorityScore (CLI: --authority-score 0-100) — bring-your-own-DA verdict ladder shift. Pass-through to formatter callers via summary.appliedSeverityDemotions. What remains:
- 1. Proxy-signal detection for callers without an authority score: domain-age via WHOIS, internal-graph density, named editorial leadership presence. Lets the engine make a credible authority guess on its own when the caller doesn't pass --authority-score.
- 2. Classifier confidence on catalog directories. Only 1 of 9 audited reputable sites was classified as programmatic-directory at ≥70% confidence. Improving the classifier is the largest remaining lever inside the engine itself.
- 3. A weak-pSEO calibration corpus to balance the reputable one. Future runs would assert reputable-PASS and weak-FLAG simultaneously, preventing single-axis drift.
- 4. Core Web Vitals (LCP/INP/CLS) via render-mode page-load instrumentation. Currently in tier-1 of the blind-spot audit; needs a render-time PerformanceObserver harness.
- 5. Residential-IP rendering proxy to add Zillow, Yelp, TripAdvisor, Indeed (currently excluded for CAPTCHA/403 walls).
- 6. Verdict-explanation strings anchored to outcomes, such as "concerning means at this risk level, sites historically saw X% indexation loss after a Helpful Content rebuild", once we have enough longitudinal data.
Verify any of this yourself
The corpus, the runner, and every per-round result are committed to the public repository. Clone it, run bun run scripts/calibration-reputable-pseo.ts, and confirm or refute the table above. We treat the runner's output as the source of truth — this page is just a dated mirror.