Thin content checker — find pages Google sees as filler
Find every page on your site that Google might classify as thin (low on substance, templated, or AI-padded) in a single crawl that takes about 60 seconds.
What it does
The thin-content scanner samples your sitemap and fetches up to 200 pages on the free tier, or up to 500 on manual re-audits with Pro ($19/month). It grades each page against the substance heuristics SpamBrain appears to use: visible word count against pseolint's 300-word floor (configurable per archetype), lexical uniqueness compared to sibling pages, presence of original media or sourced data, and the ratio of unique to recycled phrasing. The median check completes in 60 seconds and is powered by @pseolint/core v0.7.4 (MIT-licensed at github.com/ouranos-labs/pseolint). It produces a per-page substance score plus a domain-wide breakdown of how much of your indexable surface area falls below the practical threshold where Google starts ignoring or demoting pages.
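As a rough illustration of the visible word count heuristic, here is a minimal Python sketch that drops template chrome and counts whitespace-split tokens. The tag list and the `visible_word_count` helper are assumptions made for this example; they are not taken from @pseolint/core.

```python
from html.parser import HTMLParser

# Tags whose text is treated as template chrome or non-visible content.
# This set is an assumption for illustration, not pseolint's actual list.
SKIPPED_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def visible_word_count(html: str) -> int:
    """Count whitespace-split tokens that remain after stripping chrome."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


sample = (
    "<html><body><nav>Home About</nav>"
    "<main><p>Rosa rugosa is hardy.</p></main>"
    "<footer>Copyright 2024</footer></body></html>"
)
print(visible_word_count(sample))
```

Only the four words inside the main element are counted here; the nav and footer text is ignored.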
Why it matters
Thin content is the oldest cause of pSEO failure and the one most operators still get wrong. Helpful Content updates in 2022 and 2023 made it a ranking factor; the March 5, 2024 scaled content abuse update (https://developers.google.com/search/docs/essentials/spam-policies) made it a policy violation, and the May 7, 2024 site reputation abuse policy extended that to third-party content on parasite subdomains. The cost has changed too. Historically, a thin page just didn't rank. Today, a critical mass of thin pages can pull your entire domain's quality signal down, which means your good pages stop ranking too. The pseolint scanner uses a 300-word default floor (configurable per archetype) and weights findings as info=5, warning=12, error=25 in the overall score. If your site has a long tail of auto-generated location pages, AI-spun product variants, or templated comparison articles, the question is no longer whether some are thin but whether enough of them are thin to taint the rest.
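To make the severity weights concrete, the sketch below sums them into a penalty. The weights come from the text above; subtracting the penalty from a 100-point base and clamping at zero is purely an assumption for illustration, since the actual aggregation formula is not documented here.

```python
# Severity weights as stated above.
SEVERITY_WEIGHTS = {"info": 5, "warning": 12, "error": 25}


def weighted_penalty(severities):
    """Sum the weight of each finding's severity."""
    return sum(SEVERITY_WEIGHTS[s] for s in severities)


def overall_score(severities, start=100):
    """Hypothetical aggregation: subtract the penalty from a base of 100
    and clamp at 0. The real formula may differ."""
    return max(0, start - weighted_penalty(severities))


findings = ["info", "warning", "error"]
print(weighted_penalty(findings), overall_score(findings))
```

Under this assumed formula, four error-level findings are already enough to take a page from 100 to 0.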
How it works
- Sample up to 200 URLs from your sitemap on the free tier (500 on Pro at $19/month), with extra weight given to URL patterns that look mass-generated. Median crawl + audit time is 60 seconds.
- For each page, strip nav, footer, and template chrome to isolate the actual unique main content.
- Score each page on visible word count (post-strip), lexical diversity, sentence-level uniqueness vs sibling pages, and presence of structured data, media, or citations. By default, a page under 250 words raises an error and a page under 300 words raises a warning.
- Cross-compare pages within the same URL pattern using 64-bit SimHash fingerprints — pages clustering at a Hamming distance of 8 or less are flagged as near-duplicate, and pages with Jaccard shingle overlap above 85% are escalated as templated boilerplate.
- Surface the worst offenders first, with a substance score and a one-line diagnosis (under-length, near-duplicate, AI-padded, no unique research). The infrastructure runs on Next.js 15 with Inngest-backed background crawls so audits stay snappy even on a 500-URL Pro run.
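The near-duplicate step above can be sketched in a few lines of Python. This is a generic SimHash and Jaccard implementation under stated assumptions (three-word shingles, BLAKE2b as the 64-bit hash, unweighted features), not pseolint's actual code. Only the two thresholds, a Hamming distance of 8 or less and Jaccard overlap above 85%, come from the list above.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-word shingles (k=3 is an assumption for this sketch)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash64(text, k=3):
    """64-bit SimHash: each shingle votes +1 or -1 on every bit position."""
    counts = [0] * 64
    for sh in shingles(text, k):
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def jaccard(a, b, k=3):
    """Shingle-set overlap: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def classify_pair(a, b):
    """Apply the two thresholds named in the list above."""
    labels = []
    if hamming(simhash64(a), simhash64(b)) <= 8:
        labels.append("near-duplicate")
    if jaccard(a, b) > 0.85:
        labels.append("templated-boilerplate")
    return labels
```

Comparing every pair of pages is quadratic, which is one reason to restrict the comparison to pages that share a URL pattern, as the step describes.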
What you get
- A list of every audited page sorted by substance score, worst first.
- A domain-level percentage: what share of your sampled surface area is below the practical thin-content threshold.
- Near-duplicate clusters — groups of pages that share so much copy that Google likely canonicalizes them or drops most of the cluster.
- Word-count distribution chart so you can see whether thin pages are a long tail or a clustered template problem.
- Specific recommendations per page — whether to expand, merge, redirect, or noindex.
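The domain-level percentage and the word-count distribution are simple aggregates over the per-page counts. A minimal sketch, assuming the 300-word default floor stands in for the practical threshold and using 100-word buckets (both choices are assumptions for this example):

```python
from collections import Counter


def share_below(word_counts, floor=300):
    """Percentage of sampled pages whose word count is under the floor."""
    if not word_counts:
        return 0.0
    below = sum(1 for w in word_counts if w < floor)
    return 100.0 * below / len(word_counts)


def word_count_histogram(word_counts, bucket_size=100):
    """Map each bucket's lower bound to the number of pages in it."""
    buckets = Counter((w // bucket_size) * bucket_size for w in word_counts)
    return dict(sorted(buckets.items()))


counts = [80, 194, 250, 310, 900]  # made-up sample values
print(share_below(counts))
print(word_count_histogram(counts))
```

A histogram with one tall bucket near a single value suggests a clustered template problem, while a spread across low buckets suggests a long tail.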
FAQ
- What word count counts as thin content?
- There is no fixed Google threshold and anyone who tells you 300 words is the line is making it up. What matters is substance relative to user intent. A definitions glossary entry can rank fine at 80 words; a buyer's guide at 800 words can be thin if it's all generic platitudes. The pseolint default of 300 words is a tunable heuristic rather than a Google rule, and you can set it per archetype (200 for product comparators, 350 for guide-style hubs). Our scanner weighs word count alongside lexical uniqueness, originality, and sibling-page comparison, so a 200-word page with a unique data point beats a 1,500-word page of recycled boilerplate.
- How do you handle pages with lots of dynamic content like product listings?
- We treat structured listings (tables, product cards, schema-marked items) as substance, since they typically represent real, queryable data. The scanner downgrades pages where the only variation between siblings is a swapped variable inside otherwise identical prose — that's the pattern SpamBrain seems to weight most heavily.
- Can the scanner tell if my content is AI-generated?
- Indirectly. We don't run an AI-detection classifier (those are unreliable), but we flag the structural fingerprints that AI-spun content tends to leave: uniform paragraph length, generic transitional phrases, low entity density, and no first-person voice or sourced claims. Google has been clear that AI-generated content is fine if it's helpful: the March 5, 2024 scaled-content-abuse policy targets unhelpful content produced at scale, however it is made, rather than AI itself. The scanner finds the unhelpful kind, and pairs cleanly with the dedicated aeo/* rules pseolint ships for AI-answer-engine grounding.
- What's the fix for a page flagged as thin?
- There are four options and the right one depends on the page. Expand: add a unique data point, original quote, or genuine user-relevant detail. Merge: consolidate three near-duplicate pages into one strong canonical. Redirect: 301 to the closest substantive page. Noindex: keep it accessible to users but remove it from Google's view. Each finding suggests one of these based on the failure mode.
- Will fixing thin pages immediately recover lost rankings?
- Usually no. In the cases we have observed, recovery from a quality-signal hit takes 30 to 90 days, because Google needs time to re-crawl and re-evaluate the affected pages. The March 2024 Core Update rollout itself took 45 days to fully propagate. Fixing thin content is necessary but not sufficient: you also need Google to recrawl, re-render, and re-score, which is why you should pair content fixes with a sitemap resubmit and a few high-quality external links pointing at the recovered pages. Compared to paid alternatives like Ahrefs Site Audit at $129/month or Semrush at $139.95/month, the pseolint scanner is free for the substance check and $19/month for monitoring.
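The per-archetype floors from the first answer can be combined with the default error and warning thresholds roughly as follows. The archetype keys are hypothetical names, and keeping the error threshold 50 words under each floor (mirroring the default 250/300 split) is an assumption; how pseolint actually scales the error threshold per archetype is not stated here.

```python
# Floors from the FAQ above; the dictionary keys are hypothetical names.
ARCHETYPE_FLOORS = {
    "default": 300,
    "product-comparator": 200,
    "guide-hub": 350,
}

# Assumption: the error threshold sits 50 words under the floor,
# mirroring the default split (error below 250, warning below 300).
ERROR_GAP = 50


def word_count_severity(words, archetype="default"):
    """Return 'error', 'warning', or None for a stripped word count."""
    floor = ARCHETYPE_FLOORS.get(archetype, ARCHETYPE_FLOORS["default"])
    if words < floor - ERROR_GAP:
        return "error"
    if words < floor:
        return "warning"
    return None


print(word_count_severity(194))
print(word_count_severity(220, "product-comparator"))
```

The same 220-word page passes as a product comparator but would be an error under the default floor, which is the point of tuning per archetype.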
What a scan turns up
Quildrex Nurseries ran the thin-content scanner against their /variety/{cultivar-name} template — 1,714 pages, each auto-generated from a supplier feed. The scanner fetched 200 pages in 58 seconds, stripped the shared nav chrome and the 87-word footer block operator Brent Chodura had copied from the catalogue PDF, and scored what remained. Median visible word count landed at 194 — well below the 300-word floor. Worse, the unique-to-total lexical ratio averaged 0.22 across the cluster, meaning 78 percent of surviving text was boilerplate already flagged on sibling pages. The scanner tagged 167 URLs at error severity and 41 at warning, producing a per-page substance table sorted by ascending word count so Chodura's team could triage from the bottom up.
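A unique-to-total lexical ratio like the 0.22 reported above can be approximated as the share of a page's vocabulary that appears on none of its siblings. The sketch below assumes a vocabulary-level (set-based) ratio over lowercase whitespace tokens; the scanner's exact tokenization is not specified here, and the sample strings are made up.

```python
def unique_ratio(page_text, sibling_texts):
    """Share of this page's vocabulary not found on any sibling page."""
    vocab = set(page_text.lower().split())
    if not vocab:
        return 0.0
    sibling_vocab = set()
    for text in sibling_texts:
        sibling_vocab.update(text.lower().split())
    return len(vocab - sibling_vocab) / len(vocab)


page = "hardy rose thrives in full sun"
siblings = ["hardy tulip thrives in full sun"]
ratio = unique_ratio(page, siblings)
print(round(ratio, 2), round(1 - ratio, 2))  # unique share, boilerplate share
```

The boilerplate share is simply one minus the ratio, which is how a 0.22 reading corresponds to 78 percent recycled text.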
The remediation list the scanner returned was specific: 112 pages carried fewer than 90 stripped words, the minimum starting length from which Quildrex's growing-tips section and a care-calendar block together would reach 300. The scanner's remediation column showed each page's deficit in raw words, its boilerplate-ratio reading, and whether an existing media asset (a Thornfield Greenhouse photo or a Delphi Organics supplier certificate) was present in the HTML but un-described, so that it could count toward substance once alt text was added. Brent re-templated using that deficit list as a row-by-row work order; a re-scan ten days later moved median word count to 341 and dropped flagged URLs from 208 to 19.
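The deficit column in that work order is plain arithmetic: the floor minus the stripped word count. A minimal sketch, with hypothetical URLs and word counts and the 300-word floor from the case above:

```python
def deficit_rows(pages, floor=300):
    """Return (url, words, deficit) for pages under the floor,
    sorted by ascending word count so the thinnest pages come first."""
    rows = []
    for url, words in pages.items():
        deficit = max(0, floor - words)
        if deficit:
            rows.append((url, words, deficit))
    return sorted(rows, key=lambda row: row[1])


# Hypothetical sample pages, not data from the Quildrex scan.
pages = {"/variety/a": 194, "/variety/b": 85, "/variety/c": 341}
for url, words, deficit in deficit_rows(pages):
    print(url, words, deficit)
```

Pages already at or above the floor drop out of the list, so the output doubles as a to-do list.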
Sources
- Google Search Central, Spam policies (scaled content abuse): enforcement has been active since March 5, 2024. The scanner approximates the policy's target as per-URL vocabulary starvation: it counts whitespace-split tokens after stripping nav and footer chrome and flags URLs below a 300-word floor that is configurable per archetype, since a FAQ stub and a service directory listing carry different appropriate minimums.
- Google Search Central, Creating helpful, reliable, people-first content: the scanner approximates people-first guidance with a lexical-uniqueness ratio. Pages where fewer than 40 percent of vocabulary tokens are exclusive to that URL (not shared with sibling pages in the same cluster) are flagged as probable soft-404 candidates.
- Google Search Central, HTTP status codes, network and DNS errors (soft 404s): soft-404 assignment is how Google responds to substance-deficient 200-OK responses. The scanner's dual-gate check, a 300-word minimum plus a 60-percent ceiling on recycled phrasing, identifies URLs most at risk of landing in 'Crawled - currently not indexed' before a recrawl confirms that verdict.
- Google Search Central, Search Essentials: Search Essentials requires that each URL offer something beyond what already exists. The scanner delivers per-page word counts, per-archetype substance grades, and recycled-phrasing ratios across 200 free-tier or 500 Pro pages, giving you a concrete pre-submission checklist rather than an abstract quality rubric to interpret retroactively.
Related tools
Want every rule, not just this lens? The full audit on the homepage runs the complete SpamBrain + AEO rule set and produces the same shareable report — same backend, broader output.