Rule reference: content/wikipedia-paraphrase

Wikipedia Paraphrase — When Your Page Is Just the Encyclopedia, Reworded

content/wikipedia-paraphrase fires a low-confidence warning when a page shares 40% or more of its three-word phrases with a bundled Wikipedia reference corpus. The 40% threshold is pseolint's own heuristic for the point at which a URL reads as reworded encyclopedia rather than the original analysis that Google's helpful-content framing, and the March 2024 update, reward.

What it detects

content/wikipedia-paraphrase asks one narrow question of each page: how much of your prose is just Wikipedia, lightly reworded? The rule tokenises the main content text — lower-cased, punctuation stripped, split on whitespace — and slides a three-word window across it to produce a list of trigrams. Each trigram is checked against a bundled Wikipedia reference corpus stored as a compact bloom filter (65,536 bits, 3 FNV-1a hash functions, roughly a 5% false-positive rate over about 10,000 curated trigrams). The paraphrase rate is the fraction of a page's trigrams that hit the corpus.
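The pipeline described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the engine's code: the filter size and hash count come from the text, while the seeding scheme (XORing a seed into the FNV-1a offset basis), the punctuation regex, and every function and class name are assumptions made for the example.

```python
import re

BLOOM_BITS = 65_536   # filter size quoted in the text
NUM_HASHES = 3        # hash count quoted in the text


def fnv1a_32(data: bytes, seed: int = 0) -> int:
    """32-bit FNV-1a. XORing a seed into the offset basis is an assumption
    used here to derive several hash functions from one algorithm."""
    h = (0x811C9DC5 ^ seed) & 0xFFFFFFFF
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


class BloomFilter:
    """Minimal bloom filter: membership tests may give false positives,
    never false negatives."""

    def __init__(self, bits: int = BLOOM_BITS, hashes: int = NUM_HASHES):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item: str):
        data = item.encode("utf-8")
        return [fnv1a_32(data, seed) % self.bits for seed in range(self.hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def tokenise(text: str):
    # Lower-case, strip punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def trigrams(tokens):
    # Slide a three-word window across the token list.
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def paraphrase_rate(text: str, corpus: BloomFilter) -> float:
    """Fraction of the page's trigrams that hit the corpus filter."""
    grams = trigrams(tokenise(text))
    if not grams:          # fewer than three tokens: score zero
        return 0.0
    return sum(1 for g in grams if g in corpus) / len(grams)
```

To try it, build a toy corpus by calling `corpus.add(...)` on the trigrams of a reference sentence, then pass any page text to `paraphrase_rate`. The real corpus ships prebuilt; this sketch only shows the shape of the check.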

When that rate reaches the 0.40 threshold, the rule emits one finding per qualifying page at warning severity and low confidence, reporting the exact overlap percentage so you can sort worst-first. Pages with fewer than three tokens score zero and are skipped. The framing is deliberate: paraphrased encyclopedic content adds nothing original to the web, so a page that is 40% recycled Wikipedia phrasing is, for ranking purposes, a page that already exists.
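The decision step can be sketched as follows. The threshold, severity, confidence and minimum token count are taken from the text; the `Finding` shape, the `(url, token_count, rate)` input tuple and the function names are hypothetical and are not pseolint's actual output format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

THRESHOLD = 0.40   # paraphrase rate at which the rule fires
MIN_TOKENS = 3     # pages with fewer tokens score zero and are skipped


@dataclass
class Finding:
    # Hypothetical record shape, for illustration only.
    url: str
    rule: str
    severity: str
    confidence: str
    overlap_pct: float


def evaluate(url: str, token_count: int, rate: float) -> Optional[Finding]:
    """Return one finding for a qualifying page, otherwise None."""
    if token_count < MIN_TOKENS or rate < THRESHOLD:
        return None
    return Finding(url, "content/wikipedia-paraphrase",
                   "warning", "low", round(rate * 100, 1))


def audit(pages: Iterable[Tuple[str, int, float]]) -> List[Finding]:
    """Evaluate every page and sort the findings worst-first."""
    findings = []
    for url, token_count, rate in pages:
        finding = evaluate(url, token_count, rate)
        if finding is not None:
            findings.append(finding)
    return sorted(findings, key=lambda f: f.overlap_pct, reverse=True)
```

Note that the comparison is inclusive: a page at exactly 0.40 produces a finding, matching "reaches the 0.40 threshold".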

The heuristic is honest about its limits — it is a low-confidence signal precisely because the corpus is bundled and finite, and a page about a genuinely encyclopedic subject can share common phrasing without copying anything.

Why it matters

A page can be accurate, well-written, and completely worthless to search at the same time. If everything it says is the Wikipedia article on the subject rephrased, it earns no slot of its own — Google already indexes the source, and the reworded copy adds nothing a searcher could not get upstream. That is exactly the 'made to help search engines, not people' shape the helpful-content framing targets, and recycled encyclopedic prose is one of its cleanest tells.

This rule is orthogonal to content/regurgitated-content. That rule asks whether your pages repeat each other; this one asks whether your page repeats the encyclopedia. A site can pass every internal-duplication check and still be a thin gloss over Wikipedia on every URL — the overlap is with an external source the other rules never see. A 40% trigram match does not prove plagiarism, and the rule never claims it does; it claims the page reads like the encyclopedia, and asks you to look.

The cost of ignoring it is slow. Reworded-reference pages rarely trigger a hard action; they simply never rank, sitting unseen for months while you wonder why traffic flatlined. The fix, replacing borrowed phrasing with first-hand observation, is also what makes the page worth visiting.

A page that fails

An amateur paleontology site publishes /fossils/ammonite as a 700-word page. The opening 300 words are the Wikipedia 'Ammonite' article reworded: the Devonian-to-Cretaceous range, the chambered shell and siphuncle, the suture-line classification, all rephrased sentence by sentence with no first-hand content. The trigram check returns a 47% overlap against the bundled corpus and the rule fires a warning: the page is the encyclopedia in different words, so a searcher gains nothing by clicking it over Wikipedia itself.

A page that passes

The same /fossils/ammonite page, rewritten from the collector's own field notes. It opens with the specific roadcut where the author pulled three ammonites from a grey shale sediment layer over two weekends, the exact matrix hardness that needed an air-scribe to prep, the iridescent nacre that survived on one specimen and not the others, and a measured 84-millimetre diameter with a photo scale. Encyclopedic background drops to two linked sentences. Trigram overlap falls to 12%, well under the 40% threshold, and the page clears, because almost none of it exists on Wikipedia.

How to fix it

1. Lead with first-hand observation the encyclopedia cannot have: the dig site, the exact sediment layer, the prep tools, the measured dimensions of your actual specimen.
2. Replace reworded background with two or three linked sentences, then send the reader to Wikipedia for the textbook taxonomy rather than rephrasing it on your page.
3. Add page-specific facts that exist nowhere else: your matrix-removal technique, the failed prep that cracked a trilobite, the locality coordinates, the date you collected it.
4. Photograph and describe your own material. A theropod tooth you found, scaled and lit, is content no corpus contains; a reworded description of theropod dentition is not.
5. Re-run the audit and sort by overlap percentage. Work through the highest-overlap pages first, starting with anything above 45%: those carry the most reference phrasing and need the most original substance grafted in.
6. Treat the warning as a prompt, not a verdict. On a legitimately encyclopedic topic the heuristic can over-fire, so confirm the page actually reads as reworded Wikipedia before rewriting it.

SpamBrain context

Google's quality systems have penalised 'no added value' content for over a decade. Copied or thinly reworded reference material was a Lowest-quality marker in the Search Quality Rater Guidelines long before the helpful-content era, and the March 5, 2024 scaled-content-abuse update made it enforceable at scale by naming pages with 'little unique value' regardless of how they were produced. A page that is 40% reworded Wikipedia is the textbook case: useful information, zero originality.

content/wikipedia-paraphrase (in @pseolint/core, MIT-licensed at github.com/ouranos-labs/pseolint) is a deliberately standalone originality signal. Where spam/near-duplicate compares your pages against each other with SimHash and content/unique-value counts page-exclusive vocabulary within the audit, this rule reaches outside the crawl entirely, comparing each page's trigrams against a bundled Wikipedia corpus. That external reach is what makes it orthogonal — and what makes it low confidence. The corpus is finite and bundled, so it cannot see every Wikipedia article, and trigram overlap measures phrasing, not intent.

The rule cannot tell paraphrase from coincidence with certainty. It flags pages that statistically read like reworded encyclopedia and asks you to judge whether they are — which is why it ships as a warning at low confidence, never as an error.

Frequently asked questions

Does a 40% overlap mean Google will penalise my page?

No. The 40% threshold is pseolint's heuristic for 'this reads like reworded Wikipedia', not a Google penalty line. The rule fires at warning severity and low confidence precisely because trigram overlap is suggestive, not conclusive. A high overlap means a searcher likely gains nothing from your page over the encyclopedia itself, so it is worth rewriting, but it is a prompt to look, not proof of plagiarism or a manual action.

Why use trigram overlap instead of a real plagiarism checker?

A full plagiarism check would compare against the live web and cost far more than a 60-second audit can spend. Trigrams (sliding three-word windows) checked against a bundled bloom filter run in milliseconds with no network call and roughly a 5% false-positive rate. It is a cheap, deterministic proxy: it catches the shape of reworded reference text without the expense of a true semantic comparison, which is the right trade-off for a fast standalone signal.

What is the bundled Wikipedia corpus and what does it miss?

It is a curated set of about 10,000 Wikipedia trigrams stored as an 8-kilobyte bloom filter inlined in the engine, so the rule works in any runtime with no filesystem or network dependency. Because it is finite and bundled, it does not cover all of Wikipedia: a page reworded from an article outside the corpus may score low and pass. The rule is a low-confidence net for common encyclopedic phrasing, not an exhaustive copy detector.

I run a fossil-collecting site about real species. Won't every page trip this?

Only if you rework the encyclopedia. Naming a Cretaceous ammonite or describing theropod anatomy in common phrasing can nudge the overlap up, but the threshold is 40%, and a page built from your own field notes clears it easily. The cure is first-hand substance: the locality you collected at, the sediment layer and matrix, your prep method, the measured size of your specimen. That material exists on no Wikipedia page, so it drives overlap down regardless of how encyclopedic the species is.

How is this different from the regurgitated-content rule?

content/regurgitated-content asks whether your pages repeat each other; content/wikipedia-paraphrase asks whether your page repeats Wikipedia. They are orthogonal: a site can pass every internal-duplication check and still be a thin gloss over the encyclopedia on every URL, because that overlap is with an external source the other rules never compare against. Run both: one guards against self-duplication, the other against reworded reference material.
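The bloom filter figures quoted in these answers (an 8-kilobyte filter, about 10,000 trigrams, roughly a 5% false-positive rate) can be sanity-checked with the standard approximation for bloom filter false positives, (1 - e^(-kn/m))^k. The sketch below assumes 3 hash functions and 65,536 bits, the parameters given in the rule description; the function names are illustrative, and the second helper assumes false positives are independent of true hits.

```python
import math


def bloom_false_positive_rate(bits: int, hashes: int, items: int) -> float:
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-hashes * items / bits)) ** hashes


def observed_overlap(true_overlap: float, fp_rate: float) -> float:
    """Overlap the filter would report if false positives only affect
    trigrams that are not genuinely in the corpus (an assumption)."""
    return true_overlap + (1.0 - true_overlap) * fp_rate


fp = bloom_false_positive_rate(bits=65_536, hashes=3, items=10_000)
size_kb = 65_536 / 8 / 1024

print(f"filter size: {size_kb} KB")           # 65,536 bits is 8 KB
print(f"false-positive rate: {fp:.3f}")       # roughly 0.05
print(f"reported overlap for a page with no real overlap: "
      f"{observed_overlap(0.0, fp):.3f}")
```

Under those assumptions, even a page with no genuine corpus overlap would be expected to report a few percent, which is one more reason the rule treats its result as a low-confidence prompt rather than a measurement.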

Sources

Google Search Central — Spam policies: scaled content abuse — Scaled-content-abuse policy explicitly names republishing existing information as a production method that yields little added value; content/wikipedia-paraphrase operationalises that risk at a 40% trigram-overlap threshold against a bloom-filter-encoded reference corpus, the point at which lightly reworded encyclopedia prose becomes statistically distinguishable from original analysis.
Google Search Central — Creating helpful, reliable, people-first content — People-first guidance asks whether a page delivers insights or synthesis that readers could not easily find elsewhere; a 40%-trigram match against the Wikipedia reference corpus — checked via a 65,536-bit bloom filter with 3 FNV-1a hashes — is the rule's operationalisation of that originality test.
Schema.org — full hierarchy of structured-data types — Schema.org vocabulary — Article, DefinedTerm, AboutPage — signals topical authority that Wikipedia paraphrase directly undercuts; content/wikipedia-paraphrase surfaces the trigram-overlap gap so publishers can decide whether to add structured authorship and sourcing before deploying entity-definition pages at scale.
