Rule reference: content/translation-no-op

Translation No-Op — Locale Folders That Were Never Actually Translated

content/translation-no-op groups URLs that differ only by a leading locale segment like /en/ or /fr/, computes a 64-bit SimHash of each extracted body, and fires an error the moment any pair scores at or above 95% similarity — the fake-i18n pattern Google has told site owners to fix with real hreflang pairs, not duplicated English.

What it detects

content/translation-no-op catches a specific failure of programmatic internationalisation: a site ships /en/, /fr/, /de/ folders that look multilingual in the URL but carry the same untranslated body on every locale.

The rule reads each page's path and matches a leading locale segment with a regular expression covering two-letter codes and region variants — /en/, /fr/, /it/, /fr-ca/. Pages without a locale prefix are skipped. It strips that segment to a base path so /en/openings and /fr/openings both collapse to /openings, then buckets every locale variant under that shared base path. A bucket with fewer than two members is ignored, because one lone locale is not a translation problem.
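
The grouping step described above can be sketched as follows. This is a minimal illustration, not the rule's actual source: the regular expression, the function names `split_locale` and `bucket_by_base_path`, and the example.com URLs are all assumptions made for the example.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed pattern: a two-letter language code, optionally followed by a hyphen
# and a two-letter region, as the first path segment (/en/, /fr/, /fr-ca/).
LOCALE_PREFIX = re.compile(r"^/([a-z]{2}(?:-[a-z]{2})?)(?=/|$)", re.IGNORECASE)


def split_locale(url):
    """Return (locale, base_path), or None when the path has no locale prefix."""
    path = urlparse(url).path
    match = LOCALE_PREFIX.match(path)
    if match is None:
        return None
    base_path = path[match.end():] or "/"
    return match.group(1).lower(), base_path


def bucket_by_base_path(urls):
    """Group locale variants under their shared base path; drop lone locales."""
    buckets = defaultdict(dict)
    for url in urls:
        parts = split_locale(url)
        if parts is None:
            continue  # no recognised locale prefix: the page is skipped
        locale, base_path = parts
        buckets[base_path][locale] = url
    # A bucket needs at least two locale variants to be worth comparing.
    return {base: v for base, v in buckets.items() if len(v) >= 2}
```

Given /en/openings, /fr/openings, /fr-ca/openings, /de/endgames and /about, this sketch returns a single bucket for /openings with three variants. Note that a pattern this simple also matches any other two-letter first segment, so a real implementation may need an allow-list of locale codes.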

Within each bucket it computes a 64-bit SimHash from the extracted main content text, measures Hamming distance between every variant pair, and converts that distance to a similarity score in [0,1]. If any pair scores at or above the 0.95 threshold, the rule emits one error per cluster naming the locale count, the base path, and the exact similarity percentage so you can see how identical the variants really are.
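
A rough sketch of that comparison is shown below. The word-level tokenisation, the BLAKE2b token hash, and the linear mapping `1 - distance / 64` are assumptions chosen for illustration; the page only states that a 64-bit SimHash is computed and that Hamming distance is converted to a score in [0,1].

```python
import hashlib
import re
from itertools import combinations


def simhash64(text):
    """64-bit SimHash over lowercase word tokens (tokenisation is an assumption)."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for bit in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def similarity(a, b):
    """Map the Hamming distance between two fingerprints to a score in [0, 1]."""
    distance = bin(a ^ b).count("1")
    return 1.0 - distance / 64


def max_pair_similarity(bodies):
    """Highest similarity across every pair of locale variants in one bucket."""
    hashes = {locale: simhash64(text) for locale, text in bodies.items()}
    return max(similarity(hashes[x], hashes[y]) for x, y in combinations(hashes, 2))
```

A bucket would then be flagged when `max_pair_similarity(bodies) >= 0.95`. The function expects at least two variants, which the bucketing step already guarantees.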

Why it matters

An untranslated locale folder is worse than no locale folder at all. You have paid the full engineering cost of a multilingual URL structure and an hreflang setup, then handed search engines two or more URLs whose bodies are byte-for-byte the same — so the hreflang annotations point at pages that are not actually alternates, and Google falls back to picking one canonical and discounting the rest.

Google's own internationalisation guidance is blunt about this: hreflang exists to connect genuinely translated or regionally-adapted versions, and shipping the source language under a foreign locale tag is a known anti-pattern that wastes crawl budget and confuses the canonical signal. A /fr/ page that is 100% English is not a French page; it is a duplicate wearing a locale costume.

At scale the harm compounds. A template that generates 30 locale folders but only translates 3 of them produces 27 folders of duplicated source-language content, which reads to a classifier exactly like scaled duplication. The error severity here reflects that: this is not a soft suggestion but a structural defect that breaks the one promise a locale URL makes.

A page that fails

An international chess federation ships /en/openings/sicilian-najdorf and /fr/openings/sicilian-najdorf, both serving the same 1,400-word English explainer on the Najdorf variation: knight to f6, the poisoned-pawn line, the typical rook lift, and the endgame plans. The /fr/ URL carries a French hreflang tag but not one translated sentence; after content extraction the two bodies hit 0.98 SimHash similarity. The rule groups the two locale variants of /openings/sicilian-najdorf and fires an error: both variants share near-identical content at 98% similarity, so translate the body or consolidate to the canonical version.

A page that passes

The same federation actually translates the page. /en/openings/sicilian-najdorf keeps the English Najdorf walkthrough; /fr/openings/sicilian-najdorf is rewritten in French (la variante Najdorf, le pion empoisonné, le plan de finale) with FIDE-rating context and tournament-pairing examples localised for francophone players. After extraction the two bodies share almost no token shingles and SimHash similarity falls to 0.21, far below the 95% threshold. The rule stays silent, the hreflang pair now connects two genuinely distinct translations, and each locale ranks for searchers in its own language.

How to fix it

1. Translate the body for real, not just the title and nav: the SimHash is computed on extracted main content, so a translated heading over an English article still trips the rule at 95%.
2. If a locale was never meant to ship, delete the untranslated folder and remove its hreflang entry rather than leaving a duplicate live under a foreign tag.
3. Where you genuinely cannot translate yet, redirect every untranslated locale variant to the canonical URL and keep hreflang only on the canonical until real translations exist.
4. Audit your i18n pipeline for partial coverage: a template that translated 4 of 12 locales leaves 8 folders of duplicated source language that this rule will flag cluster by cluster.
5. Re-run after each translation pass: the rule fires once per cluster of near-identical variants, so clearing one base path does not silence the others until their bodies actually diverge.

SpamBrain context

Duplicated locale folders are a clean scaled-content tell because they are almost always machine-generated: a build step stamps out /en/, /fr/, /es/ folders from one source template and the translation job either fails silently or was never wired up. The March 5, 2024 scaled-content-abuse policy treats mass production of low-value pages as a violation independent of intent, and 27 untranslated locale folders are 27 pages a script produced without adding a word of value.

content/translation-no-op (in @pseolint/core, MIT-licensed at github.com/ouranos-labs/pseolint) deliberately reuses the same SimHash machinery spam/near-duplicate runs on, but scopes it to locale-prefixed URL pairs and raises the threshold to 0.95, well above the general near-duplicate threshold, because two locale variants of the same page should be wildly different if either was translated at all. A 95% match between an English page and its French alternate is near-conclusive proof the translation never happened.
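
To make the 0.95 bar concrete, the sketch below works out how many of the 64 fingerprint bits may differ before a pair drops under a given threshold. It assumes the similarity score is the linear mapping `1 - d / 64` of the Hamming distance `d`; the actual conversion used by the rule is not spelled out here, and `max_hamming_distance` is a hypothetical helper name.

```python
def max_hamming_distance(threshold, bits=64):
    """Largest Hamming distance d for which 1 - d/bits still meets the threshold."""
    return max(d for d in range(bits + 1) if 1 - d / bits >= threshold)


# Compare the locale-pair bar with the 85% near-duplicate bar.
locale_pair_limit = max_hamming_distance(0.95)
near_duplicate_limit = max_hamming_distance(0.85)
```

Under that assumption a locale pair fires only when at most 3 of 64 bits differ (1 - 3/64 ≈ 0.953), whereas an 85% threshold would tolerate up to 9 differing bits (1 - 9/64 ≈ 0.859).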

The rule also enforces a 30-word minimum body floor before it evaluates a cluster. Below that floor near-empty pages collapse to ~100% similarity for trivial reasons, and the real defect is thin content, not a translation no-op — so the engine routes those to spam/thin-content instead and keeps this rule's findings honest.
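
The floor check can be sketched like this. Whitespace word counting and the names `MIN_BODY_WORDS` and `cluster_clears_floor` are assumptions for illustration; the behaviour modelled is the one described in the FAQ on this page, where a cluster is skipped only if every variant falls below 30 words.

```python
MIN_BODY_WORDS = 30  # floor stated on this page


def word_count(body):
    """Count whitespace-separated words (the real tokeniser may differ)."""
    return len(body.split())


def cluster_clears_floor(bodies, minimum=MIN_BODY_WORDS):
    """True when at least one locale variant has `minimum` words or more."""
    return any(word_count(text) >= minimum for text in bodies.values())
```

A cluster of two 12-word stubs is skipped and left to spam/thin-content, while a full English page paired with a thin locale stub is still evaluated.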

Frequently asked questions

Why use a 95% threshold here when near-duplicate fires at 85%?

Because the expectation is different. Two unrelated articles at 85% similarity are suspiciously alike, but a page and its translated alternate should share almost no tokens once one is genuinely in another language: a real French translation of an English page lands well under 50% SimHash similarity. So the bar is raised to 95% to fire only on pairs that are near-identical, which for a locale pair is overwhelming evidence the body was never translated. It keeps the false-positive rate near zero on sites that do localise properly.

Our online chess school has /en/ and /fr/ lesson pages. When will this rule flag us?

Only when a /fr/ lesson body is still essentially the English one after content extraction. If your /fr/blitz-tactics page genuinely teaches zugzwang and rook endgames in French (different words, different examples, a localised FIDE-rating ladder), the SimHash between it and /en/blitz-tactics drops far below 95% and the rule stays silent. It fires when a build shipped, say, 18 locale folders but the translation service only completed 5, leaving the other 13 as English bodies under foreign hreflang tags. In one illustrative cleanup a school closed that gap across 13 folders within 6 weeks and recovered roughly 31% of lost non-English impressions.

Does swapping the page title and breadcrumbs into French satisfy the rule?

No, and that is the most common false fix. The SimHash is computed on the extracted main content body, not the chrome, so translating the title, nav, and breadcrumbs while leaving a 1,400-word English article untouched still lands above the 95% threshold and still fires. The rule is measuring whether the substance was translated, not the wrapper. Translate the article itself; a localised header over English prose is exactly the fake-i18n pattern the rule exists to catch.

What if a locale page is short, like a 12-word stub?

It is skipped. The rule enforces a 30-word minimum body floor per cluster: if every variant in a base-path group falls below 30 words, pairwise SimHash similarity collapses toward 100% for trivial reasons that have nothing to do with translation, so the cluster is ignored here. Those near-empty pages are a thin-content problem instead, and the engine surfaces them through spam/thin-content. A cluster still evaluates if at least one variant clears the floor, since one full English page against a thin locale stub genuinely is a translation gap worth flagging.

How does the rule decide which URLs belong to the same locale group?

It strips the leading locale segment from each path and groups by what remains. A regular expression matches a two-letter language code or a language-region pair at the start of the path (/en/, /fr/, /pt-br/) and removes it, so /en/openings and /fr/openings both reduce to the base path /openings and land in the same bucket. URLs without a recognised locale prefix are never grouped, and a base path with only one locale variant is skipped because a single locale cannot be a translation no-op. Only buckets with two or more locale variants are compared.

Sources

Google Search Central — Tell Google about localized versions (hreflang) — Hreflang is intended to route searchers to a genuinely localised version of a page — content/translation-no-op exposes when that intent is absent: locale folders like /fr/ or /fr-ca/ that score 95% or higher SimHash similarity against their /en/ counterpart are duplicate English masquerading as translated content, making hreflang annotations misleading.
Google Search Central — Spam policies: scaled content abuse — Scaled-content-abuse policy covers any high-volume production method that yields pages with little added value; locale-folder duplication is a scaled pattern — the same body republished under N language prefixes at once — and the 95% SimHash ceiling is the rule's operationalisation of that low-value threshold.
Google Search Central — Creating helpful, reliable, people-first content — People-first guidance requires that content serve the actual audience of the page; a /de/ URL whose body is byte-for-byte English fails that test, and content/translation-no-op's SimHash check surfaces those failures before Googlebot assigns them to the wrong regional audience.
